The Simplest Multithreaded Crawler Example

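I wrote this example mainly to get a feel for how multithreading is used and how it runs. Crawlers are a textbook use case for threads, and almost every crawler uses them in some form: there are far too many pages to fetch one at a time, and if the data set is large and nothing runs concurrently, the wait quickly becomes unbearable.

If you know Python a bit better, you have probably heard of the GIL, short for Global Interpreter Lock, Python's global lock. It exists to protect data integrity and keep state consistent across threads. Because of it, Python threads cannot run truly in parallel: at any given moment only one thread is executing Python code on the CPU. This applies to CPython; Jython, for example, has no GIL. So for CPU-bound work on a multi-core machine, multiple processes are the recommended approach, since threads bring no speedup there and can even make things slower. Threads suit I/O-bound work, where a thread spends most of its time waiting and releases the GIL while it waits. A crawler is a typical I/O-bound program, so Python threads are a good fit for it.

If none of the above means much to you, don't worry; just remember which approach fits which scenario. Today is mainly about the crawler example, and the code is as follows: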

import threading

import requests


def save(html, file_absolute_path):
    # Write the page text as UTF-8 so non-ASCII content is stored correctly.
    with open(file_absolute_path, 'w', encoding='utf8') as file:
        file.write(html)


def crawl(req):
    # req["host"] must be a full URL including the scheme (http:// or https://).
    d = requests.get(req["host"])
    return d.text


class MyCrawler(threading.Thread):
    def __init__(self, req, file_path):
        threading.Thread.__init__(self, name="Crawler-{}".format(req["host"]))
        self.req = req
        self.file_path = file_path

    def run(self):
        # run() is the code that executes in the new thread after start().
        html = crawl(self.req)
        save(html, self.file_path)


def main():
    continue_input = True
    threads = []
    # Collect one (URL, output file) pair per crawler thread.
    while continue_input:
        host = input("host: ")
        file_path = input("output file absolute path: ")
        req = {"host": host}
        threads.append(MyCrawler(req, file_path))
        continue_input = input("add another? (y/N) ") == "y"

    # Start every crawler, then wait for all of them to finish.
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    main()
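To see why threads help an I/O-bound job like this one despite the GIL, here is a minimal sketch that needs no network. It assumes `time.sleep` is a fair stand-in for waiting on a download (both block the thread and release the GIL); the names `fake_fetch`, `run_sequential` and `run_threaded` are made up for this illustration and are not part of the crawler.

```python
import threading
import time


def fake_fetch(delay, results, index):
    # Hypothetical stand-in for requests.get(): sleeping blocks this thread
    # and releases the GIL, much like waiting for a server to respond.
    time.sleep(delay)
    results[index] = "page-{}".format(index)


def run_sequential(delays):
    # Fetch the "pages" one after another and time the whole batch.
    results = [None] * len(delays)
    start = time.perf_counter()
    for i, delay in enumerate(delays):
        fake_fetch(delay, results, i)
    return results, time.perf_counter() - start


def run_threaded(delays):
    # Fetch the same "pages" with one thread per download.
    results = [None] * len(delays)
    threads = [
        threading.Thread(target=fake_fetch, args=(delay, results, i))
        for i, delay in enumerate(delays)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every simulated download has finished
    return results, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 5
    _, seq_time = run_sequential(delays)
    _, thr_time = run_threaded(delays)
    print("sequential: {:.2f}s, threaded: {:.2f}s".format(seq_time, thr_time))
```

The sequential run has to wait for every delay in turn, while the threaded run should finish in roughly the time of the longest single delay, because the waits overlap. Replace the sleep with a tight arithmetic loop and the advantage should disappear, which is the CPU-bound case where multiple processes are the better choice.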


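This script is only a simple example. It defines a save() function that stores the data and a crawl() function that fetches a page's content using the third-party requests module. The MyCrawler class calls its parent's initializer and then overrides run(), which is where the code we want the thread to execute goes. I wrote an earlier article that covers how to use Python threads in detail; in general there are two common ways to do it, wrapping the work in a function or wrapping the thread object in a class. This example takes the class route: inherit directly from threading.Thread, then override __init__ and run. The example is simple, so I won't walk through the code line by line. If you're interested, you can modify this code to build a more sophisticated crawler.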
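For comparison, the function-based style mentioned above might look like the sketch below. It is only an illustration, not the code from the earlier article: `crawl_and_save`, `start_crawlers` and `fetch_with_requests` are hypothetical names, and the fetch function is passed in as a parameter so the sketch can be tried with a fake fetcher and no network.

```python
import threading


def crawl_and_save(fetch, url, file_path):
    # Same work as MyCrawler.run(), written as a plain function.
    html = fetch(url)
    with open(file_path, "w", encoding="utf8") as f:
        f.write(html)


def start_crawlers(fetch, jobs):
    # jobs is a list of (url, file_path) pairs; one thread is started per pair.
    threads = []
    for url, file_path in jobs:
        t = threading.Thread(
            target=crawl_and_save,
            args=(fetch, url, file_path),
            name="Crawler-{}".format(url),
        )
        t.start()
        threads.append(t)
    return threads


def fetch_with_requests(url):
    # Real fetcher; requests is imported here so the rest of the sketch
    # runs even where the third-party module is not installed.
    import requests
    return requests.get(url).text
```

A caller would run something like `threads = start_crawlers(fetch_with_requests, jobs)` and then call `join()` on each returned thread. Nothing is subclassed here: the work is handed to `threading.Thread` through `target` and `args`. The class-based version is a better fit once a crawler needs to carry its own state or extra methods.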
