With Python You Can Do Whatever You Want: Read Web Novels for Free and Never Run Out of Books

wxchong · 2024-08-19

I had some free time and wanted to read a novel, so I planned to download one to my computer. After searching for quite a while I couldn't find a site that offered downloads, so I decided to scrape the novel's content myself and save it locally.


Examine the structure

Each chapter page carries the three things the spider needs: the chapter title, the body text, and a 下一章 ("next chapter") link in the pagination bar that leads to the following chapter.
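Before wiring everything into Scrapy, it helps to confirm the XPaths against a live chapter page. A minimal sketch using parsel, the selector library Scrapy itself builds on (this standalone check, and the use of requests, is my own addition, not part of the project):

import requests
from parsel import Selector

# Fetch one chapter and try the same XPaths the spider will rely on
html = requests.get('http://www.6mao.com/html/40/40184/12601161.html').text
sel = Selector(text=html)
print(sel.xpath('//div[@id="content"]/h1/text()').get())      # chapter title
print(sel.xpath('//div[@id="neirong"]/text()').getall()[:3])  # first body lines
print(sel.xpath('//div[@class="s_page"]/a/@href').getall())   # pagination links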

Then create the Scrapy project, e.g. with scrapy startproject sixmao followed by scrapy genspider sixmaospider 6mao.com (the project name sixmao matches the module paths referenced in settings.py below):

The spider, sixmaospider.py:

# -*- coding: utf-8 -*-
import scrapy
from ..items import SixmaoItem

class SixmaospiderSpider(scrapy.Spider):
    name = 'sixmaospider'
    #allowed_domains = ['6mao.com']
    start_urls = ['http://www.6mao.com/html/40/40184/12601161.html']  # 圣墟

    def parse(self, response):
        # Extract the chapter title and the body text
        novel_biaoti = response.xpath('//div[@id="content"]/h1/text()').extract_first()
        novel_neirong = response.xpath('//div[@id="neirong"]/text()').extract()

        novelitem = SixmaoItem()
        novelitem['novel_biaoti'] = novel_biaoti
        # Every other text node in #neirong is an empty spacer, so keep
        # only every second node and join them into one chapter body
        novelitem['novel_neirong'] = '\n'.join(novel_neirong[::2])
        yield novelitem

        # 下一章: the third link in the pagination bar points to the next chapter
        nextPageURL = response.xpath('//div[@class="s_page"]/a/@href').extract()
        if len(nextPageURL) > 2 and nextPageURL[2]:
            nexturl = response.urljoin(nextPageURL[2])
            print('下一章', nexturl)
            # Request the next chapter and parse it with this same callback
            yield scrapy.Request(nexturl, callback=self.parse)
        else:
            print("退出")  # no next-chapter link left, stop crawling
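Each call to parse() thus yields one item for the current chapter and then schedules a request for the next one, so the spider walks through the book chapter by chapter. The duplicate filter stays on (dont_filter defaults to False), which also protects against pagination links that loop back on themselves.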

pipelinesio.py saves the content to a local file:

import os

class SixmaoPipeline(object):
    def process_item(self, item, spider):
        # Make sure the output directory exists, then append the
        # chapter (title plus body) to a single text file
        os.makedirs('./data', exist_ok=True)
        with open('./data/圣墟.txt', 'a', encoding='utf-8') as fp:
            fp.write(item['novel_biaoti'] + '\n')
            fp.write(item['novel_neirong'] + '\n\n')
        print('写入文件成功')  # chapter written successfully
        return item
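The pipeline only takes effect once it is registered under ITEM_PIPELINES, which is done in settings.py below.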

items.py

import scrapy

class SixmaoItem(scrapy.Item):
    novel_biaoti = scrapy.Field()   # chapter title
    novel_neirong = scrapy.Field()  # chapter body

startsixmao.py; right-click this file and run it, and the project starts crawling:

from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'sixmaospider'])
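Running this file is equivalent to executing scrapy crawl sixmaospider from the project root; the wrapper just makes the crawl easy to launch from an IDE.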

settings.py

LOG_LEVEL = 'INFO'      # log at INFO level
LOG_FILE = 'novel.log'  # and write the log to a file

DOWNLOADER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoDownloaderMiddleware': 543,
    # Disable the built-in User-Agent middleware (note: the old
    # scrapy.contrib.* path has been removed; use the path below) ...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # ... and replace it with the rotating User-Agent middleware
    'sixmao.rotate_useragent.RotateUserAgentMiddleware': 400,
}

ITEM_PIPELINES = {
    #'sixmao.pipelines.SixmaoPipeline': 300,
    'sixmao.pipelinesio.SixmaoPipeline': 300,  # the pipeline that writes to disk
}

SPIDER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoSpiderMiddleware': 543,
}
# Nothing else in settings.py should need to change.
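The numbers in these dicts are priorities: downloader middlewares with lower values sit closer to the engine and have process_request called earlier. Mapping the built-in UserAgentMiddleware to None disables it entirely, so RotateUserAgentMiddleware at priority 400 is the only thing setting the header.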

rotate_useragent.py rotates the User-Agent header on every request, disguising the crawler as an ordinary browser so the server is less likely to ban it:

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

# RotateUserAgentMiddleware extends the stock UserAgentMiddleware.
# Many sites use anti-crawler checks that reject Scrapy's default
# User-Agent, so each request gets a header randomly chosen from the
# list below, making the crawler look like an ordinary browser.
class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Rotate the User-Agent on every outgoing request
        ua = random.choice(self.user_agent_list)
        if ua:
            print(ua)  # show which User-Agent was picked
            request.headers.setdefault('User-Agent', ua)

    # The list below is mostly Chrome strings; more can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
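To sanity-check the middleware without starting a crawl, you can invoke it directly on a Scrapy Request object. A minimal sketch (my own addition; it assumes the sixmao package is importable, e.g. when run from the project root):

from scrapy.http import Request
from sixmao.rotate_useragent import RotateUserAgentMiddleware

mw = RotateUserAgentMiddleware()
req = Request('http://www.6mao.com/')
mw.process_request(req, spider=None)  # spider is unused by this middleware
print(req.headers['User-Agent'])      # one of the browser strings above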

The final run result: the chapters end up appended, one after another, to ./data/圣墟.txt.

And there you have it, a small Scrapy project!
