2020-03-05
Scraping full novels from 笔趣阁 (Biquge) and saving the chapters to a database
The main tools used in this post:
- Python 3.8
- XPath (via lxml)
- MySQL
First, design the database:
- One table per novel, with the table named after the novel.
- Each table has three columns: id, name, and nickname, holding the row number, the chapter title, and the chapter text respectively.
The SQL statement for creating a table (with the novel name filled in for %s) is:

CREATE TABLE IF NOT EXISTS `%s` (
    `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `name` VARCHAR(200) NOT NULL,
    `nickname` TEXT NULL
)
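One thing worth noting: MySQLdb cannot bind a table name as a query parameter, so the novel name has to be formatted into the SQL string itself. A minimal sketch of creating one per-novel table, assuming a local MySQL server with the same xiaoshuo database and root credentials used in the full script below:

import MySQLdb

def create_novel_table(novel_name):
    conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='root',
                           db='xiaoshuo', port=3306, charset='utf8')
    try:
        # The table name is interpolated directly (parameters only work for
        # values), so strip backticks from it to keep the statement valid.
        table = novel_name.replace('`', '')
        sql = ("CREATE TABLE IF NOT EXISTS `%s` ("
               "`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,"
               "`name` VARCHAR(200) NOT NULL,"
               "`nickname` TEXT NULL)" % table)
        conn.cursor().execute(sql)
        conn.commit()
    finally:
        conn.close()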
Next, the scraping approach:
- Step 1: crawl the index page and extract each novel's name and link (a minimal sketch of this step follows below).
- Step 2: crawl each novel's page to get its chapter names and chapter links.
- Step 3: crawl the main content of each chapter.
For convenience, the content is stored in two places: written to local folders via os, and inserted into the database.
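As a minimal sketch of step 1, using the same index URL and XPath expressions as the full script below, fetching the novel list looks roughly like this:

import requests
from lxml import etree

url = "http://www.xbiquge.la/xiaoshuodaquan/"
response = requests.get(url, timeout=10)
response.encoding = 'utf-8'
html = etree.HTML(response.text)
links = html.xpath('//div[@class="novellist"]/ul/li/a/@href')   # novel links
names = html.xpath('//div[@class="novellist"]/ul/li/a/text()')  # novel names
for name, link in zip(names, links):
    print(name, link)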
The complete code is as follows:
#coding:utf-8
import os
import re
from threading import Thread

import requests
from lxml import etree
import MySQLdb


class PAQU(object):
    def first(self, url):
        # Step 1: fetch the index page and collect every novel's name and link.
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException:
            print("Request for the index page failed or timed out")
        else:
            html = etree.HTML(response.text)  # parse into an lxml HTML tree
            First_list = html.xpath('//div[@class="novellist"]/ul/li/a/@href')   # novel links
            First_name = html.xpath('//div[@class="novellist"]/ul/li/a/text()')  # novel names
            print(First_list, First_name)
            for first_list, first_name in zip(First_list, First_name):
                if not os.path.exists(first_name):  # one folder per novel
                    os.makedirs(first_name)
                conn = MySQLdb.connect(
                    host='127.0.0.1',
                    user='root',
                    passwd='root',
                    db='xiaoshuo',
                    port=3306,
                    charset='utf8'
                )
                # Table names cannot be bound as query parameters, so the novel
                # name is formatted into the statement directly.
                sql = ("CREATE TABLE IF NOT EXISTS `%s` ("
                       "`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,"
                       "`name` VARCHAR(200) NOT NULL,"
                       "`nickname` TEXT NULL)" % first_name.replace('`', ''))
                cursor = conn.cursor()
                cursor.execute(sql)
                conn.commit()
                conn.close()
                self.second(first_list, first_name)

    def second(self, first, first_name):
        # Step 2: fetch a novel's page and collect its chapter names and links.
        try:
            response = requests.get(first, timeout=10)
            response.encoding = 'utf-8'
        except requests.exceptions.RequestException:
            print("Request for the novel page failed or timed out")
        else:
            html = etree.HTML(response.text)
            Second_list = html.xpath('//div[@class="box_con"]/div[@id="list"]/dl/dd/a/@href')        # chapter links
            Second_list_name = html.xpath('//div[@class="box_con"]/div[@id="list"]/dl/dd/a/text()')  # chapter names
            for second, second_name in zip(Second_list, Second_list_name):
                self.third(second, second_name, first_name)

    def third(self, second, second_name, first_name):
        # Step 3: fetch a chapter's text, insert it into the database and also
        # write it to a local text file.
        try:
            response = requests.get("http://www.xbiquge.la" + second, timeout=10)
            response.encoding = 'utf-8'
        except requests.exceptions.RequestException:
            print("Request for the chapter page failed or timed out")
        else:
            html = etree.HTML(response.text)
            Third_list = "\n".join(html.xpath(
                '//div[@class="content_read"]/div[@class="box_con"]/div[@id="content"]/text()'))  # chapter text
            conn = MySQLdb.connect(
                host='127.0.0.1',
                user='root',
                passwd='root',
                db='xiaoshuo',
                port=3306,
                charset='utf8'
            )
            # The table name is formatted in; the chapter name and text are
            # passed as bound parameters.
            sql = ("INSERT INTO `%s` (`name`,`nickname`) VALUES (%%s,%%s)"
                   % first_name.replace('`', ''))
            cursor = conn.cursor()
            cursor.execute(sql, (second_name, Third_list))
            conn.commit()
            conn.close()
            # Replace characters that are not allowed in file names.
            rstr = r"[\/\\\:\*\?\"\<\>\|]"  # / \ : * ? " < > |
            second_name1 = re.sub(rstr, "_", second_name)
            second_name2 = re.sub(r"[\\'\t]", "_", second_name1)
            file_name = os.path.join(first_name, second_name2 + ".txt")
            print("Writing file " + file_name)
            try:
                with open(file_name, "a", encoding="utf-8") as f:
                    f.write(Third_list)
            except OSError:
                print("Invalid file name")


if __name__ == "__main__":
    url1 = "http://www.xbiquge.la/xiaoshuodaquan/"
    paqu = PAQU()
    # Pass the target function and its argument separately so the crawl runs
    # inside the thread instead of being called here directly.
    t1 = Thread(target=paqu.first, args=(url1,))
    t1.start()
    t1.join()
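To quickly check that the chapters really landed in MySQL, a small query sketch using the same connection settings; table_name is a placeholder for a novel that has already been crawled:

import MySQLdb

conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='root',
                       db='xiaoshuo', port=3306, charset='utf8')
cursor = conn.cursor()
table_name = "some_novel"  # placeholder: replace with an actual novel name
cursor.execute("SELECT `id`, `name` FROM `%s` ORDER BY `id`" % table_name)
for row_id, chapter in cursor.fetchall():
    print(row_id, chapter)
conn.close()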