小涩席 posted on 2020-3-15 20:43

[Python Crawler] Study the new thought, strive to be the new youth. 党建网!

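A forum member asked for this crawler, so here it is. It can be used to study the new thought and raise one's political awareness. The code is below: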

# -*- coding: utf-8 -*-
# http://dangjian.com/djw2016sy/djw2016wkztl/wkztl2016xihy/index.shtml
# Author:XSX
# Python3.8 PyCharm Community Edition 2019.3.3

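import os
import time
from urllib.parse import urljoin

import requests
from lxml import etree


def GetHomeLinks(url, headers):
    """Collect the article links listed on the index page."""
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    html = etree.HTML(r.text)
    HomeLinks = html.xpath('//div[@class="main-left"]/ul/li/div/a/@href')
    # Resolve each href against the index URL the way a browser would,
    # instead of gluing strings together.
    HomepageLinks = [urljoin(url, HomeLink) for HomeLink in HomeLinks]
    print(HomepageLinks)
    return HomepageLinks


def DownloadPage(HomepageLinks, headers):
    """Save the title and body text of each article to ./News/<title>.txt."""
    os.makedirs("./News", exist_ok=True)
    for HomepageLink in HomepageLinks:
        time.sleep(3)  # pause between requests to go easy on the server
        r1 = requests.get(HomepageLink, headers=headers, timeout=10)
        r1.encoding = r1.apparent_encoding  # let requests guess the page charset
        html1 = etree.HTML(r1.text)
        Titles = html1.xpath('//div[@id="title_tex"]/text()')
        Textdatas = html1.xpath('//div[@class="TRS_Editor"]/p/text()')
        # Join the text nodes rather than writing the repr of the lists.
        Title = ''.join(Titles).strip()
        if not Title:
            print("No title found, skipped:", HomepageLink)
            continue
        Text = '\n'.join(Textdatas).replace('\xa0', '').replace('\u3000', '')
        NeiRong = Title + '\n' + Text
        # Write UTF-8 explicitly; the platform default may not handle Chinese text.
        with open('./News/' + Title + '.txt', 'a', encoding='utf-8') as f:
            f.write(NeiRong)
        print("Saved!")
    print("All downloaded!")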

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36',
        'Cookie': 'wdcid=7c80b781c03f1605; wdlast=1583386171'
    }
    url = "http://dangjian.com/djw2016sy/djw2016wkztl/wkztl2016xihy/index.shtml"
    DownloadPage(GetHomeLinks(url, headers), headers)
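
One thing to watch: the script uses each page title directly as the file name, so a title containing characters such as / or ? could make open() fail. Below is a minimal sketch of one way to clean a title first. The helper name safe_filename, the fallback name "untitled" and the 100-character limit are my own assumptions, not part of the script above, and the character list is only the usual set that Windows rejects.

```python
import re

# Characters Windows forbids in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, fallback="untitled", max_len=100):
    """Return a version of `title` that can be used as a file name.

    Hypothetical helper: the name, fallback and length limit are illustrative.
    """
    # Replace every forbidden character with an underscore.
    name = _ILLEGAL.sub("_", title)
    # Drop no-break and ideographic spaces, then trim whitespace and trailing dots.
    name = name.replace("\xa0", "").replace("\u3000", "").strip().rstrip(". ")
    # Truncate long titles; fall back to a fixed name if nothing is left.
    return name[:max_len] or fallback


if __name__ == "__main__":
    print(safe_filename('  A/B: "test"?  '))
```

In DownloadPage the title would be passed through this helper before it is joined onto './News/', leaving the text written into the file unchanged.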

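KbRDG16 posted on 2022-2-25 20:08

Exactly what I needed, thanks a lot!

lKcE posted on 2022-2-28 10:26

Thanks for sharing

zarDKloV342 posted on 2022-3-20 18:17

Thanks for sharing

XeTI3 posted on 2022-4-12 18:59

This looks good, thanks. I'll take a look

vDyxMg0629 posted on 2022-4-15 16:44

Looks really good, replying to take a look

sjhvBc posted on 2022-4-17 08:12

Thanks, OP

ChfZm7 posted on 2022-4-17 08:19

Thanks for sharing

WsOZzodHtcip posted on 2022-4-20 03:23

Thanks for sharing

Nrxu posted on 2022-4-21 15:41

Thanks, OP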