How to Scrape UC News Content with a Proxy IP
2022-09-23
A quality IP proxy can handle many kinds of network work; large-scale online data capture, for instance, relies on IP proxies in practice. Below, IPIDEA walks through a tutorial on crawling the content of a news site.
IPIDEA uses the UC news site as the example:
The site has no complicated anti-crawler protection, so we can simply fetch the pages and parse them directly.
from bs4 import BeautifulSoup
from urllib import request

def download(title, url):
    req = request.Request(url)
    response = request.urlopen(req)
    response = response.read().decode('utf-8')
    soup = BeautifulSoup(response, 'lxml')
    tag = soup.find('div', class_='sm-article-content')
    if tag is None:  # article body not found, skip this page
        return 0
    # Strip characters that Windows does not allow in file names
    title = title.replace(':', '')
    title = title.replace('"', '')
    title = title.replace('|', '')
    title = title.replace('/', '')
    title = title.replace('\\', '')
    title = title.replace('*', '')
    title = title.replace('<', '')
    title = title.replace('>', '')
    title = title.replace('?', '')
    with open('D:\\code\\python\\spider_news\\UC_news\\society\\' + title + '.txt',
              'w', encoding='utf-8') as file_object:
        file_object.write('\n')
        file_object.write(title)
        file_object.write('\n')
        file_object.write('该新闻地址:')  # "URL of this news item:"
        file_object.write(url)
        file_object.write('\n')
        file_object.write(tag.get_text())
    # print('正在爬取')  # "crawling..."

if __name__ == '__main__':
    # NOTE: this loop re-requests the same listing page seven times, since the
    # URL does not change with i; pagination was presumably the original intent.
    for i in range(0, 7):
        url = 'https://news.uc.cn/c_shehui/'
        # headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
        #            "cookie": "sn=3957284397500558579; _uc_pramas=%7B%22fr%22%3A%22pc%22%7D"}
        # res = request.Request(url, headers=headers)
        res = request.urlopen(url)
        req = res.read().decode('utf-8')
        soup = BeautifulSoup(req, 'lxml')
        # print(soup.prettify())
        tag = soup.find_all('div', class_='txt-area-title')  # headline blocks on the listing page
        # print(tag.name)
        for x in tag:
            news_url = 'https://news.uc.cn' + x.a.get('href')
            print(x.a.string, news_url)
            download(x.a.string, news_url)
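Note that the script above connects to UC directly and never actually goes through a proxy. To route it through a proxy IP, urllib's ProxyHandler can be installed globally, as in the minimal sketch below; the address 127.0.0.1:8080 is only a placeholder for whatever host, port, and credentials your proxy provider assigns you.

from urllib import request

# A minimal sketch: install a global opener so every later request.urlopen()
# call is routed through the proxy. The address below is a placeholder; replace
# it with the proxy endpoint from your provider.
proxy_handler = request.ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
})
opener = request.build_opener(proxy_handler)
request.install_opener(opener)

# All subsequent calls in the crawler, e.g. request.urlopen(url), now go
# through the proxy without any other changes to the code above.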
With that, the UC data scrape is complete; check the run results to confirm whether the data was fetched successfully.
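If you want to confirm programmatically rather than by eye, a quick count of the saved files works; the directory below is the one assumed in the script above.

import os

# Count the .txt files the crawler wrote (path taken from the script above)
out_dir = 'D:\\code\\python\\spider_news\\UC_news\\society'
txt_files = [f for f in os.listdir(out_dir) if f.endswith('.txt')]
print(len(txt_files), 'articles saved')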
Disclaimer: This article is a contributed post from the web and does not represent IPIDEA's position. If it raises infringement or compliance concerns, please contact IPIDEA promptly for removal.