Spray's Blog


Scraping a Zhihu Collection

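Here's the situation: I've saved a lot of articles to my Zhihu collections, but Zhihu collections have no search function, which makes it hard to find an article I saved earlier.

So I decided to scrape my Zhihu collection with Python and save it locally, where searching it is easy.

That said, I'm not very good at writing scrapers, so I went with the simple route: simulate a browser, download each page, and then parse the saved HTML.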
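
The "simulate a browser" part mostly comes down to scrolling until the page stops growing. Here is a minimal sketch of that idea, pulled apart from Selenium so it can be tried without a browser. The names `scroll_until_stable`, `get_height`, `scroll_down`, `pause`, and `max_rounds` are hypothetical and not part of my script; `max_rounds` is an added safety cap, assuming you would rather stop than loop forever on a page that never stops loading.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_down: Callable[[], None],
    pause: float = 2.0,
    max_rounds: int = 50,
) -> int:
    """Scroll until the page height stops changing and return the final height.

    Hypothetical helper: `get_height` and `scroll_down` are supplied by the
    caller, and `max_rounds` is an assumed upper bound on scroll attempts.
    """
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_down()
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return last_height


# With a Selenium driver, the two callables could be (assumption, untested here):
#   get_height  = lambda: driver.execute_script("return document.body.scrollHeight")
#   scroll_down = lambda: driver.execute_script(
#       "window.scrollTo(0, document.body.scrollHeight);")
```

Passing the two actions in as callables is only a convenience for testing; inside a one-off script, an inline loop does the same job.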

Here's the code (it's ugly). It's commented and fairly simple.

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.edge.service import Service
from bs4 import BeautifulSoup
import time

# Path to msedgedriver
edge_driver_path = 'D:\\edgedriver_win64\\msedgedriver.exe'
options = Options()
# Headless mode, i.e. no browser window is shown
options.add_argument("--headless=new")
service = Service(edge_driver_path)
driver = webdriver.Edge(service=service, options=options)
# Number of pages in the collection
count = 47

for i in range(1, count+1):
    print('Group '+str(i))
    # URL of the i-th page of the collection
    driver.get('https://www.zhihu.com/collection/920424365?page='+str(i))
    # Record the current height of the page
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom and wait two seconds for content to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        # Measure the page height again
        new_height = driver.execute_script("return document.body.scrollHeight")
        # The height did not change, so we have reached the end of the page
        if new_height == last_height:
            break
        last_height = new_height
    # Grab the page source and save it
    html_content = driver.page_source
    with open(str(i)+'.html', 'w', encoding='utf-8') as file:
        file.write(html_content)
# Quit the browser
driver.quit()

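linkid = 0
output = ''
for i in range(1, count + 1):
    # Read the saved page (the with block closes the file afterwards)
    path = str(i)+'.html'
    with open(path, 'r', encoding='utf-8') as htmlfile:
        htmlhandle = htmlfile.read()
    # Parse it
    soup = BeautifulSoup(htmlhandle, 'lxml')
    # Find all links
    links = soup.find_all('a')
    for link in links:
        # Keep only the title links
        if link.get('data-za-detail-view-element_name') == 'Title':
            # Skip anchors without an href, which would otherwise raise a TypeError
            href = link.get('href')
            if not href:
                continue
            # The hrefs are expected to be protocol-relative ("//...");
            # leave anything else untouched
            url = 'https:'+href if href.startswith('//') else href
            # Emit a Markdown list entry
            linkid = linkid+1
            output = output+str(linkid)+'. ['+link.get_text()+']('+url+')\n'
# Save the result
with open('zhihu.md', 'w', encoding='utf-8') as file:
    file.write(output)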
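
Since the whole point was to be able to search, here is a rough sketch of a keyword search over the generated `zhihu.md`. It assumes every entry is a single line of the form `N. [title](url)`, which is what the script writes; `search_entries` and `ENTRY_RE` are names made up for this example, and the keyword `"Python"` is only a placeholder.

```python
import re
from pathlib import Path

# Matches lines like "12. [Some title](https://example.com/p/1)"
ENTRY_RE = re.compile(r"^(\d+)\. \[(.*)\]\((\S+)\)$")


def search_entries(markdown_text: str, keyword: str) -> list[tuple[int, str, str]]:
    """Return (number, title, url) for entries whose title contains the keyword."""
    needle = keyword.casefold()  # case-insensitive comparison
    hits = []
    for line in markdown_text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match and needle in match.group(2).casefold():
            hits.append((int(match.group(1)), match.group(2), match.group(3)))
    return hits


if __name__ == "__main__":
    path = Path("zhihu.md")
    if path.exists():
        text = path.read_text(encoding="utf-8")
        for number, title, url in search_entries(text, "Python"):
            print(f"{number}. {title}\n   {url}")
```

It only matches against titles, because titles and links are all the script saves. Of course, opening `zhihu.md` in any editor and pressing Ctrl+F works just as well.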

I drew on AI-generated code and a number of related blog posts.

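If you want to use it, remember to install the dependencies first (the script imports selenium and bs4, uses the lxml parser, and needs msedgedriver). I can't be bothered to explain how, so go search for it yourself (runs away~).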

Code download: main.py

Tech Ramblings - Feb 12, 2025