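Python Web Scraping

Notes on web scraping with Python.

On my small 1-core / 2 GB VM:

Selenium always failed with errors, whether it was installed locally or driven as a remote WebDriver through the selenium/standalone-chrome Docker image.
requests_html + pyppeteer did not work either: a small unit test ran fine, but under a heavier task load many chrome processes were left behind and the server froze. pyppeteer is also no longer maintained, and Playwright is the recommended replacement.
In the end, Playwright was the only tool that ran reliably on such limited resources.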

Playwright browser automation
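As a rough illustration of the setup that ended up working, a minimal sketch of fetching a rendered page with Playwright's synchronous API might look like this. The function name fetch_rendered_html and the default timeout are placeholders that do not come from the original notes, and the sketch assumes `pip install playwright` and `playwright install chromium` have already been run.

```python
def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Open `url` in headless Chromium and return the rendered HTML."""
    # Imported lazily so that defining the helper does not require Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "domcontentloaded" returns before images and stylesheets finish loading.
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return page.content()
        finally:
            # Always close the browser so no Chromium processes are left behind.
            browser.close()
```

A call such as fetch_rendered_html("https://example.com") returns the page source after JavaScript has run. Closing the browser in a finally block is a deliberate choice here, given the leftover chrome processes described above.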

Selenium browser automation

SeleniumHQ / selenium
https://github.com/SeleniumHQ/selenium

Selenium with Python
https://selenium-python.readthedocs.io/index.html
https://github.com/baijum/selenium-python

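Selenium is an open-source web automation testing framework that supports multiple browsers (Chrome, Firefox, Edge, etc.) and operating systems (Windows, macOS, Linux). Its core function is to simulate user actions (clicking, typing, navigating, etc.) in code, which makes it suitable for automated testing, data scraping, and business-process automation.

I recommend using Playwright instead of Selenium.

Installation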
pip install selenium

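Selenium pageLoadStrategy (page load strategy)

pageLoadStrategy is a Selenium setting that controls the page load strategy. It determines at what point Selenium considers a page loaded, so that the automation code that follows can continue.

normal: the default strategy. Selenium waits until the page has fully loaded, including all resources such as images and stylesheets.
eager: Selenium considers the page loaded once the DOMContentLoaded event fires. The DOM tree has been built, but stylesheets, images, and other resources may still be loading.
none: Selenium does not wait for any page load event and returns control to the automation code as soon as the initial HTML has been downloaded. The page may still be loading while the code is already running.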
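One way to keep the three strategies apart is by the document.readyState value each one waits for, as listed in the Selenium documentation (normal: complete, eager: interactive, none: no waiting). The sketch below only models that mapping in plain Python. The names READY_STATE_FOR and is_considered_loaded are made up for illustration and are not part of Selenium's API.

```python
# document.readyState values, in the order a page passes through them.
READY_STATES = ["loading", "interactive", "complete"]

# readyState each pageLoadStrategy waits for; None means "do not wait at all".
READY_STATE_FOR = {
    "normal": "complete",
    "eager": "interactive",
    "none": None,
}


def is_considered_loaded(strategy: str, ready_state: str) -> bool:
    """Return True if a page in `ready_state` counts as loaded under `strategy`."""
    required = READY_STATE_FOR[strategy]
    if required is None:
        return True
    return READY_STATES.index(ready_state) >= READY_STATES.index(required)


for strategy in READY_STATE_FOR:
    print(strategy, [s for s in READY_STATES if is_considered_loaded(strategy, s)])
```

With the none strategy, the automation code itself has to wait for whatever elements it needs, because get() returns before they necessarily exist.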

Set it through Options:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions

# chrome
options = ChromeOptions()
options.page_load_strategy = 'none'  # none: do not wait for the full load, continue with the following code immediately
driver = webdriver.Chrome(options=options)

# firefox
options = FirefoxOptions()
options.page_load_strategy = 'none'  # none: do not wait for the full load, continue with the following code immediately
driver = webdriver.Firefox(options=options)

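Alternatively, set it as a capability. The desired_capabilities argument of webdriver.Chrome() was deprecated in Selenium 4 and removed in later 4.x releases, so the capability is set on the options object instead:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions

options = ChromeOptions()
options.set_capability("pageLoadStrategy", "none")  # do not wait for the full load, continue with the following code immediately
driver = webdriver.Chrome(options=options)
driver.get(url)  # returns immediately without blocking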

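Connecting Selenium to a remote driver

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions

remote_driver_url = "http://localhost:4444/wd/hub"  # address of the selenium/standalone-chrome container

options = ChromeOptions()
options.add_argument("--headless=new")  # headless mode: run the browser without a graphical user interface
options.add_argument("--no-sandbox")  # disable the sandbox
options.add_argument("--disable-gpu")
options.add_argument("--remote-allow-origins=*")
options.add_argument("--disable-dev-shm-usage")  # do not use /dev/shm, which prevents out-of-memory crashes
options.add_argument("--disable-blink-features=AutomationControlled")  # evade automation detection
options.add_experimental_option("excludeSwitches", ["enable-automation"])  # hide the automation flag
driver = webdriver.Remote(command_executor=remote_driver_url, options=options)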

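Pages scraped with Selenium come back in English

Problem:
The selenium/standalone-chrome image starts with English as its default language, so pages are requested with English preferred and the scraped content is in English.

Solution:
1. Add a lang argument when creating the driver:

# chrome
options = ChromeOptions()
options.add_argument("--lang=zh-CN.UTF-8")
# firefox
options = FirefoxOptions()
options.set_preference("intl.accept_languages", "zh-CN.UTF-8")

For the site I was scraping, this setting was enough.

2. If that does not work, set the LANG environment variable to zh_CN.UTF-8 when starting the container: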

docker run -d -p 4444:4444 \
    --shm-size=2g \
    -e TZ=Asia/Shanghai \
    -e LANG=zh_CN.UTF-8 \
    selenium/standalone-chrome

Installing ChromeDriver manually on macOS

1. Check the Chrome version:
Enter this in the browser address bar:
chrome://version/

Google Chrome    133.0.6943.142 (Official Build) (arm64)
Revision    f217c2438a8e1f4b9e730de378ce20f754b2c3d0-refs/branch-heads/6943@{#1913}
OS    macOS Version 15.3.1 (Build 24D70)
JavaScript    V8 13.3.415.23
User Agent    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36
Command Line    /Applications/Google Chrome.app/Contents/MacOS/Google Chrome --restart --flag-switches-begin --flag-switches-end
Executable Path    /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
Profile Path    /Users/masi/Library/Application Support/Google/Chrome/Default
Linker    lld
Command-line variations    eyJkaXNhYmxlLWZlYXR1cmVzIjoiQXVkaW9JbnB1dENvbmZpcm1SZWFkc1Zp...
Active Variations    e61eae14-377be55a

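2. Download the driver whose version matches the Chrome browser:
https://googlechromelabs.github.io/chrome-for-testing/

3. After unzipping, put the chromedriver executable from chromedriver-mac-arm64 into /usr/local/bin or into the bin directory of the current virtual environment.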
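Step 2 depends on the driver version matching the browser version. A small sketch of that check follows, assuming dotted version strings in the form shown by chrome://version and treating an equal major version as the minimum requirement. The helper names and the driver version strings are hypothetical.

```python
def major_version(version: str) -> int:
    """Return the major component of a dotted version such as '133.0.6943.142'."""
    return int(version.strip().split(".")[0])


def same_major(browser_version: str, driver_version: str) -> bool:
    """Compare only the major versions of the browser and the driver."""
    return major_version(browser_version) == major_version(driver_version)


# The browser version comes from the output above; the driver versions are made up.
print(same_major("133.0.6943.142", "133.0.6943.141"))
print(same_major("133.0.6943.142", "132.0.6834.83"))
```

The installed driver reports its own version through `chromedriver --version`, which can be fed into the same comparison.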

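webdriver_manager: installing browser drivers automatically

pip install webdriver_manager
webdriver_manager is a Python library for managing web drivers. It automatically downloads and sets up the drivers for different browsers (Chrome, Firefox, Edge, etc.) so they can be used in automated tests.
Scraping with Selenium requires a web driver that matches the browser in order to control it. webdriver_manager detects the installed browser version and downloads the matching driver, so there is no need to download and configure the driver by hand.

Usage: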

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Install ChromeDriver with ChromeDriverManager; install() returns the driver path
driver_path = ChromeDriverManager().install()
# Print the driver path
print(driver_path)

# Create the ChromeDriver service with the driver path
service = Service(driver_path)
# Create the Chrome WebDriver with that service
driver = webdriver.Chrome(service=service)
# Open the Baidu home page
driver.get("https://www.baidu.com")

Selenium 4.6+ ships with Selenium Manager, which downloads browser drivers automatically

Introducing Selenium Manager
https://www.selenium.dev/blog/2022/introducing-selenium-manager/

Selenium 4.6.0 Released!
https://www.selenium.dev/blog/2022/selenium-4-6-0-released/

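Starting with Selenium 4.6 there is no need to download browser drivers by hand or to use the third-party webdriver_manager package: Selenium includes its own driver management tool, Selenium Manager.
Just create a Chrome or Firefox instance directly: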

driver = webdriver.Chrome(options=options)
driver = webdriver.Firefox(options=options)

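When webdriver.Firefox() or webdriver.Chrome() is called without an explicit service argument, Selenium uses the built-in Selenium Manager to locate or download the driver:
if no browser driver is found on the system, it downloads the driver matching the browser version from the official source and caches it locally (default path ~/.cache/selenium), so it does not have to be downloaded again.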
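To see what Selenium Manager has cached, the directory can simply be listed. This is a sketch using only the standard library. It assumes the default cache path mentioned above, and the helper name list_cached_drivers is made up for illustration.

```python
from pathlib import Path


def list_cached_drivers(cache_dir=None) -> list:
    """Return the relative paths of all files under the Selenium Manager cache."""
    if cache_dir is None:
        # Default location mentioned above.
        cache_dir = Path.home() / ".cache" / "selenium"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(
        str(path.relative_to(cache_dir))
        for path in cache_dir.rglob("*")
        if path.is_file()
    )
```

Calling list_cached_drivers() returns an empty list when nothing has been downloaded yet. Deleting the directory forces a fresh download on the next run.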

Starting the selenium chrome driver with Docker (abandoned)

https://hub.docker.com/r/selenium/standalone-chrome

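Why I gave up:
1. At first, on Docker 19.03.5, selenium/standalone-chrome kept logging errors after startup, the port was unreachable, and nothing could connect. After upgrading to Docker 26.1.4 the container started successfully.
2. But connecting to the WebDriver from Python with webdriver.Remote(command_executor='http://localhost:4444/wd/hub', options=options) always failed with an error, probably because my 1-core / 2 GB VM did not have enough resources.

Starting the selenium/standalone-chrome container

Docker version: 26.1.4
docker pull selenium/standalone-chrome

Start the container: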

docker run -d --rm \
--name selenium-chrome \
--network=host \
-e SE_NODE_PORT=8444 \
-e SE_VNC_PORT=8900 \
-e TZ=Asia/Shanghai \
-e LANG=zh_CN.UTF-8 \
-e SE_VNC_NO_PASSWORD=1 \
-e SE_NODE_SESSION_TIMEOUT=180 \
--ulimit nofile=32768:32768 \
--shm-size="2g" \
selenium/standalone-chrome

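--shm-size: sets the size of the container's shared memory (/dev/shm), which prevents crashes caused by the container running out of memory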

selenium-chrome does not work on Docker 19.x

On Docker 19.03.5, the selenium-chrome container always logged errors after starting and localhost:4444 did not respond. Upgrading Docker to the latest version fixed it.

$ docker logs -f selenium-chrome
Virtual environment detected at /opt/venv, activating...
Python 3.12.3
2025-06-21 14:38:06,702 INFO Included extra file "/etc/supervisor/conf.d/chrome-cleanup.conf" during parsing
2025-06-21 14:38:06,702 INFO Included extra file "/etc/supervisor/conf.d/recorder.conf" during parsing
2025-06-21 14:38:06,702 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2025-06-21 14:38:06,702 INFO Included extra file "/etc/supervisor/conf.d/uploader.conf" during parsing
2025-06-21 14:38:06,710 INFO RPC interface 'supervisor' initialized
2025-06-21 14:38:06,710 INFO supervisord started with pid 8
2025-06-21 14:38:07,713 INFO spawned: 'xvfb' with pid 9
2025-06-21 14:38:07,716 INFO spawned: 'vnc' with pid 10
2025-06-21 14:38:07,718 INFO spawned: 'novnc' with pid 11
2025-06-21 14:38:07,720 INFO spawned: 'selenium-standalone' with pid 12
2025-06-21 14:38:07,750 INFO success: xvfb entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2025-06-21 14:38:07,750 INFO success: vnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2025-06-21 14:38:07,750 INFO success: novnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2025-06-21 14:38:07,750 INFO success: selenium-standalone entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2025-06-21 14:38:07,785 WARN exited: novnc (exit status 1; not expected)

Testing localhost:4444/wd/hub/status

Query the status endpoint with curl http://localhost:4444/wd/hub/status

$ curl http://localhost:4444/wd/hub/status
{
  "value": {
    "ready": true,
    "message": "Selenium Grid ready.",
    "nodes": [
      {
        "id": "8c69cf9c-95d0-4dd7-8a7e-8ec9f0afbf4d",
        "uri": "http:\u002f\u002f172.17.0.1:4444",
        "maxSessions": 1,
        "sessionTimeout": 180000,
        "osInfo": {
          "arch": "amd64",
          "name": "Linux",
          "version": "3.10.0-957.1.3.el7.x86_64"
        },
        "heartbeatPeriod": 30000,
        "availability": "UP",
        "version": "4.33.0 (revision 2c6aaad03a)",
        "slots": [
          {
            "id": {
              "hostId": "8c69cf9c-95d0-4dd7-8a7e-8ec9f0afbf4d",
              "id": "3d57a713-fcff-4ac5-8bbb-92f23600fadc"
            },
            "lastStarted": "1970-01-01T00:00:00Z",
            "session": null,
            "stereotype": {
              "browserName": "chrome",
              "browserVersion": "137.0",
              "container:hostname": "lightsail",
              "goog:chromeOptions": {
                "binary": "\u002fusr\u002fbin\u002fgoogle-chrome"
              },
              "platformName": "linux",
              "se:containerName": "lightsail",
              "se:noVncPort": 7900,
              "se:vncEnabled": true
            }
          }
        ]
      }
    ]
  }
}
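The same check can be scripted before any session is created. This is a sketch using only the standard library and the fields visible in the response above (value.ready, nodes, slots, session). The function names are illustrative, and fetch_status assumes the container is reachable at the URL used above.

```python
import json
from urllib.request import urlopen


def grid_is_ready(status: dict) -> bool:
    """Return True if a /status payload reports the grid as ready."""
    return bool(status.get("value", {}).get("ready", False))


def free_slots(status: dict) -> int:
    """Count slots that currently have no session attached."""
    nodes = status.get("value", {}).get("nodes", [])
    return sum(
        1
        for node in nodes
        for slot in node.get("slots", [])
        if slot.get("session") is None
    )


def fetch_status(url: str = "http://localhost:4444/wd/hub/status", timeout: float = 5.0) -> dict:
    """Fetch and decode the status endpoint (requires a running container)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Trimmed-down copy of the response shown above.
sample = {"value": {"ready": True, "nodes": [{"slots": [{"session": None}]}]}}
print(grid_is_ready(sample), free_slots(sample))
```

Against a live container, grid_is_ready(fetch_status()) gives a quick yes/no before calling webdriver.Remote.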

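pyppeteer: fetching web pages (no longer maintained)

Pyppeteer is an unofficial Python port of Puppeteer that controls Chromium asynchronously. It depends on the asyncio library (Python 3.5+) and suits Python developers who want to integrate scraping or automation tasks.

pyppeteer is no longer maintained; use Playwright instead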
https://github.com/pyppeteer/pyppeteer

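Core features of Pyppeteer:

Feature parity with Puppeteer: screenshots, PDF export, form submission, and scraping of dynamic content.
Asynchronous: operations run concurrently on coroutines, which can be more efficient than traditional tools such as Selenium.
Simple deployment: Chromium is downloaded automatically on first run (pip install pyppeteer).

Installation

pip install pyppeteer
pyppeteer-install  # install Chromium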

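requests_html: fetching dynamic web pages

requests_html is a concise Python library that extends the classic requests library with HTML parsing and JavaScript rendering, which makes it well suited to pages whose content is loaded dynamically.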

pip install requests-html

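Core features:

Built-in Pyppeteer rendering engine (no extra configuration needed)
Can scrape content that is loaded asynchronously
Automatically executes the JavaScript in the page

Code example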

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://dynamic-website.com')
r.html.render()  # execute JavaScript and render the dynamic content
print(r.html.html)  # print the full rendered HTML

The first call to render() downloads the Chromium browser automatically.

# On Linux
[INFO] Starting Chromium download.
2025-06-23 11:01:48 [MainThread] INFO: Starting Chromium download.
100%|██████████| 183M/183M [00:04<00:00, 36.8Mb/s]
[INFO] Beginning extraction
2025-06-23 11:01:53 [MainThread] INFO: Beginning extraction
[INFO] Chromium extracted to: /home/centos/.local/share/pyppeteer/local-chromium/1181205
2025-06-23 11:02:09 [MainThread] INFO: Chromium extracted to: /home/centos/.local/share/pyppeteer/local-chromium/1181205

# On macOS
2025-06-23 10:16:13 [MainThread] INFO: Starting Chromium download.
100%|███████████████████████████████████████████| 141M/141M [29:57<00:00, 78.5kb/s]
[INFO] Beginning extraction
2025-06-23 10:46:12 [MainThread] INFO: Beginning extraction
[INFO] Chromium extracted to: /Users/masi/Library/Application Support/pyppeteer/local-chromium/1181205
2025-06-23 10:46:15 [MainThread] INFO: Chromium extracted to: /Users/xxx/Library/Application Support/pyppeteer/local-chromium/1181205

To avoid the download delay on the first render(), the browser can be downloaded in advance inside the container. Add this to the Dockerfile:

RUN python -c "from pyppeteer import chromium_downloader; chromium_downloader.download_chromium()"

DrissionPage web scraper

https://drissionpage.cn/
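DrissionPage is a Python-based web automation tool.
It can control a browser, send and receive network packets directly, and combine the two.
This pairs the convenience of browser automation with the efficiency of requests.
The project describes itself as powerful, with concise syntax, little code required, and friendly to beginners.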

feapder web scraper

https://feapder.com/#/
https://github.com/Boris-code/feapder

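feapder is an easy-to-use Python crawler framework with four built-in spider types (AirSpider, Spider, TaskSpider, BatchSpider) for different scenarios.
It supports resumable crawling, monitoring and alerting, browser rendering, and deduplication of very large data sets.
Its companion crawler management system, feaplat, handles deployment and scheduling.

There are two common approaches to scraping dynamic (Ajax-rendered) pages:

One is to find the underlying API and assemble its parameters. This is more complex but efficient, and it takes some scraping experience.
The other is to render the page in a browser and read the resulting source directly. This is simple and convenient.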
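The first approach usually comes down to reproducing the request that the page itself sends. A hedged sketch using only the standard library follows. The endpoint https://example.com/api/articles and its page/size parameters are invented for illustration; a real site's API has to be discovered in the browser's network panel.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_url(base_url: str, params: dict) -> str:
    """Append query parameters to the endpoint discovered in the network panel."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Call the endpoint directly and decode its JSON body (no browser involved)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical endpoint and parameters, for illustration only.
url = build_api_url("https://example.com/api/articles", {"page": 2, "size": 20})
print(url)
```

Because no browser is started, this style is much lighter on a small machine than browser rendering, at the cost of having to work out the parameters first.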

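Browser rendering: Selenium

The framework has a built-in browser rendering pool with a default size of 1. Browser instances are reused across requests; an instance is destroyed and a new one created only when a proxy fails and the request raises an error.
The built-in browser rendering supports CHROME, EDGE, PHANTOMJS, and FIREFOX.

Browser rendering: Playwright

The framework supports rendering and downloading with playwright; each thread holds its own playwright instance.
Three browsers are supported for rendering: chromium, firefox, and webkit.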

Browserless

https://www.browserless.io/

Locoy Spider (火车采集器)

http://www.locoy.com/

Locoy Spider tutorial
https://greenhathg.github.io/2021/10/08/火车头使用教程/index.html

MediaCrawler: crawler for social media platforms

https://github.com/NanmiCoder/MediaCrawler

Page parsing

Parsing HTML DOM elements with BeautifulSoup

Installation

pip install beautifulsoup4
pip install lxml

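BeautifulSoup(
    markup="",           # the document content to parse
    features=None,       # parser type (can also be passed as the second positional argument)
    builder=None,        # a custom parser instance
    parse_only=None,     # parse only part of the document
    from_encoding=None,  # the original encoding
    exclude_encodings=None,  # encodings to rule out
    **kwargs             # other compatibility arguments
)

The first argument, markup, is the HTML source to parse.
The second argument, features, selects the parser:

BeautifulSoup(markup, 'html.parser'): built into Python, nothing to install, moderate speed.
BeautifulSoup(markup, 'lxml'): fast, requires the lxml library (recommended).
BeautifulSoup(markup, 'xml'): parses XML, fast, also requires lxml. BeautifulSoup itself does not support XPath; use lxml directly for XPath queries.
BeautifulSoup(markup, 'html5lib'): the most lenient, builds a browser-compatible DOM, slow.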

Usage example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<item><name>Test</name></item>', 'lxml')
print(soup.item)

soup.find(id='mainText')  # find the tag with id='mainText'

readability-lxml: extracting the main content of a web page

https://pypi.org/project/readability-lxml/
https://github.com/buriy/python-readability

Installation:

pip install readability-lxml
pip install lxml_html_clean

Usage example:

from bs4 import BeautifulSoup
from readability.readability import Document
from lxml import html as lxml_html  # aliased so the variable below does not shadow the module


html_source = '<h1>Hello, World!</h1><p>This is an example.</p>'
doc = Document(html_source)
title = doc.short_title().strip()
# By default summary() returns a result wrapped in <html>/<body> tags. With html_partial=True those tags are omitted, but the result is still not plain text: it keeps tags such as div and p.
html_content = doc.summary(html_partial=True)

# Extract the text from the HTML with lxml
text_content_lxml = lxml_html.fromstring(html_content).text_content()

# Extract the text from the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
text_content_soup = soup.get_text(separator='\n', strip=True)

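Parameters:

doc = Document(
    html_content,
    url="http://example.com",        # used to resolve relative links
    positive_keywords=["tit", "text", "member"],  # positive keywords: class/id patterns to favor
    negative_keywords=["footer", "nav"],          # negative keywords: class/id patterns to penalize
    retry_length=250,                # retry threshold (lower it to detect very short texts better)
    xpath=False,                     # if True, adds an x="..." attribute holding each node's XPath in the original document
    handle_failures="discard",       # passed to lxml for failure handling; supported values are "discard" (default), "ignore", None
)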

python-readability: extracting the main content of a web page (Firefox Reader View)

https://pypi.org/project/python-readability/

A Python version of the JavaScript library @mozilla/readability
https://www.npmjs.com/package/@mozilla/readability
https://github.com/mozilla/readability

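Differences between python-readability and readability-lxml:

python-readability runs the original JS code directly through node, so it needs a JavaScript runtime (such as Node.js) installed. readability-lxml is implemented in Python.
On the few HTML pages I tested, python-readability produced better results.
The node process that python-readability uses to parse pages has shown excessive CPU usage and hangs.

Installation

pip install python-readability

Usage example: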

from readability import parse

result = parse(html)
title = result.title
content = result.text_content

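The result returned by parse() has the same structure as that of parse() in the original JavaScript @mozilla/readability.

python-readability: node process with high CPU usage and hangs

This library has a serious problem: when parsing certain HTML pages, the readability node process runs at high CPU and hangs, holding one full CPU core indefinitely.
I eventually had to stop using it and switch to readability-lxml.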

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24283 root      20   0  673784  96612   3588 R 98.7  5.1  19:56.03 /usr/bin/node /usr/local/lib/python3.11/site-packages/readability/impl/stdio-worker.js

GeneralNewsExtractor (GNE): parsing web pages

pip install gne
GNE: a general-purpose main-content extractor for news sites
https://github.com/GeneralNewsExtractor/GeneralNewsExtractor
https://generalnewsextractor.readthedocs.io/zh-cn/latest/

GNE takes HTML source as input and cannot request pages itself, so the page's HTML has to be downloaded first.

Online demo: https://gne.kingname.info/

User-Agent parsing

https://pypi.org/project/user-agents/
https://github.com/selwin/python-user-agents

pip install user-agents

Packages installed automatically:

PyYAML==6.0.3
ua-parser==1.0.1
user-agents==2.2.0

The parsing is not very accurate. For example, every user agent listed below belongs to a search engine crawler, yet is_bot is False for many of them.

from user_agents import parse

def parse_user_agents(user_agent):
    ua_parsed = parse(user_agent)
    print("=" * 50)
    print(f"user agent: {user_agent}")
    print(f"browser: {ua_parsed.browser}")
    print(f"browser family: {ua_parsed.browser.family}")
    print(f"browser version: {ua_parsed.browser.version}")

    print(f"os: {ua_parsed.os}")
    print(f"os family: {ua_parsed.os.family}")
    print(f"os version: {ua_parsed.os.version}")

    print(f"device: {ua_parsed.device}")
    print(f"device family: {ua_parsed.device.family}")
    print(f"device brand: {ua_parsed.device.brand}")
    print(f"device model: {ua_parsed.device.model}")

    print(f"is mobile: {ua_parsed.is_mobile}")
    print(f"is tablet: {ua_parsed.is_tablet}")
    print(f"is touch capable: {ua_parsed.is_touch_capable}")
    print(f"is pc: {ua_parsed.is_pc}")
    print(f"is bot: {ua_parsed.is_bot}")


if __name__ == '__main__':
    user_agents = [
        'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.120 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)',
        'Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)',
        'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/79.0.3945.120 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)',
        'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/136.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)',
    ]
    for user_agent in user_agents:
        parse_user_agents(user_agent)
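Since is_bot misses many of these crawlers, a simple keyword check can serve as a fallback. The sketch below is a heuristic of my own, not part of the user-agents library. The token list (bot, spider, crawl) is an assumption and is far from complete, and the third sample is an ordinary Chrome user agent added for contrast.

```python
import re

# Tokens that commonly appear in crawler user agents; an illustrative, incomplete list.
BOT_PATTERN = re.compile(r"bot|spider|crawl", re.IGNORECASE)


def looks_like_bot(user_agent: str) -> bool:
    """Heuristic: True if the user agent contains a typical crawler token."""
    return bool(BOT_PATTERN.search(user_agent))


samples = [
    "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
]
for ua in samples:
    print(looks_like_bot(ua), ua[:60])
```

A substring match like this can produce false positives (any device or browser name that happens to contain "bot", for instance), so it works best combined with the parsed result rather than as a replacement for it.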

Overview of Google crawlers and fetchers (user agents)

https://developers.google.com/crawling/docs/crawlers-fetchers/overview-google-crawlers?hl=zh-cn
