A Hands-On Quick Reference to Python's Four Scraping Tools: Requests, Beautiful Soup, Scrapy, Selenium

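When you sit a review quiz after finishing the scraping basics, the weakness that shows up most often is not a forgotten API call but not knowing which tool fits a real scenario. Requests, Beautiful Soup, Scrapy, and Selenium each have a clear home turf; mixing them up or picking the wrong one makes simple jobs complicated and complicated jobs unmanageable. This article walks through the core usage and the limits of all four with runnable code, as a last round of hands-on review before your quiz.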

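Requests: get the page back first

Requests is the first link in the scraping chain: without a response body there is nothing to parse. It is at its best on static pages, REST APIs, and requests that need custom headers or cookies.

A common mistake is to start pulling data out of the HTML with raw string manipulation as soon as it arrives. Don't; hand the HTML to a parsing library.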

import requests

url = "https://books.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0 (compatible; PyScraper/1.0)"}

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx so failures are not silent

print(resp.status_code, len(resp.text))

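A few practical reminders:

Always set a timeout: Requests has no default timeout, so an unresponsive server can hang the call indefinitely.
raise_for_status(): more reliable than checking resp.status_code == 200 by hand, because it raises on every 4xx and 5xx response.
Reuse a Session: for repeated requests to one site, requests.Session() keeps connections alive through its connection pool and persists cookies automatically.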

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; PyScraper/1.0)"})

# Log in first; later requests on this session send the cookies automatically
session.post("https://example.com/login", data={"user": "test", "pass": "test"})
page = session.get("https://example.com/dashboard")
print(page.text[:200])
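
The reminders above can be folded into one small helper. This is a minimal sketch, not a definitive implementation: the names fetch_html and DEFAULT_TIMEOUT are hypothetical, and the timeout values are illustrative assumptions. It relies on two documented Requests behaviors: timeout accepts a (connect, read) tuple, and every error Requests raises derives from requests.exceptions.RequestException.

```python
import requests

# Assumed values: (connect timeout, read timeout) in seconds
DEFAULT_TIMEOUT = (3.05, 10)


def fetch_html(url, session=None, timeout=DEFAULT_TIMEOUT):
    """Return the response body as text, or None if the request failed."""
    # Use the caller's Session when given, otherwise the module-level API
    client = session if session is not None else requests
    try:
        resp = client.get(url, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx into an exception
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection errors, bad URLs, and HTTP errors
        print(f"request failed for {url!r}: {exc}")
        return None
    return resp.text
```

A call such as fetch_html("https://books.toscrape.com/") would return the page text on success. Returning None on failure is one design choice among several; re-raising may suit scripts that should stop on the first error.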

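Beautiful Soup: pull structured data out of HTML

Once Requests has fetched the page, Beautiful Soup turns the HTML into a tree of objects you can query. It is not the fastest parser, but its API is the most intuitive, which suits small to medium jobs on pages with a fairly stable structure.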

from bs4 import BeautifulSoup

html = resp.text  # continues from the Requests snippet above
soup = BeautifulSoup(html, "lxml")  # needs the lxml package installed; fast and lenient with broken markup

# Extract every book title and price
for book in soup.select("article.product_pod"):
    title = book.h3.a["title"]
    price = book.select_one(".price_color").text
    print(f"{title} — {price}")

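Suggested selector choices:

Scenario	Recommended approach
Known class or id	`soup.select()` / `select_one()` (CSS selectors)
Walking the tag hierarchy	`soup.find()` / `find_all()`
Need a parent or sibling node	Navigation attributes such as `.parent` and `.next_sibling`

Pitfalls that quizzes like to test: find_all returns a list-like ResultSet, not a single element; and when select_one finds nothing it returns None, so calling .text on the result raises AttributeError. Always check for None first.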
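
The None check is easier to remember once seen in code. The sketch below is self-contained and makes a few assumptions: SAMPLE_HTML is invented markup in which the second book has no price, text_or_none is a hypothetical helper, and the built-in html.parser is used so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented sample: the second article deliberately lacks a price element
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a title="Book A">Book A</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a title="Book B">Book B</a></h3>
</article>
"""


def text_or_none(node):
    """Return the stripped text of a tag, or None when the lookup found nothing."""
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for book in soup.select("article.product_pod"):
    link = book.select_one("h3 a")
    rows.append({
        "title": link.get("title") if link is not None else None,
        "price": text_or_none(book.select_one(".price_color")),
    })

print(rows)
```

The second row ends up with a price of None instead of stopping the loop with an AttributeError, which is usually what you want when a few records on a page are incomplete.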

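Scrapy: when you need to crawl a whole site

Requests plus BS4 is enough for a single page, but once you are following 500 pages of pagination, extracting 10 kinds of fields, and also storing results in a database and deduplicating them, it is time for Scrapy. It is a framework, not a library: it ships with a scheduler, item pipelines, middlewares, and an AutoThrottle extension for automatic rate limiting (disabled by default).

A minimal runnable Scrapy project:

# Create the project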
scrapy startproject bookscraper
cd bookscraper
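
The generated project contains a settings.py file, which is where rate limiting is configured. The fragment below is a sketch of what that could look like; the setting names are standard Scrapy settings, but every value is an illustrative assumption and should be tuned for the target site.

```python
# bookscraper/settings.py (fragment; all values are illustrative assumptions)
ROBOTSTXT_OBEY = True                  # respect the site's robots.txt rules
DOWNLOAD_DELAY = 0.5                   # base delay in seconds between requests to one site
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True            # switch on the AutoThrottle extension
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # upper bound on the delay when responses are slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests AutoThrottle aims for
```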

Edit bookscraper/spiders/books_spider.py:

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
            }

        # Follow pagination automatically
        next_page = response.css(".next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

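Run it and export JSON (-o appends to an existing file, so delete books.json before rerunning, or use -O to overwrite on Scrapy versions that support it):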

scrapy crawl books -o books.json
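
Each dict the spider yields passes through any enabled item pipelines before it is exported. As a rough sketch of that stage, the class below converts the scraped price text into a number. PriceCleanerPipeline is a hypothetical name, and the sketch assumes the classic process_item(item, spider) signature and dict items like the ones the spider above yields.

```python
import re


class PriceCleanerPipeline:
    """Hypothetical pipeline: turn a price string such as '£51.77' into a float."""

    _number = re.compile(r"\d+(?:\.\d+)?")

    def process_item(self, item, spider):
        raw = item.get("price")
        match = self._number.search(raw) if raw else None
        # Keep None when no number is found rather than guessing a value
        item["price"] = float(match.group()) if match else None
        return item


# To enable it, settings.py would need an entry along these lines:
# ITEM_PIPELINES = {"bookscraper.pipelines.PriceCleanerPipeline": 300}

# Pipelines are plain classes, so they can be exercised without running a crawl
cleaned = PriceCleanerPipeline().process_item(
    {"title": "Sample Book", "price": "£51.77"}, spider=None
)
print(cleaned)
```

Because a pipeline is an ordinary class, cleaning logic like this can be unit tested on its own, separately from the network-facing spider.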

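Scrapy quiz questions usually focus on:

The yield mechanism: dicts yielded from parse go to the item pipelines, Requests yielded from parse go to the scheduler, and a single callback can yield both.
response.follow(): resolves relative URLs against the current page for you, which is less error-prone than joining URLs by hand.
Deduplication: by default Scrapy filters duplicates using a fingerprint computed from each request (chiefly its URL, method, and body), so a repeated request is never sent; pass dont_filter=True to force it through.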
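
To make the last two points concrete, here is a deliberately simplified stand-in written with only the standard library. It is not Scrapy's implementation: Scrapy fingerprints whole requests, while this sketch compares absolute URLs as plain strings. resolve_and_filter is a hypothetical name.

```python
from urllib.parse import urljoin


def resolve_and_filter(base_url, hrefs, seen=None):
    """Resolve hrefs against base_url and drop URLs that were already seen."""
    seen = set() if seen is None else seen
    fresh = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # same idea as response.follow()
        if absolute in seen:
            continue  # a duplicate request would be filtered here
        seen.add(absolute)
        fresh.append(absolute)
    return fresh


links = resolve_and_filter(
    "https://books.toscrape.com/catalogue/page-2.html",
    ["page-3.html", "page-3.html", "../index.html"],
)
print(links)
```

The relative link page-3.html resolves inside /catalogue/ because the base URL is the current page, not the site root. Getting that wrong is the usual bug when URLs are concatenated by hand.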

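Selenium: when the page needs JavaScript to show its content

None of the first three tools can see the DOM after JavaScript has rendered it. If the target page's data is filled in asynchronously by front-end scripts, Selenium is the most direct fix: it drives a real browser and reads the result once the scripts have run.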

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # headless mode, so it also runs on a server without a display
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)
driver.get("https://quotes.toscrape.com/js/")

# Wait for a specific element to appear instead of a hard-coded sleep
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))

quotes = driver.find_elements(By.CLASS_NAME, "quote")
for q in quotes:
    text = q.find_element(By.CLASS_NAME, "text").text
    author = q.find_element(By.CLASS_NAME, "author").text
    print(f"{author}: {text}")

driver.quit()

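Details to watch in both quizzes and real work:

Use WebDriverWait rather than time.sleep(): a fixed 5-second sleep is sometimes too short and sometimes wasted, while an explicit wait continues as soon as the element appears.
Always call driver.quit(), ideally in a finally block so it runs even when extraction raises: otherwise Chrome processes are left running, and after a few hundred runs they can exhaust the machine's memory.
Pagination: clicking the next-page button changes the DOM, so call wait.until again before extracting.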
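
One way to make sure quit() always runs is to wrap driver creation in a context manager. The sketch below is an assumption-laden illustration: managed_driver is a hypothetical helper, and FakeDriver is a stand-in that exists only so the cleanup can be demonstrated without launching a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver via factory() and guarantee that quit() runs afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs on normal exit and when the body raises


class FakeDriver:
    """Stand-in used only to demonstrate the cleanup without a browser."""

    def __init__(self):
        self.quit_calls = 0

    def quit(self):
        self.quit_calls += 1


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError("simulated extraction failure")
except RuntimeError:
    pass

print(fake.quit_calls)  # prints 1: quit() ran despite the exception
```

With real Selenium the factory would be something like lambda: webdriver.Chrome(options=options). Recent Selenium versions also allow the driver object itself to be used in a with statement, which achieves the same cleanup.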

Tool selection checklist

In a quiz, points are lost less on code details than on judging which tool to use. Here is a simple decision path:

Need to crawl a whole site at scale?
  ├─ Yes
  │    ├─ Static pages → Scrapy (using its built-in Selector)
  │    └─ Pages need JS → Scrapy + a Selenium middleware (scrapy-selenium)
  └─ No
       ├─ Static HTML → Requests + Beautiful Soup
       └─ Pages need JS rendering → Selenium
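
The decision path above can also be written as a small function, which makes the branches easy to quiz yourself on. This is only a sketch: choose_tool and its parameter names are hypothetical, and the has_json_api shortcut reflects the advice in this section to call a JSON API directly when one exists.

```python
def choose_tool(needs_js: bool, whole_site: bool, has_json_api: bool = False) -> str:
    """Map the checklist questions to a recommended tool combination."""
    if has_json_api:
        # A direct API call beats rendering the page in a browser
        return "Requests (call the JSON API directly)"
    if whole_site:
        return "Scrapy + scrapy-selenium" if needs_js else "Scrapy"
    return "Selenium" if needs_js else "Requests + Beautiful Soup"


for case in [(False, False), (True, False), (False, True), (True, True)]:
    print(case, "->", choose_tool(*case))
```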

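A few boundaries that are easy to trip over:

Scrapy is not a drop-in replacement for Requests: using Scrapy for a single page is over-engineering, and the framework overhead (project layout, configuration) is not worth it.
Selenium is not a cure-all: if an API endpoint returns the JSON directly, don't use Selenium to click buttons; calling requests.get(api_url) skips browser startup and rendering entirely and is often faster by an order of magnitude or more.
Don't mix Beautiful Soup with Scrapy selectors: response.css() and response.xpath() are already backed by lxml (through the parsel library), so passing the same HTML to BS4 is a redundant second parse.

Before the quiz, run each of the code snippets above once and look at the output; it is far more effective than rereading notes. The real pitfalls in scraping are not in the API calls but in timeouts, encodings, anti-bot measures, and DOM changes, and you only run into those by running the code.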