Reposted from: Sharing a few borderline-criminal tools
I've been playing with a rather borderline project lately. The target site relies heavily on JS rendering, so there's no simple way to grab the data anymore. Some folks in the group chat discussed these renderers a while back, so here's a quick summary.
Introduction
The four most widely used tools at the moment:
- Playwright[1]
- Splash[2]
- Selenium[3]
- Puppeteer[4]
Each of these four tools can be used on its own. Their original design goals were presumably entirely lawful (mainly automated testing), but users proved too clever and put them to shadier uses.
This post is a quick look at combining these four tools with Scrapy[5]:
- Scrapy Playwright[6]
- Scrapy Splash[7]
- Scrapy Selenium[8]
- Scrapy Puppeteer[9]
Scrapy is a formidable tool in its own right, written in Python, so these "plugins" are also written in Python and use each tool's Python client.
Scrapy Playwright
A very capable tool. Recommended!
Installation
pip install scrapy-playwright
playwright install
Configuration
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
Usage
# spiders/quotes.py
import scrapy
from scrapy_playwright_demo.items import QuoteItem
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
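One thing to watch in `parse()`: the item should be created inside the loop. If you reuse a single mutable object and anything downstream holds on to the references, every record ends up with the last quote's values. A plain-Python illustration of the pitfall, using dicts instead of Scrapy items:

```python
# Reusing one dict: every stored reference points at the same object,
# so later mutations show through in all of them.
shared = {}
bad = []
for text in ["quote-1", "quote-2"]:
    shared["text"] = text
    bad.append(shared)
print([d["text"] for d in bad])   # ['quote-2', 'quote-2']

# A fresh dict per iteration keeps each record independent.
good = []
for text in ["quote-1", "quote-2"]:
    good.append({"text": text})
print([d["text"] for d in good])  # ['quote-1', 'quote-2']
```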
Features
It covers basically everything you'll need:
- Waiting for elements to load before returning the response
- Scrolling the page
- Clicking page elements
- Taking page screenshots (very handy; I used it to replace pyecharts' image export)
- Generating a PDF of the page
- Using proxies
- Creating browser contexts
- And more
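Most of these features are toggled through scrapy-playwright settings plus per-request meta. A minimal settings sketch, assuming scrapy-playwright's documented setting names (the proxy endpoint is a made-up placeholder):

```python
# settings.py -- a sketch; requires scrapy-playwright installed
PLAYWRIGHT_BROWSER_TYPE = "chromium"  # or "firefox" / "webkit"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    # hypothetical proxy endpoint, replace with your own
    "proxy": {"server": "http://myproxy.example:8080"},
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000  # milliseconds
```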
Scrapy Splash
Splash is a lightweight browser that runs as an HTTP server; you render pages by sending URLs to its HTTP API.
These days Splash feels a bit dated, overtaken by headless browsers like Playwright and Puppeteer. But a rotten ship still has three pounds of nails in it: Splash remains a very capable headless browser for web scraping.
Using Splash means running its Docker image, which is slightly more hassle.
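Because Splash is just an HTTP service, you can also talk to it without any plugin at all. A stdlib-only sketch that builds a request URL for Splash's render.html endpoint (the host and port assume a local container mapped to 8050):

```python
from urllib.parse import urlencode

SPLASH = "http://localhost:8050"  # assumes a local Splash container on 8050

def render_url(target, wait=2.0):
    """Build a render.html URL asking Splash to fetch `target`,
    wait `wait` seconds for JS, and return the rendered HTML."""
    query = urlencode({"url": target, "wait": wait})
    return f"{SPLASH}/render.html?{query}"

print(render_url("https://quotes.toscrape.com/js/"))
# http://localhost:8050/render.html?url=https%3A%2F%2Fquotes.toscrape.com%2Fjs%2F&wait=2.0
```

Fetching that URL (e.g. with urllib or requests) returns the post-JS HTML, which you can then parse as usual.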
Installation
- Pull the Splash Docker image
docker pull scrapinghub/splash
- Run Splash
docker run -it -p 8050:8050 --rm scrapinghub/splash
- Install the plugin
pip install scrapy-splash
Configuration
# settings.py
# Splash Server Endpoint
SPLASH_URL = 'http://192.168.59.103:8050'
# Enable Splash downloader middleware and change HttpCompressionMiddleware priority
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Enable Splash Deduplicate Args Filter
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Define the Splash DupeFilter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Usage
# spiders/quotes.py
import scrapy
from demo.items import QuoteItem
from scrapy_splash import SplashRequest
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
Features
It also covers basically everything you'll need:
- Waiting for page elements to load
- Scrolling the page
- Clicking page elements
- Taking screenshots
- Disabling images or applying Adblock rules to speed up rendering
- Thorough documentation, battle-tested across plenty of scrapers, and Zyte offers hosted Splash instances so you don't have to manage the browser yourself
Scrapy Selenium
Selenium has long been the most popular headless browser for web scraping (especially in Python). Since the arrival of Puppeteer and Playwright, however, it has gradually fallen out of favor.
On top of that, scrapy-selenium's last update was three years ago.
Installation
pip install scrapy-selenium
Download the driver (make sure the version matches your browser) and put it somewhere sensible:
├── scrapy.cfg
├── chromedriver.exe ## <-- dropping it here is convenient
└── myproject
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
└── __init__.py
Configuration
## settings.py
# for chrome driver
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}
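If you'd rather drive Firefox, the scrapy-selenium README shows an equivalent setup; a sketch (geckodriver must be downloaded separately and be on your PATH):

```python
# settings.py -- Firefox variant, per the scrapy-selenium README
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # note: single dash for Firefox
```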
Usage
# spiders/quotes.py
import scrapy
from selenium_demo.items import QuoteItem
from scrapy_selenium import SeleniumRequest
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
Features
Middle of the road: it covers most of what the tools above offer.
Scrapy Puppeteer
This was once a promising project, but it was archived a couple of years ago and is no longer maintained.
Installation
pip install scrapy-pyppeteer
Configuration
## settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
    "https": "scrapy_pyppeteer.handler.ScrapyPyppeteerDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
Usage
# spiders/quotes.py
import scrapy
from pyppeteer_demo.items import QuoteItem
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield scrapy.Request(url=url, callback=self.parse, meta={"pyppeteer": True})

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
Features
Public archive.
Finally, a friendly reminder: don't do anything too criminal out there!
References
[1] Playwright: https://github.com/microsoft/playwright
[2] Splash: https://github.com/scrapinghub/splash
[3] Selenium: https://github.com/SeleniumHQ/selenium
[4] Puppeteer: https://github.com/puppeteer/puppeteer
[5] Scrapy: https://github.com/scrapy/scrapy
[6] Scrapy Playwright: https://github.com/scrapy-plugins/scrapy-playwright
[7] Scrapy Splash: https://github.com/scrapy-plugins/scrapy-splash
[8] Scrapy Selenium: https://github.com/clemfromspace/scrapy-selenium
[9] Scrapy Puppeteer: https://github.com/ispras/scrapy-puppeteer