first commit

This commit is contained in:
l0tk3 2024-07-15 16:33:05 +08:00
commit 76bd37dd11
128 changed files with 11672 additions and 0 deletions

3
.gitattributes vendored Normal file
View file

@ -0,0 +1,3 @@
*.js linguist-language=python
*.css linguist-language=python
*.html linguist-language=python

17
.github/workflows/main.yaml vendored Normal file
View file

@ -0,0 +1,17 @@
on:
push:
branches:
- main
jobs:
contrib-readme-job:
runs-on: ubuntu-latest
name: A job to automate contrib in readme
permissions:
contents: write
pull-requests: write
steps:
- name: Contribute List
uses: akhilmhdh/contributors-readme-action@v2.3.10
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

175
.gitignore vendored Normal file
View file

@ -0,0 +1,175 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
*.xml
*.iml
.idea
/temp_image/
/browser_data/
/data/
*/.DS_Store
.vscode
#New add
test_parse.py
test_soup.py
test.htmlcov

28
LICENSE Normal file
View file

@ -0,0 +1,28 @@
Non-Commercial Use License 1.0
Copyright (c) [2024] [relakkes@gmail.com]
Whereas:
1. the copyright holder owns and controls the copyright in this software and its associated documentation files (the "Software");
2. the user wishes to use the Software; and
3. the copyright holder is willing to license the Software to the user on the terms set out below;
the parties, in accordance with the applicable laws and regulations, agree as follows:
Scope of the license:
1. The copyright holder hereby grants, free of charge, to any natural or legal person who accepts this license (the "User") a non-exclusive, non-transferable right to use, copy, modify and merge the Software for non-commercial purposes, subject to the conditions below.
Conditions:
1. The User must include the above copyright notice and this license notice in all reasonably prominent places in the Software and its copies.
2. The Software may not be used for any commercial purpose, including but not limited to sale, profit-making or commercial competition.
3. The Software may not be put to any commercial use without the written consent of the copyright holder.
Disclaimer:
1. The Software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and non-infringement.
2. In no event shall the copyright holder be liable for any direct, indirect, incidental, special, exemplary or consequential damages (including but not limited to procurement of substitute goods or services; loss of use, data or profits; or business interruption) arising in any way out of the use of the Software, however caused and under any theory of liability, whether in contract, strict liability or tort (including negligence or otherwise), even if advised of the possibility of such damages.
Governing law:
1. This license shall be interpreted and enforced in accordance with the local laws and regulations.
2. Any dispute arising out of or in connection with this license shall first be settled through friendly negotiation; failing that, either party may submit the dispute to the people's court at the place where the copyright holder is located.
This license constitutes the entire agreement between the parties concerning the Software and supersedes all prior discussions, communications and agreements, whether oral or written.

405
README.md Normal file
View file

@ -0,0 +1,405 @@
> **Disclaimer:**
>
> Please use this repository for learning purposes only. Cases of illegal crawling prosecuted in China: https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China <br>
>
> All content in this repository is for learning and reference only and commercial use is forbidden. No person or organization may use it for illegal purposes or to infringe the legitimate rights and interests of others. The crawling techniques involved are for study and research only and must not be used for large-scale crawling of other platforms or any other illegal activity. This repository accepts no liability for any legal consequences arising from the use of its content; by using it you agree to all terms of this disclaimer.
> A more detailed disclaimer is available. [Jump to it](#disclaimer)
# Repository description
**Xiaohongshu crawler**, **Douyin crawler**, **Kuaishou crawler**, **Bilibili crawler**, **Weibo crawler**...
It can currently collect videos, images, comments, likes, reposts and related information from Xiaohongshu, Douyin, Kuaishou, Bilibili and Weibo.
How it works: [playwright](https://playwright.dev/) keeps the browser context alive after a successful login, and the encrypted request parameters are obtained by evaluating JS expressions inside that logged-in context.
This removes the need to re-implement the platforms' core encryption JS, which greatly lowers the reverse-engineering barrier.
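A minimal sketch of that idea (not the project's actual implementation; the target URL and the evaluated expression are only examples):
```python
# Minimal sketch of the approach: keep a logged-in, persistent browser
# context with Playwright and read values / run JS inside it instead of
# re-implementing the platform's signing code. Illustrative only.
import asyncio
from playwright.async_api import async_playwright


async def demo():
    async with async_playwright() as p:
        # A persistent context keeps cookies/localStorage between runs,
        # which is how the cached login state survives restarts.
        context = await p.chromium.launch_persistent_context(
            user_data_dir="./browser_data/demo_user_data_dir",
            headless=False,
        )
        page = await context.new_page()
        await page.goto("https://www.bilibili.com")
        # Evaluate a JS expression inside the logged-in page, e.g. read
        # values that the site's own scripts computed.
        local_storage = await page.evaluate("() => window.localStorage")
        print(list(local_storage.keys())[:5])
        await context.close()


asyncio.run(demo())
```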
## Feature list
| Platform | Keyword search | Crawl by post ID | Second-level comments | Creator homepage | Cached login state | IP proxy pool | Comment word cloud |
|-------------|------|------|------|------|------|------|------|
| Xiaohongshu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Douyin | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Kuaishou | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Bilibili | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Weibo | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
## Usage
### Create and activate a Python virtual environment
```shell
# enter the project root
cd MediaCrawler
# create a virtual environment
# note: Python 3.7 - 3.9 is required; newer versions may run into dependency compatibility issues
python -m venv venv
# macOS & Linux: activate it
source venv/bin/activate
# Windows: activate it
venv\Scripts\activate
```
### Install the dependencies
```shell
pip install -r requirements.txt
```
### Install the Playwright browser drivers
```shell
playwright install
```
### Run the crawler
```shell
### Comment crawling is disabled by default; to enable it, set ENABLE_GET_COMMENTS in config/base_config.py
### Other options are also documented (with Chinese comments) in config/base_config.py
# Search posts by the keywords in the config file and crawl their info and comments
python main.py --platform xhs --lt qrcode --type search
# Crawl the info and comments of the post IDs listed in the config file
python main.py --platform xhs --lt qrcode --type detail
# Open the corresponding app and scan the QR code to log in
# For examples on the other platforms, run:
python main.py --help
```
### Saving data
- Save to a relational database (MySQL, PostgreSQL, ...)
  - Run `python db.py` first to initialize the database schema (only needed once)
- Save to CSV files (under the data/ directory)
- Save to JSON files (under the data/ directory; see the sketch below for a quick way to inspect them)
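A quick way to peek at the saved JSON output, as a sketch; the exact file layout under data/ depends on the platform and crawl type:
```python
# Sketch: inspect data saved with SAVE_DATA_OPTION = "json". The exact file
# names under data/ depend on the platform and crawl type, so this just
# walks everything it finds.
import glob
import json

for path in glob.glob("data/**/*.json", recursive=True):
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    print(f"{path}: {len(records)} records")
```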
## Developer services
- Knowledge Planet (知识星球): curated FAQs, best-practice docs and years of programming + crawling experience, offered as a paid Q&A community; ask questions and the author answers regularly (roughly 1 RMB per day).
<p>
<img alt="xingqiu" src="https://nm.zizhi1.com/static/img/8e1312d1f52f2e0ff436ea7196b4e27b.15555424244122T1.webp" style="width: auto;height: 400px" >
</p>
Selected Planet articles:
- [(Original) Getting Douyin's a_bogus parameter with Playwright, including an analysis of the encrypted parameters](https://articles.zsxq.com/id_u89al50jk9x0.html)
- [(Original) A low-cost way to get Xiaohongshu's X-s parameter with Playwright (analysis notes)](https://articles.zsxq.com/id_u4lcrvqakuc7.html)
- [MediaCrawler: refactoring the project cache around abstract classes](https://articles.zsxq.com/id_4ju73oxewt9j.html)
- [Build your own IP proxy pool, step by step](https://articles.zsxq.com/id_38fza371ladm.html)
- MediaCrawler video course
> If you want to get up to speed with this project quickly, or understand how it is implemented, I recommend the video course. It starts from the design and walks you through the usage step by step, which greatly lowers the entry barrier; buying it also supports the open-source work, which I would really appreciate.<br>
> The course is very cheap, about the price of a few cups of coffee.<br>
> Course introduction (Feishu doc): https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh
## Thanks to the following sponsors
- Thanks to [JetBrains](https://www.jetbrains.com/?from=gaowei-space/markdown-blog) for supporting this project!
<a href="https://www.jetbrains.com/?from=NanmiCoder/MediaCrawler" target="_blank">
<img src="https://resources.jetbrains.com/storage/products/company/brand/logos/jb_beam.png" width="100" height="100">
</a>
<br>
- <a href="https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk">Sign up for this free GPT assistant to help me earn GPT-4 credits; it is also the Chrome AI assistant plugin I use every day.</a>
Become a sponsor and show your product here; contact the author: relakkes@gmail.com
## MediaCrawler discussion group
> Scan my personal WeChat QR code below and add the note "github" to be pulled into the MediaCrawler discussion group (the "github" note is required; a WeChat helper bot adds you automatically).
>
> If the image does not load, add my WeChat ID directly: yzglan
<div style="max-width: 200px">
<p><img alt="relakkes_wechat" src="static/images/relakkes_weichat.JPG" style="width: 200px;height: 100%" ></p>
</div>
## Common runtime problems (Q&A)
> When something breaks, try searching for a solution yourself first; AI tools such as ChatGPT solve most issues. [Free ChatGPT](https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk)
➡️➡️➡️ [FAQ](docs/常见问题.md)
Logging in to Douyin and Xiaohongshu via Playwright currently triggers slider CAPTCHAs and SMS verification; complete them manually.
## Project code structure
➡️➡️➡️ [Code structure notes](docs/项目代码结构.md)
## Proxy IP usage
➡️➡️➡️ [Proxy IP usage notes](docs/代理使用.md)
## Word cloud notes
➡️➡️➡️ [Word cloud notes](docs/关于词云图相关操作.md)
## Phone-number login
➡️➡️➡️ [Phone-number login notes](docs/手机号登录说明.md)
## Donations
Free open source is not easy. If this project has helped you, consider leaving a tip; your support is my biggest motivation!
<div style="display: flex;justify-content: space-between;width: 100%">
<p><img alt="打赏-微信" src="static/images/wechat_pay.jpeg" style="width: 200px;height: 100%" ></p>
<p><img alt="打赏-支付宝" src="static/images/zfb_pay.png" style="width: 200px;height: 100%" ></p>
</div>
## Crawler tutorial
My new crawler tutorial repository: [CrawlerTutorial](https://github.com/NanmiCoder/CrawlerTutorial). Feel free to follow it; it is updated continuously and completely free.
## Contributors
> Thank you for your contributions; they make the project better. Frequent contributors can add me on WeChat to join my Knowledge Planet for free, with more perks to come.
<!-- readme: contributors -start -->
<table>
<tbody>
<tr>
<td align="center">
<a href="https://github.com/NanmiCoder">
<img src="https://avatars.githubusercontent.com/u/47178017?v=4" width="100;" alt="NanmiCoder"/>
<br />
<sub><b>程序员阿江-Relakkes</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/leantli">
<img src="https://avatars.githubusercontent.com/u/117699758?v=4" width="100;" alt="leantli"/>
<br />
<sub><b>leantli</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Rosyrain">
<img src="https://avatars.githubusercontent.com/u/116946548?v=4" width="100;" alt="Rosyrain"/>
<br />
<sub><b>Rosyrain</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/BaoZhuhan">
<img src="https://avatars.githubusercontent.com/u/140676370?v=4" width="100;" alt="BaoZhuhan"/>
<br />
<sub><b>Bao Zhuhan</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/nelzomal">
<img src="https://avatars.githubusercontent.com/u/8512926?v=4" width="100;" alt="nelzomal"/>
<br />
<sub><b>zhounan</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Hiro-Lin">
<img src="https://avatars.githubusercontent.com/u/40111864?v=4" width="100;" alt="Hiro-Lin"/>
<br />
<sub><b>HIRO</b></sub>
</a>
</td>
</tr>
<tr>
<td align="center">
<a href="https://github.com/PeanutSplash">
<img src="https://avatars.githubusercontent.com/u/98582625?v=4" width="100;" alt="PeanutSplash"/>
<br />
<sub><b>PeanutSplash</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Ermeng98">
<img src="https://avatars.githubusercontent.com/u/55784769?v=4" width="100;" alt="Ermeng98"/>
<br />
<sub><b>Ermeng</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/henryhyn">
<img src="https://avatars.githubusercontent.com/u/5162443?v=4" width="100;" alt="henryhyn"/>
<br />
<sub><b>Henry He</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Akiqqqqqqq">
<img src="https://avatars.githubusercontent.com/u/51102894?v=4" width="100;" alt="Akiqqqqqqq"/>
<br />
<sub><b>leonardoqiuyu</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/jayeeliu">
<img src="https://avatars.githubusercontent.com/u/77389?v=4" width="100;" alt="jayeeliu"/>
<br />
<sub><b>jayeeliu</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/ZuWard">
<img src="https://avatars.githubusercontent.com/u/38209256?v=4" width="100;" alt="ZuWard"/>
<br />
<sub><b>ZuWard</b></sub>
</a>
</td>
</tr>
<tr>
<td align="center">
<a href="https://github.com/Zzendrix">
<img src="https://avatars.githubusercontent.com/u/154900254?v=4" width="100;" alt="Zzendrix"/>
<br />
<sub><b>Zendrix</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/chunpat">
<img src="https://avatars.githubusercontent.com/u/19848304?v=4" width="100;" alt="chunpat"/>
<br />
<sub><b>zhangzhenpeng</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/tanpenggood">
<img src="https://avatars.githubusercontent.com/u/37927946?v=4" width="100;" alt="tanpenggood"/>
<br />
<sub><b>Sam Tan</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/xbsheng">
<img src="https://avatars.githubusercontent.com/u/56357338?v=4" width="100;" alt="xbsheng"/>
<br />
<sub><b>xbsheng</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/yangrq1018">
<img src="https://avatars.githubusercontent.com/u/25074163?v=4" width="100;" alt="yangrq1018"/>
<br />
<sub><b>Martin</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/zhihuiio">
<img src="https://avatars.githubusercontent.com/u/165655688?v=4" width="100;" alt="zhihuiio"/>
<br />
<sub><b>zhihuiio</b></sub>
</a>
</td>
</tr>
<tr>
<td align="center">
<a href="https://github.com/renaissancezyc">
<img src="https://avatars.githubusercontent.com/u/118403818?v=4" width="100;" alt="renaissancezyc"/>
<br />
<sub><b>Ren</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Tianci-King">
<img src="https://avatars.githubusercontent.com/u/109196852?v=4" width="100;" alt="Tianci-King"/>
<br />
<sub><b>Wang Tianci</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Styunlen">
<img src="https://avatars.githubusercontent.com/u/30810222?v=4" width="100;" alt="Styunlen"/>
<br />
<sub><b>Styunlen</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Schofi">
<img src="https://avatars.githubusercontent.com/u/33537727?v=4" width="100;" alt="Schofi"/>
<br />
<sub><b>Schofi</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Klu5ure">
<img src="https://avatars.githubusercontent.com/u/166240879?v=4" width="100;" alt="Klu5ure"/>
<br />
<sub><b>Klu5ure</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/keeper-jie">
<img src="https://avatars.githubusercontent.com/u/33612777?v=4" width="100;" alt="keeper-jie"/>
<br />
<sub><b>Kermit</b></sub>
</a>
</td>
</tr>
<tr>
<td align="center">
<a href="https://github.com/kexinoh">
<img src="https://avatars.githubusercontent.com/u/91727108?v=4" width="100;" alt="kexinoh"/>
<br />
<sub><b>KEXNA</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/aa65535">
<img src="https://avatars.githubusercontent.com/u/5417786?v=4" width="100;" alt="aa65535"/>
<br />
<sub><b>Jian Chang</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/522109452">
<img src="https://avatars.githubusercontent.com/u/16929874?v=4" width="100;" alt="522109452"/>
<br />
<sub><b>tianqing</b></sub>
</a>
</td>
</tr>
</tbody>
</table>
<!-- readme: contributors -end -->
## Star history
- If this project helps you, please give it a star ❤️❤️❤️
[![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
## References
- Xiaohongshu client: [ReaJason's xhs repository](https://github.com/ReaJason/xhs)
- SMS forwarding: [reference repository](https://github.com/pppscn/SmsForwarder)
- Intranet tunneling tool: [ngrok](https://ngrok.com/docs/)
## Disclaimer
<div id="disclaimer">
### 1. Purpose and nature of the project
This project (the "Project") was created as a tool for technical research and learning, to explore and study web data-collection techniques. It focuses on crawling techniques for social-media platforms and is intended for learners and researchers as a subject of technical exchange.
### 2. Legal compliance
The developer of the Project (the "Developer") reminds users to strictly comply with the laws and regulations of the People's Republic of China when downloading, installing and using the Project, including but not limited to the Cybersecurity Law, the Counter-Espionage Law and all other applicable national laws and policies. Users bear sole responsibility for any legal liability arising from their use of the Project.
### 3. Restrictions on use
The Project must not be used for any illegal purpose or for commercial activity outside learning and research. It must not be used for unlawful intrusion into other computer systems or for any act that infringes intellectual-property rights or other legitimate rights. Users must ensure that they use the Project solely for personal learning and technical research and not for any form of illegal activity.
### 4. Disclaimer of liability
The Developer has made every effort to ensure the legitimacy and safety of the Project but accepts no responsibility for any direct or indirect loss of any kind that may result from its use, including but not limited to data loss, equipment damage or legal proceedings.
### 5. Intellectual property
The intellectual property of the Project belongs to the Developer. The Project is protected by copyright law, international copyright treaties and other intellectual-property laws and treaties. Users may download and use the Project provided they comply with this statement and the applicable laws and regulations.
### 6. Final interpretation
The right of final interpretation of the Project rests with the Developer. The Developer reserves the right to change or update this disclaimer at any time without notice.
</div>

4
README.txt Normal file
View file

@ -0,0 +1,4 @@
Xiaohongshu core functionality: media_platform/xhs/core.py
Added support for crawling the recommended ("explore") feed:
python main.py --platform xhs --lt qrcode --type explore
The implementation lives in the get_explore function in core.py.

96
async_db.py Normal file
View file

@ -0,0 +1,96 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 14:21
# @Desc : 异步Aiomysql的增删改查封装
from typing import Any, Dict, List, Union
import aiomysql
class AsyncMysqlDB:
def __init__(self, pool: aiomysql.Pool) -> None:
self.__pool = pool
async def query(self, sql: str, *args: Union[str, int]) -> List[Dict[str, Any]]:
"""
从给定的 SQL 中查询记录返回的是一个列表
:param sql: 查询的sql
:param args: sql中传递动态参数列表
:return:
"""
async with self.__pool.acquire() as conn:
async with conn.cursor(aiomysql.DictCursor) as cur:
await cur.execute(sql, args)
data = await cur.fetchall()
return data or []
async def get_first(self, sql: str, *args: Union[str, int]) -> Union[Dict[str, Any], None]:
"""
从给定的 SQL 中查询记录返回的是符合条件的第一个结果
:param sql: 查询的sql
:param args:sql中传递动态参数列表
:return:
"""
async with self.__pool.acquire() as conn:
async with conn.cursor(aiomysql.DictCursor) as cur:
await cur.execute(sql, args)
data = await cur.fetchone()
return data
async def item_to_table(self, table_name: str, item: Dict[str, Any]) -> int:
"""
表中插入数据
:param table_name: 表名
:param item: 一条记录的字典信息
:return:
"""
fields = list(item.keys())
values = list(item.values())
fields = [f'`{field}`' for field in fields]
fieldstr = ','.join(fields)
valstr = ','.join(['%s'] * len(item))
sql = "INSERT INTO %s (%s) VALUES(%s)" % (table_name, fieldstr, valstr)
async with self.__pool.acquire() as conn:
async with conn.cursor(aiomysql.DictCursor) as cur:
await cur.execute(sql, values)
lastrowid = cur.lastrowid
return lastrowid
async def update_table(self, table_name: str, updates: Dict[str, Any], field_where: str,
value_where: Union[str, int, float]) -> int:
"""
更新指定表的记录
:param table_name: 表名
:param updates: 需要更新的字段和值的 key - value 映射
:param field_where: update 语句 where 条件中的字段名
:param value_where: update 语句 where 条件中的字段值
:return:
"""
upsets = []
values = []
for k, v in updates.items():
s = '`%s`=%%s' % k
upsets.append(s)
values.append(v)
upsets = ','.join(upsets)
sql = 'UPDATE %s SET %s WHERE %s="%s"' % (
table_name,
upsets,
field_where, value_where,
)
async with self.__pool.acquire() as conn:
async with conn.cursor() as cur:
rows = await cur.execute(sql, values)
return rows
async def execute(self, sql: str, *args: Union[str, int]) -> int:
"""
需要更新写入等操作的 excute 执行语句
:param sql:
:param args:
:return:
"""
async with self.__pool.acquire() as conn:
async with conn.cursor() as cur:
rows = await cur.execute(sql, args)
return rows
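A minimal usage sketch of the AsyncMysqlDB wrapper above; the connection settings and the `demo_note` table are placeholders, adjust them to your environment:
```python
# Usage sketch for AsyncMysqlDB. Assumes a reachable MySQL server and an
# existing `demo_note` table; both are placeholders for illustration.
import asyncio

import aiomysql

from async_db import AsyncMysqlDB


async def demo():
    pool = await aiomysql.create_pool(
        host="localhost", port=3306, user="root",
        password="123456", db="media_crawler", autocommit=True,
    )
    db = AsyncMysqlDB(pool)
    new_id = await db.item_to_table("demo_note", {"title": "hello", "liked_count": 1})
    row = await db.get_first("select * from demo_note where id = %s", new_id)
    print(new_id, row)
    pool.close()
    await pool.wait_closed()


asyncio.run(demo())
```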

0
base/__init__.py Normal file
View file

71
base/base_crawler.py Normal file
View file

@ -0,0 +1,71 @@
from abc import ABC, abstractmethod
from typing import Dict, Optional
from playwright.async_api import BrowserContext, BrowserType
class AbstractCrawler(ABC):
@abstractmethod
async def start(self):
pass
@abstractmethod
async def search(self):
pass
@abstractmethod
async def launch_browser(self, chromium: BrowserType, playwright_proxy: Optional[Dict], user_agent: Optional[str],
headless: bool = True) -> BrowserContext:
pass
class AbstractLogin(ABC):
@abstractmethod
async def begin(self):
pass
@abstractmethod
async def login_by_qrcode(self):
pass
@abstractmethod
async def login_by_mobile(self):
pass
@abstractmethod
async def login_by_cookies(self):
pass
class AbstractStore(ABC):
@abstractmethod
async def store_content(self, content_item: Dict):
pass
@abstractmethod
async def store_comment(self, comment_item: Dict):
pass
# TODO support all platform
# only xhs is supported, so @abstractmethod is commented
# @abstractmethod
async def store_creator(self, creator: Dict):
pass
class AbstractStoreImage(ABC):
# TODO: support all platform
# only weibo is supported
# @abstractmethod
async def store_image(self, image_content_item: Dict):
pass
class AbstractApiClient(ABC):
@abstractmethod
async def request(self, method, url, **kwargs):
pass
@abstractmethod
async def update_cookies(self, browser_context: BrowserContext):
pass
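The real store implementations live under store/ and are not part of this file; purely as an illustration of how a backend plugs into these abstractions, a minimal in-memory AbstractStore might look like this:
```python
# Illustration only: a minimal in-memory AbstractStore. The project's real
# store backends (csv/db/json) live under store/ and are not shown here.
from typing import Dict, List

from base.base_crawler import AbstractStore


class InMemoryStore(AbstractStore):
    def __init__(self) -> None:
        self.contents: List[Dict] = []
        self.comments: List[Dict] = []

    async def store_content(self, content_item: Dict):
        # keep the raw content dict exactly as the crawler produced it
        self.contents.append(content_item)

    async def store_comment(self, comment_item: Dict):
        self.comments.append(comment_item)

    async def store_creator(self, creator: Dict):
        # store_creator is not abstract yet (see the TODO above),
        # but a backend may still override it
        pass
```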

0
cache/__init__.py vendored Normal file
View file

42
cache/abs_cache.py vendored Normal file
View file

@ -0,0 +1,42 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Time : 2024/6/2 11:06
# @Desc : 抽象类
from abc import ABC, abstractmethod
from typing import Any, List, Optional
class AbstractCache(ABC):
@abstractmethod
def get(self, key: str) -> Optional[Any]:
"""
从缓存中获取键的值
这是一个抽象方法子类必须实现这个方法
:param key:
:return:
"""
raise NotImplementedError
@abstractmethod
def set(self, key: str, value: Any, expire_time: int) -> None:
"""
将键的值设置到缓存中
这是一个抽象方法子类必须实现这个方法
:param key:
:param value:
:param expire_time: 过期时间
:return:
"""
raise NotImplementedError
@abstractmethod
def keys(self, pattern: str) -> List[str]:
"""
获取所有符合pattern的key
:param pattern: 匹配模式
:return:
"""
raise NotImplementedError

29
cache/cache_factory.py vendored Normal file
View file

@ -0,0 +1,29 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Time : 2024/6/2 11:23
# @Desc :
class CacheFactory:
"""
缓存工厂类
"""
@staticmethod
def create_cache(cache_type: str, *args, **kwargs):
"""
创建缓存对象
:param cache_type: 缓存类型
:param args: 参数
:param kwargs: 关键字参数
:return:
"""
if cache_type == 'memory':
from .local_cache import ExpiringLocalCache
return ExpiringLocalCache(*args, **kwargs)
elif cache_type == 'redis':
from .redis_cache import RedisCache
return RedisCache()
else:
raise ValueError(f'Unknown cache type: {cache_type}')
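A short usage sketch of the factory (the cached key and value are example data); the memory backend forwards its arguments to ExpiringLocalCache, while the redis backend reads its connection settings from config/db_config.py:
```python
# Usage sketch for CacheFactory; the cached key/value are example data.
from cache.cache_factory import CacheFactory

cache = CacheFactory.create_cache("memory", cron_interval=30)
cache.set("last_note_id", "6422c2750000000027000d88", expire_time=60)
print(cache.get("last_note_id"))
print(cache.keys("last_*"))
```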

120
cache/local_cache.py vendored Normal file
View file

@ -0,0 +1,120 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Time : 2024/6/2 11:05
# @Desc : 本地缓存
import asyncio
import time
from typing import Any, Dict, List, Optional, Tuple
from cache.abs_cache import AbstractCache
class ExpiringLocalCache(AbstractCache):
def __init__(self, cron_interval: int = 10):
"""
初始化本地缓存
:param cron_interval: 定时清楚cache的时间间隔
:return:
"""
self._cron_interval = cron_interval
self._cache_container: Dict[str, Tuple[Any, float]] = {}
self._cron_task: Optional[asyncio.Task] = None
# 开启定时清理任务
self._schedule_clear()
def __del__(self):
"""
析构函数清理定时任务
:return:
"""
if self._cron_task is not None:
self._cron_task.cancel()
def get(self, key: str) -> Optional[Any]:
"""
从缓存中获取键的值
:param key:
:return:
"""
value, expire_time = self._cache_container.get(key, (None, 0))
if value is None:
return None
# 如果键已过期则删除键并返回None
if expire_time < time.time():
del self._cache_container[key]
return None
return value
def set(self, key: str, value: Any, expire_time: int) -> None:
"""
将键的值设置到缓存中
:param key:
:param value:
:param expire_time:
:return:
"""
self._cache_container[key] = (value, time.time() + expire_time)
def keys(self, pattern: str) -> List[str]:
"""
获取所有符合pattern的key
:param pattern: 匹配模式
:return:
"""
if pattern == '*':
return list(self._cache_container.keys())
# 本地缓存通配符暂时将*替换为空
if '*' in pattern:
pattern = pattern.replace('*', '')
return [key for key in self._cache_container.keys() if pattern in key]
def _schedule_clear(self):
"""
开启定时清理任务,
:return:
"""
try:
loop = asyncio.get_event_loop()
except RuntimeError:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
self._cron_task = loop.create_task(self._start_clear_cron())
def _clear(self):
"""
根据过期时间清理缓存
:return:
"""
# 先收集已过期的 key 再删除,避免在遍历字典时修改字典导致 RuntimeError
expired_keys = [key for key, (_, expire_time) in self._cache_container.items() if expire_time < time.time()]
for key in expired_keys:
del self._cache_container[key]
async def _start_clear_cron(self):
"""
开启定时清理任务
:return:
"""
while True:
self._clear()
await asyncio.sleep(self._cron_interval)
if __name__ == '__main__':
cache = ExpiringLocalCache(cron_interval=2)
cache.set('name', '程序员阿江-Relakkes', 3)
print(cache.get('name'))
print(cache.keys("*"))
time.sleep(4)
print(cache.get('name'))  # 已过期,返回 None
del cache
time.sleep(1)
print("done")

76
cache/redis_cache.py vendored Normal file
View file

@ -0,0 +1,76 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Time : 2024/5/29 22:57
# @Desc : RedisCache实现
import pickle
import time
from typing import Any, List
from redis import Redis
from cache.abs_cache import AbstractCache
from config import db_config
class RedisCache(AbstractCache):
def __init__(self) -> None:
# 连接redis, 返回redis客户端
self._redis_client = self._connet_redis()
@staticmethod
def _connet_redis() -> Redis:
"""
连接redis, 返回redis客户端, 这里按需配置redis连接信息
:return:
"""
return Redis(
host=db_config.REDIS_DB_HOST,
port=db_config.REDIS_DB_PORT,
db=db_config.REDIS_DB_NUM,
password=db_config.REDIS_DB_PWD,
)
def get(self, key: str) -> Any:
"""
从缓存中获取键的值, 并且反序列化
:param key:
:return:
"""
value = self._redis_client.get(key)
if value is None:
return None
return pickle.loads(value)
def set(self, key: str, value: Any, expire_time: int) -> None:
"""
将键的值设置到缓存中, 并且序列化
:param key:
:param value:
:param expire_time:
:return:
"""
self._redis_client.set(key, pickle.dumps(value), ex=expire_time)
def keys(self, pattern: str) -> List[str]:
"""
获取所有符合pattern的key
"""
return [key.decode() for key in self._redis_client.keys(pattern)]
if __name__ == '__main__':
redis_cache = RedisCache()
# basic usage
redis_cache.set("name", "程序员阿江-Relakkes", 1)
print(redis_cache.get("name")) # Relakkes
print(redis_cache.keys("*")) # ['name']
time.sleep(2)
print(redis_cache.get("name")) # None
# special python type usage
# list
redis_cache.set("list", [1, 2, 3], 10)
_value = redis_cache.get("list")
print(_value, f"value type:{type(_value)}") # [1, 2, 3]

1
cmd_arg/__init__.py Normal file
View file

@ -0,0 +1 @@
from .arg import *

40
cmd_arg/arg.py Normal file
View file

@ -0,0 +1,40 @@
import argparse
import config
from tools.utils import str2bool
async def parse_cmd():
# 读取command arg
parser = argparse.ArgumentParser(description='Media crawler program.')
parser.add_argument('--platform', type=str, help='Media platform select (xhs | dy | ks | bili | wb)',
choices=["xhs", "dy", "ks", "bili", "wb"], default=config.PLATFORM)
parser.add_argument('--lt', type=str, help='Login type (qrcode | phone | cookie)',
choices=["qrcode", "phone", "cookie"], default=config.LOGIN_TYPE)
parser.add_argument('--type', type=str, help='crawler type (search | detail | creator)',
choices=["search", "detail", "creator", "explore"], default=config.CRAWLER_TYPE)
parser.add_argument('--start', type=int,
help='number of start page', default=config.START_PAGE)
parser.add_argument('--keywords', type=str,
help='please input keywords', default=config.KEYWORDS)
parser.add_argument('--get_comment', type=str2bool,
help='''whether to crawl level one comment, supported values case insensitive ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_COMMENTS)
parser.add_argument('--get_sub_comment', type=str2bool,
help='''whether to crawl level two comment, supported values case insensitive ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_SUB_COMMENTS)
parser.add_argument('--save_data_option', type=str,
help='where to save the data (csv or db or json)', choices=['csv', 'db', 'json'], default=config.SAVE_DATA_OPTION)
parser.add_argument('--cookies', type=str,
help='cookies used for cookie login type', default=config.COOKIES)
args = parser.parse_args()
# override config
config.PLATFORM = args.platform
config.LOGIN_TYPE = args.lt
config.CRAWLER_TYPE = args.type
config.START_PAGE = args.start
config.KEYWORDS = args.keywords
config.ENABLE_GET_COMMENTS = args.get_comment
config.ENABLE_GET_SUB_COMMENTS = args.get_sub_comment
config.SAVE_DATA_OPTION = args.save_data_option
config.COOKIES = args.cookies
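str2bool comes from tools/utils.py, which is not shown in this diff; a typical implementation matching the help text above (an assumption, for reference only) would be:
```python
# Assumed shape of tools.utils.str2bool (not shown in this diff); it accepts
# the case-insensitive values listed in the --get_comment help text.
import argparse


def str2bool(value) -> bool:
    if isinstance(value, bool):
        return value
    if value.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if value.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected.")
```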

2
config/__init__.py Normal file
View file

@ -0,0 +1,2 @@
from .base_config import *
from .db_config import *

131
config/base_config.py Normal file
View file

@ -0,0 +1,131 @@
# 基础配置
PLATFORM = "xhs"
KEYWORDS = "python,golang"
LOGIN_TYPE = "qrcode" # qrcode or phone or cookie
COOKIES = ""
# 具体值参见media_platform.xxx.field下的枚举值暂时只支持小红书
SORT_TYPE = "popularity_descending"
# 具体值参见media_platform.xxx.field下的枚举值暂时只支持抖音
PUBLISH_TIME_TYPE = 0
CRAWLER_TYPE = "search" # 爬取类型search(关键词搜索) | detail(帖子详情)| creator(创作者主页数据)
# 是否开启 IP 代理
ENABLE_IP_PROXY = False
# 代理IP池数量
IP_PROXY_POOL_COUNT = 2
# 代理IP提供商名称
IP_PROXY_PROVIDER_NAME = "kuaidaili"
# 设置为True不会打开浏览器无头浏览器
# 设置False会打开一个浏览器
# 小红书如果一直扫码登录不通过,打开浏览器手动过一下滑动验证码
# 抖音如果一直提示失败,打开浏览器看下是否扫码登录之后出现了手机号验证,如果出现了手动过一下再试。
HEADLESS = False
# 是否保存登录状态
SAVE_LOGIN_STATE = True
# 数据保存类型选项配置,支持三种类型csv、db、json
SAVE_DATA_OPTION = "json" # csv or db or json
# 用户浏览器缓存的浏览器文件配置
USER_DATA_DIR = "%s_user_data_dir" # %s will be replaced by platform name
# 爬取开始页数 默认从第一页开始
START_PAGE = 1
# 爬取视频/帖子的数量控制
CRAWLER_MAX_NOTES_COUNT = 20
# 并发爬虫数量控制
MAX_CONCURRENCY_NUM = 4
# 是否开启爬图片模式, 默认不开启爬图片
ENABLE_GET_IMAGES = False
# 是否开启爬评论模式, 默认不开启爬评论
ENABLE_GET_COMMENTS = False
# 是否开启爬二级评论模式, 默认不开启爬二级评论, 目前仅支持 xhs, bilibili
# 老版本项目使用了 db, 则需参考 schema/tables.sql line 287 增加表字段
ENABLE_GET_SUB_COMMENTS = False
# 指定小红书需要爬虫的笔记ID列表
# 667a0c27000000001e010d42
XHS_SPECIFIED_ID_LIST = [
"6422c2750000000027000d88",
"64ca1b73000000000b028dd2",
"630d5b85000000001203ab41",
# ........................
]
# 指定抖音需要爬取的ID列表
DY_SPECIFIED_ID_LIST = [
"7280854932641664319",
"7202432992642387233"
# ........................
]
# 指定快手平台需要爬取的ID列表
KS_SPECIFIED_ID_LIST = [
"3xf8enb8dbj6uig",
"3x6zz972bchmvqe"
]
# 指定B站平台需要爬取的视频bvid列表
BILI_SPECIFIED_ID_LIST = [
"BV1d54y1g7db",
"BV1Sz4y1U77N",
"BV14Q4y1n7jz",
# ........................
]
# 指定微博平台需要爬取的帖子列表
WEIBO_SPECIFIED_ID_LIST = [
"4982041758140155",
# ........................
]
# 指定小红书创作者ID列表
XHS_CREATOR_ID_LIST = [
"5c4548d80000000006030727",
# "63e36c9a000000002703502b",
# ........................
]
# 指定Dy创作者ID列表(sec_id)
DY_CREATOR_ID_LIST = [
"MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
# ........................
]
# 指定bili创作者ID列表(sec_id)
BILI_CREATOR_ID_LIST = [
"20813884",
# ........................
]
# 指定快手创作者ID列表
KS_CREATOR_ID_LIST = [
"3x4sm73aye7jq7i",
# ........................
]
#词云相关
#是否开启生成评论词云图
ENABLE_GET_WORDCLOUD = False
# 自定义词语及其分组
#添加规则xx:yy 其中xx为自定义添加的词组yy为将xx该词组分到的组名。
CUSTOM_WORDS = {
'零几': '年份', # 将“零几”识别为一个整体
'高频词': '专业术语' # 示例自定义词
}
#停用(禁用)词文件路径
STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
#中文字体文件路径
FONT_PATH= "./docs/STZHONGS.TTF"

20
config/db_config.py Normal file
View file

@ -0,0 +1,20 @@
import os
# mysql config
RELATION_DB_PWD = os.getenv("RELATION_DB_PWD", "123456")
RELATION_DB_USER = os.getenv("RELATION_DB_USER", "root")
RELATION_DB_HOST = os.getenv("RELATION_DB_HOST", "localhost")
RELATION_DB_PORT = os.getenv("RELATION_DB_PORT", "3306")
RELATION_DB_NAME = os.getenv("RELATION_DB_NAME", "media_crawler")
RELATION_DB_URL = f"mysql://{RELATION_DB_USER}:{RELATION_DB_PWD}@{RELATION_DB_HOST}:{RELATION_DB_PORT}/{RELATION_DB_NAME}"
# redis config
REDIS_DB_HOST = "127.0.0.1" # your redis host
REDIS_DB_PWD = os.getenv("REDIS_DB_PWD", "123456") # your redis password
REDIS_DB_PORT = os.getenv("REDIS_DB_PORT", 6379) # your redis port
REDIS_DB_NUM = os.getenv("REDIS_DB_NUM", 0) # your redis db num
# cache type
CACHE_TYPE_REDIS = "redis"
CACHE_TYPE_MEMORY = "memory"

96
db.py Normal file
View file

@ -0,0 +1,96 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 14:54
# @Desc : mediacrawler db 管理
import asyncio
from typing import Dict
from urllib.parse import urlparse
import aiofiles
import aiomysql
import config
from async_db import AsyncMysqlDB
from tools import utils
from var import db_conn_pool_var, media_crawler_db_var
def parse_mysql_url(mysql_url) -> Dict:
"""
从配置文件中解析db链接url给到aiomysql用因为aiomysql不支持直接以URL的方式传递链接信息
Args:
mysql_url: mysql://root:{RELATION_DB_PWD}@localhost:3306/media_crawler
Returns:
"""
parsed_url = urlparse(mysql_url)
db_params = {
'host': parsed_url.hostname,
'port': parsed_url.port or 3306,
'user': parsed_url.username,
'password': parsed_url.password,
'db': parsed_url.path.lstrip('/')
}
return db_params
async def init_mediacrawler_db():
"""
初始化数据库链接池对象并将该对象塞给media_crawler_db_var上下文变量
Returns:
"""
db_conn_params = parse_mysql_url(config.RELATION_DB_URL)
pool = await aiomysql.create_pool(
autocommit=True,
**db_conn_params
)
async_db_obj = AsyncMysqlDB(pool)
# 将连接池对象和封装的CRUD sql接口对象放到上下文变量中
db_conn_pool_var.set(pool)
media_crawler_db_var.set(async_db_obj)
async def init_db():
"""
初始化db连接池
Returns:
"""
utils.logger.info("[init_db] start init mediacrawler db connect object")
await init_mediacrawler_db()
utils.logger.info("[init_db] end init mediacrawler db connect object")
async def close():
"""
关闭连接池
Returns:
"""
utils.logger.info("[close] close mediacrawler db pool")
db_pool: aiomysql.Pool = db_conn_pool_var.get()
if db_pool is not None:
db_pool.close()
async def init_table_schema():
"""
用来初始化数据库表结构请在第一次需要创建表结构的时候使用多次执行该函数会将已有的表以及数据全部删除
Returns:
"""
utils.logger.info("[init_table_schema] begin init mysql table schema ...")
await init_mediacrawler_db()
async_db_obj: AsyncMysqlDB = media_crawler_db_var.get()
async with aiofiles.open("schema/tables.sql", mode="r") as f:
schema_sql = await f.read()
await async_db_obj.execute(schema_sql)
utils.logger.info("[init_table_schema] mediacrawler table schema init successful")
await close()
if __name__ == '__main__':
asyncio.get_event_loop().run_until_complete(init_table_schema())

BIN
docs/STZHONGS.TTF Normal file

Binary file not shown.

768
docs/hit_stopwords.txt Normal file
View file

@ -0,0 +1,768 @@
\n
———
》),
)÷(1-
”,
)、
:
&
*
一一
~~~~
.
.一
./
--
=″
[⑤]]
[①D]
ng昉
//
[②e]
[②g]
}
,也
[①⑥]
[②B]
[①a]
[④a]
[①③]
[③h]
③]
[②b]
×××
[①⑧]
[⑤b]
[②c]
[④b]
[②③]
[③a]
[④c]
[①⑤]
[①⑦]
[①g]
∈[
[①⑨]
[①④]
[①c]
[②f]
[②⑧]
[②①]
[①C]
[③c]
[③g]
[②⑤]
[②②]
一.
[①h]
.数
[①B]
数/
[①i]
[③e]
[①①]
[④d]
[④e]
[③b]
[⑤a]
[①A]
[②⑧]
[②⑦]
[①d]
[②j]
://
′∈
[②④
[⑤e]
...
...................
…………………………………………………③
[③F]
[①o]
]∧′=[
∪φ∈
②c
[③①]
[①E]
Ψ
.日
[②d]
[②
[②⑦]
[②②]
[③e]
[①i]
[①B]
[①h]
[①d]
[①g]
[①②]
[②a]
[⑩]
[①e]
[②h]
[②⑥]
[③d]
[②⑩]
元/吨
[②⑩]
[①]
::
[②]
[③]
[④]
[⑤]
[⑥]
[⑦]
[⑧]
[⑨]
……
——
?
,
'
?
·
———
──
?
<
>
[
]
(
)
-
+
×
/
В
"
;
#
@
γ
μ
φ
φ.
×
Δ
sub
exp
sup
sub
Lex
+ξ
-β
<±
<Δ
<λ
<φ
=
=☆
>λ
_
~±
[⑤f]
[⑤d]
[②i]
[②G]
[①f]
......
[③⑩]
第二
一番
一直
一个
一些
许多
有的是
也就是说
末##末
哎呀
哎哟
俺们
按照
吧哒
罢了
本着
比方
比如
鄙人
彼此
别的
别说
并且
不比
不成
不单
不但
不独
不管
不光
不过
不仅
不拘
不论
不怕
不然
不如
不特
不惟
不问
不只
朝着
趁着
除此之外
除非
除了
此间
此外
从而
但是
当着
的话
等等
叮咚
对于
多少
而况
而且
而是
而外
而言
而已
尔后
反过来
反过来说
反之
非但
非徒
否则
嘎登
各个
各位
各种
各自
根据
故此
固然
关于
果然
果真
哈哈
何处
何况
何时
哼唷
呼哧
还是
还有
换句话说
换言之
或是
或者
极了
及其
及至
即便
即或
即令
即若
即使
几时
既然
既是
继而
加之
假如
假若
假使
鉴于
较之
接着
结果
紧接着
进而
尽管
经过
就是
就是说
具体地说
具体说来
开始
开外
可见
可是
可以
况且
来着
例如
连同
两者
另外
另一方面
慢说
漫说
每当
莫若
某个
某些
哪边
哪儿
哪个
哪里
哪年
哪怕
哪天
哪些
哪样
那边
那儿
那个
那会儿
那里
那么
那么些
那么样
那时
那些
那样
乃至
你们
宁可
宁肯
宁愿
啪达
旁人
凭借
其次
其二
其他
其它
其一
其余
其中
起见
起见
岂但
恰恰相反
前后
前者
然而
然后
然则
人家
任何
任凭
如此
如果
如何
如其
如若
如上所述
若非
若是
上下
尚且
设若
设使
甚而
甚么
甚至
省得
时候
什么
什么样
使得
是的
首先
谁知
顺着
似的
虽然
虽说
虽则
随着
所以
他们
他人
它们
她们
倘或
倘然
倘若
倘使
通过
同时
万一
为何
为了
为什么
为着
嗡嗡
我们
呜呼
乌乎
无论
无宁
毋宁
相对而言
向着
沿
沿着
要不
要不然
要不是
要么
要是
也罢
也好
一般
一旦
一方面
一来
一切
一样
一则
依照
以便
以及
以免
以至
以至于
以致
抑或
因此
因而
因为
由此可见
由于
有的
有关
有些
于是
于是乎
与此同时
与否
与其
越是
云云
再说
再者
在下
咱们
怎么
怎么办
怎么样
怎样
照着
这边
这儿
这个
这会儿
这就是说
这里
这么
这么点儿
这么些
这么样
这时
这些
这样
正如
之类
之所以
之一
只是
只限
只要
只有
至于
诸位
着呢
自从
自个儿
自各儿
自己
自家
自身
综上所述
总的来看
总的来说
总的说来
总而言之
总之
纵令
纵然
纵使
遵照
作为
喔唷

47
docs/代理使用.md Normal file
View file

@ -0,0 +1,47 @@
## Proxy IP usage
> Once again: do not run large-scale crawls against these platforms or do anything else illegal, unless you want to end up behind a prison sewing machine 🤣
### Simple flow diagram
![Proxy IP flow diagram](../static/images/代理IP%20流程图.drawio.png)
### Prepare the proxy IP account
Register at <a href="https://www.kuaidaili.com/?ref=ldwkjqipvz6c">Kuaidaili</a> and complete real-name verification (real-name verification is mandatory for proxy IPs in China).
### Get the proxy credentials
Request the free trial on the <a href="https://www.kuaidaili.com/?ref=ldwkjqipvz6c">Kuaidaili</a> site, as shown below
![img.png](../static/images/img.png)
Note: choose the private proxy (私密代理) product
![img_1.png](../static/images/img_1.png)
Enable the trial
![img_2.png](../static/images/img_2.png)
Initializing a Kuaidaili instance takes four parameters, as in the code below:
```python
def new_kuai_daili_proxy() -> KuaiDaiLiProxy:
    """
    Construct a Kuaidaili HTTP proxy instance
    Returns:
    """
    return KuaiDaiLiProxy(
        kdl_secret_id=os.getenv("kdl_secret_id", "your kuaidaili secret_id"),
        kdl_signature=os.getenv("kdl_signature", "your kuaidaili signature"),
        kdl_user_name=os.getenv("kdl_user_name", "your kuaidaili username"),
        kdl_user_pwd=os.getenv("kdl_user_pwd", "your kuaidaili password"),
    )
```
All four values can be found in the trial order, as shown below
`kdl_user_name`, `kdl_user_pwd`
![img_3.png](../static/images/img_3.png)
`kdl_secret_id`, `kdl_signature`
![img_4.png](../static/images/img_4.png)
### Set `ENABLE_IP_PROXY` to `True` in the config file
> `IP_PROXY_POOL_COUNT` controls how many IPs are kept in the pool (see the sketch below for how the pool is consumed)
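For reference, this is roughly how the crawlers in this commit consume the pool once the proxy is enabled (a sketch; the four kdl_* environment variables must be set first):
```python
# Sketch of how a crawler uses the proxy pool once ENABLE_IP_PROXY is True.
# The four kdl_* environment variables must be set beforehand.
import asyncio

import config
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool


async def demo():
    pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
    ip_info: IpInfoModel = await pool.get_proxy()
    print(ip_info.ip, ip_info.port)


asyncio.run(demo())
```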

58
docs/关于词云图相关操作.md Normal file
View file

@ -0,0 +1,58 @@
# Word cloud notes
### 1. How to enable the word cloud correctly
***Note: a word cloud is currently only generated when data is saved as JSON; support for the other storage options will be added soon.***
Config items to change (./config/base_config.py):
```python
# Data storage option: csv, db or json
# must be "json" here, for the reason above
SAVE_DATA_OPTION = "json"  # csv or db or json
```
```python
# Whether to crawl comments (off by default)
# must be True here: comments are needed before a comment word cloud can be built
ENABLE_GET_COMMENTS = True
```
```python
# Word cloud options
# whether to generate a comment word cloud
# turn the feature on
ENABLE_GET_WORDCLOUD = True
```
```python
# Custom words and their groups
# rule: xx:yy, where xx is the custom word and yy is the group it is assigned to
CUSTOM_WORDS = {
    '零几': '年份',  # treat "零几" as a single token
    '高频词': '专业术语'  # example custom word
}
```
```python
# path to the stop-word file
STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
```
```python
# path to the Chinese font file
FONT_PATH = "./docs/STZHONGS.TTF"
```
**Notes**
- For custom words, in `xx:yy` the `xx` is the custom word and `yy` is the group it is assigned to; `yy` can be any value.
- To add stop words, append them to ./docs/hit_stopwords.txt (keep the format: one word per line).
- `FONT_PATH` is the Chinese font used in the word cloud (a SimSun-style font by default); you can point it to another font file. A sketch of how these settings come together is shown below.
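Illustration only (the project's own implementation is in tools/words.py, which is not shown in this diff): segment the comment text with jieba, drop the stop words and render the cloud with the configured font.
```python
# Illustration only; the project's real implementation is in tools/words.py.
# Segment comments with jieba, drop stop words, render with the configured font.
import jieba
from wordcloud import WordCloud

comment_text = "这里是一段示例评论文本"  # example data
with open("./docs/hit_stopwords.txt", encoding="utf-8") as f:
    stop_words = set(line.strip() for line in f)

words = [w for w in jieba.lcut(comment_text) if w.strip() and w not in stop_words]
wc = WordCloud(font_path="./docs/STZHONGS.TTF", width=800, height=600,
               background_color="white").generate(" ".join(words))
wc.to_file("comment_wordcloud.png")
```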
## 2. Where the word cloud is written
![image-20240627204928601](https://rosyrain.oss-cn-hangzhou.aliyuncs.com/img2/202406272049662.png)
As shown above, the output goes into the `words` folder under data/: the JSON file holds the word frequencies and the PNG is the word cloud. The raw comment content is under the `json` folder.

31
docs/常见问题.md Normal file
View file

@ -0,0 +1,31 @@
## Common runtime errors
Q: Crawling Douyin fails with `execjs._exceptions.ProgramError: SyntaxError: 缺少 ';'` <br>
A: The Node.js runtime is missing; installing Node.js `v16.8.0` fixes it. <br>
Q: Crawling Douyin with cookie login fails with execjs._exceptions.ProgramError: TypeError: Cannot read property 'JS_MD5_NO_COMMON_JS' of null <br>
A: On Windows, download the Windows 64-bit Installer from `https://nodejs.org/en/blog/release/v16.8.0` and just click through the installer. <br>
Q: Can I crawl by keyword? <br>
A: The KEYWORDS option in config/base_config.py controls which keywords are crawled. <br>
Q: Can I crawl specific posts? <br>
A: The XHS_SPECIFIED_ID_LIST option in config/base_config.py holds the list of post IDs to crawl. <br>
Q: It returns data at first, but stops working after a while? <br>
A: Your account has most likely triggered the platform's risk controls; do not crawl the platforms at scale or disrupt them. <br>
Q: How do I switch the login account? <br>
A: Delete the browser_data/ folder in the project root. <br>
Q: Error `playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.` <br>
A: Check whether a VPN/proxy is interfering with the connection. <br>
Q: Xiaohongshu QR login succeeds but then asks for manual verification. How do I pass it? <br>
A: Open config/base_config.py, set HEADLESS to False, restart the program and complete the CAPTCHA in the browser window. <br>
Q: How do I enable word-cloud generation? <br>
A: Open config/base_config.py and set both `ENABLE_GET_WORDCLOUD` and `ENABLE_GET_COMMENTS` to True. <br>
Q: How do I add stop words and custom words for the word cloud? <br>
A: Add stop words to `docs/hit_stopwords.txt` (one word per line), and add custom words to `CUSTOM_WORDS` in config/base_config.py following its format. <br>

20
docs/手机号登录说明.md Normal file
View file

@ -0,0 +1,20 @@
## Phone number + SMS verification-code login
The crawler simulates a phone-number login in the browser; an SMS-forwarding app pushes the verification code back to the crawler, which fills it in to complete the login automatically.
Preparation:
- One Android phone (iOS untested; monitoring SMS there should also be possible in theory)
- Install the SMS-forwarding app, [reference repository](https://github.com/pppscn/SmsForwarder)
- Configure the app's WEBHOOK settings: the message template (see recv_sms_notification.py in this project) and an API endpoint that can receive the SMS push
- The push endpoint usually needs a domain name bound to it, although an intranet IP also works; I use an intranet tunnel, which provides a free domain bound to the local web server. Tunneling tool: [ngrok](https://ngrok.com/docs/)
- Install Redis and set a password, see [Redis installation](https://www.cnblogs.com/hunanzp/p/12304622.html)
- Run `python recv_sms_notification.py` and wait for the SMS forwarder to send the HTTP notification (a minimal sketch of such a receiver follows below)
- Start the crawler with phone login: `python main.py --platform xhs --lt phone`
Notes:
- Xiaohongshu only allows about 10 SMS per phone number per day, so go easy. Sending codes has not triggered a slider CAPTCHA yet, but it probably will at higher volumes.
- Will the forwarding app read other SMS on my phone? In theory it should not; the [SMS forwarding repository](https://github.com/pppscn/SmsForwarder) has plenty of stars, so plenty of eyes on the code.
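Purely as an illustration of the receiving side (the real server is recv_sms_notification.py, which is not shown in this diff; the /sms route, the Redis key name and the code regex are assumptions):
```python
# Illustration only: receive the forwarded SMS over HTTP, pull out the
# verification code and cache it in Redis so the crawler can read it back.
# The /sms route, the Redis key name and the regex are assumptions.
import re

from aiohttp import web
from redis import Redis

redis_client = Redis(host="127.0.0.1", port=6379, password="123456")


async def handle_sms(request: web.Request) -> web.Response:
    payload = await request.json()
    match = re.search(r"\d{4,6}", payload.get("content", ""))
    if match:
        # keep the code for 5 minutes; the crawler polls this key
        redis_client.set("xhs_phone_code", match.group(), ex=300)
    return web.json_response({"ok": True})


app = web.Application()
app.add_routes([web.post("/sms", handle_sms)])
web.run_app(app, port=8000)
```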

38
docs/项目代码结构.md Normal file
View file

@ -0,0 +1,38 @@
## Project code structure
```
MediaCrawler
├── base
│   └── base_crawler.py       # abstract base classes for the project
├── browser_data              # cached per-user browser data directory
├── config
│   ├── account_config.py     # account & proxy pool configuration
│   ├── base_config.py        # base configuration
│   └── db_config.py          # database configuration
├── data                      # output data directory
├── libs
│   ├── douyin.js             # Douyin sign functions
│   └── stealth.min.js        # JS that hides browser-automation fingerprints
├── media_platform
│   ├── douyin                # Douyin crawler implementation
│   ├── xhs                   # Xiaohongshu crawler implementation
│   ├── bilibili              # Bilibili crawler implementation
│   └── kuaishou              # Kuaishou crawler implementation
├── modles
│   ├── douyin.py             # Douyin data models
│   ├── xiaohongshu.py        # Xiaohongshu data models
│   ├── kuaishou.py           # Kuaishou data models
│   └── bilibili.py           # Bilibili data models
├── tools
│   ├── utils.py              # utility functions exposed to the rest of the project
│   ├── crawler_util.py       # crawler-related helpers
│   ├── slider_util.py        # slider-CAPTCHA helpers
│   ├── time_util.py          # time helpers
│   ├── easing.py             # simulated slider-trajectory functions
│   └── words.py              # word-cloud generation functions
├── db.py                     # DB ORM
├── main.py                   # program entry point
├── var.py                    # context variable definitions
└── recv_sms_notification.py  # HTTP server endpoint for the SMS forwarder
```

578
libs/douyin.js Normal file

File diff suppressed because one or more lines are too long

7
libs/stealth.min.js vendored Normal file

File diff suppressed because one or more lines are too long

51
main.py Normal file
View file

@ -0,0 +1,51 @@
import asyncio
import sys
import cmd_arg
import config
import db
from base.base_crawler import AbstractCrawler
from media_platform.bilibili import BilibiliCrawler
from media_platform.douyin import DouYinCrawler
from media_platform.kuaishou import KuaishouCrawler
from media_platform.weibo import WeiboCrawler
from media_platform.xhs import XiaoHongShuCrawler
class CrawlerFactory:
CRAWLERS = {
"xhs": XiaoHongShuCrawler,
"dy": DouYinCrawler,
"ks": KuaishouCrawler,
"bili": BilibiliCrawler,
"wb": WeiboCrawler
}
@staticmethod
def create_crawler(platform: str) -> AbstractCrawler:
crawler_class = CrawlerFactory.CRAWLERS.get(platform)
if not crawler_class:
raise ValueError("Invalid media platform. Currently supported platforms: xhs | dy | ks | bili | wb")
return crawler_class()
async def main():
# parse cmd
await cmd_arg.parse_cmd()
# init db
if config.SAVE_DATA_OPTION == "db":
await db.init_db()
crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
await crawler.start()
if config.SAVE_DATA_OPTION == "db":
await db.close()
if __name__ == '__main__':
try:
# asyncio.run(main())
asyncio.get_event_loop().run_until_complete(main())
except KeyboardInterrupt:
sys.exit()

View file

View file

@ -0,0 +1,6 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:36
# @Desc :
from .core import *

287
media_platform/bilibili/client.py Normal file
View file

@ -0,0 +1,287 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc : bilibili 请求客户端
import asyncio
import json
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from urllib.parse import urlencode
import httpx
from playwright.async_api import BrowserContext, Page
from base.base_crawler import AbstractApiClient
from tools import utils
from .exception import DataFetchError
from .field import CommentOrderType, SearchOrderType
from .help import BilibiliSign
class BilibiliClient(AbstractApiClient):
def __init__(
self,
timeout=10,
proxies=None,
*,
headers: Dict[str, str],
playwright_page: Page,
cookie_dict: Dict[str, str],
):
self.proxies = proxies
self.timeout = timeout
self.headers = headers
self._host = "https://api.bilibili.com"
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
async def request(self, method, url, **kwargs) -> Any:
async with httpx.AsyncClient(proxies=self.proxies) as client:
response = await client.request(
method, url, timeout=self.timeout,
**kwargs
)
data: Dict = response.json()
if data.get("code") != 0:
raise DataFetchError(data.get("message", "unknown error"))
else:
return data.get("data", {})
async def pre_request_data(self, req_data: Dict) -> Dict:
"""
发送请求进行请求参数签名
需要从 localStorage wbi_img_urls 这参数值如下
https://i0.hdslb.com/bfs/wbi/7cd084941338484aae1ad9425b84077c.png-https://i0.hdslb.com/bfs/wbi/4932caff0ff746eab6f01bf08b70ac45.png
:param req_data:
:return:
"""
if not req_data:
return {}
img_key, sub_key = await self.get_wbi_keys()
return BilibiliSign(img_key, sub_key).sign(req_data)
async def get_wbi_keys(self) -> Tuple[str, str]:
"""
获取最新的 img_key sub_key
:return:
"""
local_storage = await self.playwright_page.evaluate("() => window.localStorage")
wbi_img_urls = local_storage.get("wbi_img_urls", "") or local_storage.get(
"wbi_img_url") + "-" + local_storage.get("wbi_sub_url")
if wbi_img_urls and "-" in wbi_img_urls:
img_url, sub_url = wbi_img_urls.split("-")
else:
resp = await self.request(method="GET", url=self._host + "/x/web-interface/nav")
img_url: str = resp['wbi_img']['img_url']
sub_url: str = resp['wbi_img']['sub_url']
img_key = img_url.rsplit('/', 1)[1].split('.')[0]
sub_key = sub_url.rsplit('/', 1)[1].split('.')[0]
return img_key, sub_key
async def get(self, uri: str, params=None, enable_params_sign: bool = True) -> Dict:
final_uri = uri
if enable_params_sign:
params = await self.pre_request_data(params)
if isinstance(params, dict):
final_uri = (f"{uri}?"
f"{urlencode(params)}")
return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=self.headers)
async def post(self, uri: str, data: dict) -> Dict:
data = await self.pre_request_data(data)
json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
return await self.request(method="POST", url=f"{self._host}{uri}",
data=json_str, headers=self.headers)
async def pong(self) -> bool:
"""get a note to check if login state is ok"""
utils.logger.info("[BilibiliClient.pong] Begin pong bilibili...")
ping_flag = False
try:
check_login_uri = "/x/web-interface/nav"
response = await self.get(check_login_uri)
if response.get("isLogin"):
utils.logger.info(
"[BilibiliClient.pong] Use cache login state get web interface successful!")
ping_flag = True
except Exception as e:
utils.logger.error(
f"[BilibiliClient.pong] Pong bilibili failed: {e}, and try to login again...")
ping_flag = False
return ping_flag
async def update_cookies(self, browser_context: BrowserContext):
cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
self.headers["Cookie"] = cookie_str
self.cookie_dict = cookie_dict
async def search_video_by_keyword(self, keyword: str, page: int = 1, page_size: int = 20,
order: SearchOrderType = SearchOrderType.DEFAULT):
"""
Bilibili web search api
:param keyword: 搜索关键词
:param page: 分页参数,具体第几页
:param page_size: 每一页参数的数量
:param order: 搜索结果排序,默认为综合排序
:return:
"""
uri = "/x/web-interface/wbi/search/type"
post_data = {
"search_type": "video",
"keyword": keyword,
"page": page,
"page_size": page_size,
"order": order.value
}
return await self.get(uri, post_data)
async def get_video_info(self, aid: Union[int, None] = None, bvid: Union[str, None] = None) -> Dict:
"""
Bilibli web video detail api, aid bvid任选一个参数
:param aid: 稿件avid
:param bvid: 稿件bvid
:return:
"""
if not aid and not bvid:
raise ValueError("请提供 aid 或 bvid 中的至少一个参数")
uri = "/x/web-interface/view/detail"
params = dict()
if aid:
params.update({"aid": aid})
else:
params.update({"bvid": bvid})
return await self.get(uri, params, enable_params_sign=False)
async def get_video_comments(self,
video_id: str,
order_mode: CommentOrderType = CommentOrderType.DEFAULT,
next: int = 0
) -> Dict:
"""get video comments
:param video_id: 视频 ID
:param order_mode: 排序方式
:param next: 评论页选择
:return:
"""
uri = "/x/v2/reply/wbi/main"
post_data = {
"oid": video_id,
"mode": order_mode.value,
"type": 1,
"ps": 20,
"next": next
}
return await self.get(uri, post_data)
async def get_video_all_comments(self, video_id: str, crawl_interval: float = 1.0, is_fetch_sub_comments=False,
callback: Optional[Callable] = None, ):
"""
get video all comments include sub comments
:param video_id:
:param crawl_interval:
:param is_fetch_sub_comments:
:param callback:
:return:
"""
result = []
is_end = False
next_page = 0
while not is_end:
comments_res = await self.get_video_comments(video_id, CommentOrderType.DEFAULT, next_page)
cursor_info: Dict = comments_res.get("cursor")
comment_list: List[Dict] = comments_res.get("replies", [])
is_end = cursor_info.get("is_end")
next_page = cursor_info.get("next")
if is_fetch_sub_comments:
for comment in comment_list:
comment_id = comment['rpid']
if comment.get("rcount", 0) > 0:
await self.get_video_all_level_two_comments(
video_id, comment_id, CommentOrderType.DEFAULT, 10, crawl_interval, callback)
if callback: # 如果有回调函数,就执行回调函数
await callback(video_id, comment_list)
await asyncio.sleep(crawl_interval)
if not is_fetch_sub_comments:
result.extend(comment_list)
continue
return result
async def get_video_all_level_two_comments(self,
video_id: str,
level_one_comment_id: int,
order_mode: CommentOrderType,
ps: int = 10,
crawl_interval: float = 1.0,
callback: Optional[Callable] = None,
) -> Dict:
"""
get video all level two comments for a level one comment
:param video_id: 视频 ID
:param level_one_comment_id: 一级评论 ID
:param order_mode:
:param ps: 一页评论数
:param crawl_interval:
:param callback:
:return:
"""
pn = 1
while True:
result = await self.get_video_level_two_comments(
video_id, level_one_comment_id, pn, ps, order_mode)
comment_list: List[Dict] = result.get("replies", [])
if callback: # 如果有回调函数,就执行回调函数
await callback(video_id, comment_list)
await asyncio.sleep(crawl_interval)
if (int(result["page"]["count"]) <= pn * ps):
break
pn += 1
async def get_video_level_two_comments(self,
video_id: str,
level_one_comment_id: int,
pn: int,
ps: int,
order_mode: CommentOrderType,
) -> Dict:
"""get video level two comments
:param video_id: 视频 ID
:param level_one_comment_id: 一级评论 ID
:param order_mode: 排序方式
:return:
"""
uri = "/x/v2/reply/reply"
post_data = {
"oid": video_id,
"mode": order_mode.value,
"type": 1,
"ps": ps,
"pn": pn,
"root": level_one_comment_id,
}
result = await self.get(uri, post_data)
return result
async def get_creator_videos(self, creator_id: str, pn: int, ps: int = 30, order_mode: SearchOrderType = SearchOrderType.LAST_PUBLISH) -> Dict:
"""get all videos for a creator
:param creator_id: 创作者 ID
:param pn: 页数
:param ps: 一页视频数
:param order_mode: 排序方式
:return:
"""
uri = "/x/space/wbi/arc/search"
post_data = {
"mid": creator_id,
"pn": pn,
"ps": ps,
"order": order_mode.value,
}
return await self.get(uri, post_data)

302
media_platform/bilibili/core.py Normal file
View file

@ -0,0 +1,302 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc : B站爬虫
import asyncio
import os
import random
from asyncio import Task
from typing import Dict, List, Optional, Tuple
from playwright.async_api import (BrowserContext, BrowserType, Page,
async_playwright)
import config
from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import bilibili as bilibili_store
from tools import utils
from var import crawler_type_var
from .client import BilibiliClient
from .exception import DataFetchError
from .field import SearchOrderType
from .login import BilibiliLogin
class BilibiliCrawler(AbstractCrawler):
context_page: Page
bili_client: BilibiliClient
browser_context: BrowserContext
def __init__(self):
self.index_url = "https://www.bilibili.com"
self.user_agent = utils.get_user_agent()
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
if config.ENABLE_IP_PROXY:
ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(
ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
playwright_proxy_format,
self.user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page()
await self.context_page.goto(self.index_url)
# Create a client to interact with the bilibili website.
self.bili_client = await self.create_bilibili_client(httpx_proxy_format)
if not await self.bili_client.pong():
login_obj = BilibiliLogin(
login_type=config.LOGIN_TYPE,
login_phone="", # your phone number
browser_context=self.browser_context,
context_page=self.context_page,
cookie_str=config.COOKIES
)
await login_obj.begin()
await self.bili_client.update_cookies(browser_context=self.browser_context)
crawler_type_var.set(config.CRAWLER_TYPE)
if config.CRAWLER_TYPE == "search":
# Search for video and retrieve their comment information.
await self.search()
elif config.CRAWLER_TYPE == "detail":
# Get the information and comments of the specified post
await self.get_specified_videos(config.BILI_SPECIFIED_ID_LIST)
elif config.CRAWLER_TYPE == "creator":
for creator_id in config.BILI_CREATOR_ID_LIST:
await self.get_creator_videos(int(creator_id))
else:
pass
utils.logger.info(
"[BilibiliCrawler.start] Bilibili Crawler finished ...")
async def search(self):
"""
search bilibili video with keywords
:return:
"""
utils.logger.info(
"[BilibiliCrawler.search] Begin search bilibili keywords")
bili_limit_count = 20 # bilibili limit page fixed value
if config.CRAWLER_MAX_NOTES_COUNT < bili_limit_count:
config.CRAWLER_MAX_NOTES_COUNT = bili_limit_count
start_page = config.START_PAGE # start page number
for keyword in config.KEYWORDS.split(","):
utils.logger.info(
f"[BilibiliCrawler.search] Current search keyword: {keyword}")
page = 1
while (page - start_page + 1) * bili_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
if page < start_page:
utils.logger.info(
f"[BilibiliCrawler.search] Skip page: {page}")
page += 1
continue
utils.logger.info(f"[BilibiliCrawler.search] search bilibili keyword: {keyword}, page: {page}")
video_id_list: List[str] = []
videos_res = await self.bili_client.search_video_by_keyword(
keyword=keyword,
page=page,
page_size=bili_limit_count,
order=SearchOrderType.DEFAULT,
)
video_list: List[Dict] = videos_res.get("result")
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_video_info_task(aid=video_item.get(
"aid"), bvid="", semaphore=semaphore)
for video_item in video_list
]
video_items = await asyncio.gather(*task_list)
for video_item in video_items:
if video_item:
video_id_list.append(video_item.get("View").get("aid"))
await bilibili_store.update_bilibili_video(video_item)
page += 1
await self.batch_get_video_comments(video_id_list)
async def batch_get_video_comments(self, video_id_list: List[str]):
"""
batch get video comments
:param video_id_list:
:return:
"""
if not config.ENABLE_GET_COMMENTS:
utils.logger.info(
f"[BilibiliCrawler.batch_get_note_comments] Crawling comment mode is not enabled")
return
utils.logger.info(
f"[BilibiliCrawler.batch_get_video_comments] video ids:{video_id_list}")
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list: List[Task] = []
for video_id in video_id_list:
task = asyncio.create_task(self.get_comments(
video_id, semaphore), name=video_id)
task_list.append(task)
await asyncio.gather(*task_list)
async def get_comments(self, video_id: str, semaphore: asyncio.Semaphore):
"""
get comment for video id
:param video_id:
:param semaphore:
:return:
"""
async with semaphore:
try:
utils.logger.info(
f"[BilibiliCrawler.get_comments] begin get video_id: {video_id} comments ...")
await self.bili_client.get_video_all_comments(
video_id=video_id,
crawl_interval=random.random(),
is_fetch_sub_comments=config.ENABLE_GET_SUB_COMMENTS,
callback=bilibili_store.batch_update_bilibili_video_comments
)
except DataFetchError as ex:
utils.logger.error(
f"[BilibiliCrawler.get_comments] get video_id: {video_id} comment error: {ex}")
except Exception as e:
utils.logger.error(
f"[BilibiliCrawler.get_comments] may be been blocked, err:{e}")
async def get_creator_videos(self, creator_id: int):
"""
get videos for a creator
:return:
"""
ps = 30
pn = 1
video_bvids_list = []
while True:
result = await self.bili_client.get_creator_videos(creator_id, pn, ps)
for video in result["list"]["vlist"]:
video_bvids_list.append(video["bvid"])
if (int(result["page"]["count"]) <= pn * ps):
break
await asyncio.sleep(random.random())
pn += 1
await self.get_specified_videos(video_bvids_list)
async def get_specified_videos(self, bvids_list: List[str]):
"""
get specified videos info
:return:
"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_video_info_task(aid=0, bvid=video_id, semaphore=semaphore) for video_id in
bvids_list
]
video_details = await asyncio.gather(*task_list)
video_aids_list = []
for video_detail in video_details:
if video_detail is not None:
video_item_view: Dict = video_detail.get("View")
video_aid: str = video_item_view.get("aid")
if video_aid:
video_aids_list.append(video_aid)
await bilibili_store.update_bilibili_video(video_detail)
await self.batch_get_video_comments(video_aids_list)
async def get_video_info_task(self, aid: int, bvid: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
"""
Get video detail task
:param aid:
:param bvid:
:param semaphore:
:return:
"""
async with semaphore:
try:
result = await self.bili_client.get_video_info(aid=aid, bvid=bvid)
return result
except DataFetchError as ex:
utils.logger.error(
f"[BilibiliCrawler.get_video_info_task] Get video detail error: {ex}")
return None
except KeyError as ex:
utils.logger.error(
f"[BilibiliCrawler.get_video_info_task] video detail not found, video_id:{bvid}, err: {ex}")
return None
async def create_bilibili_client(self, httpx_proxy: Optional[str]) -> BilibiliClient:
"""Create bilibili client"""
utils.logger.info(
"[BilibiliCrawler.create_bilibili_client] Begin create bilibili API client ...")
cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
bilibili_client_obj = BilibiliClient(
proxies=httpx_proxy,
headers={
"User-Agent": self.user_agent,
"Cookie": cookie_str,
"Origin": "https://www.bilibili.com",
"Referer": "https://www.bilibili.com",
"Content-Type": "application/json;charset=UTF-8"
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
)
return bilibili_client_obj
@staticmethod
def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
"""format proxy info for playwright and httpx"""
playwright_proxy = {
"server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
"username": ip_proxy_info.user,
"password": ip_proxy_info.password,
}
httpx_proxy = {
f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
}
return playwright_proxy, httpx_proxy
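# Shape sketch (assumed example values only): for an IpInfoModel with protocol="http://",
# ip="1.2.3.4", port=8888, user="u", password="p", the return values would look like
# playwright_proxy = {"server": "http://1.2.3.4:8888", "username": "u", "password": "p"}
# and httpx_proxy = {"http://": "http://u:p@1.2.3.4:8888"}.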
async def launch_browser(
self,
chromium: BrowserType,
playwright_proxy: Optional[Dict],
user_agent: Optional[str],
headless: bool = True
) -> BrowserContext:
"""Launch browser and create browser context"""
utils.logger.info(
"[BilibiliCrawler.launch_browser] Begin create browser context ...")
if config.SAVE_LOGIN_STATE:
# feat issue #14
# we will save login state to avoid login every time
user_data_dir = os.path.join(os.getcwd(), "browser_data",
config.USER_DATA_DIR % config.PLATFORM) # type: ignore
browser_context = await chromium.launch_persistent_context(
user_data_dir=user_data_dir,
accept_downloads=True,
headless=headless,
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
else:
# type: ignore
browser = await chromium.launch(headless=headless, proxy=playwright_proxy)
browser_context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context

View file

@ -0,0 +1,14 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc :
from httpx import RequestError
class DataFetchError(RequestError):
"""something error when fetch"""
class IPBlockError(RequestError):
"""fetch so fast that the server block us ip"""

View file

@ -0,0 +1,34 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/3 16:20
# @Desc :
from enum import Enum
class SearchOrderType(Enum):
# comprehensive (default) ranking
DEFAULT = ""
# most clicks
MOST_CLICK = "click"
# most recently published
LAST_PUBLISH = "pubdate"
# most danmaku (bullet comments)
MOST_DANMU = "dm"
# most favorites
MOST_MARK = "stow"
class CommentOrderType(Enum):
# by popularity only
DEFAULT = 0
# by popularity and time
MIXED = 1
# by time
TIME = 2

View file

@ -0,0 +1,70 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 23:26
# @Desc : bilibili request parameter signing (wbi)
# Reverse-engineering reference: https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95
import urllib.parse
from hashlib import md5
from typing import Dict
from tools import utils
class BilibiliSign:
def __init__(self, img_key: str, sub_key: str):
self.img_key = img_key
self.sub_key = sub_key
self.map_table = [
46, 47, 18, 2, 53, 8, 23, 32, 15, 50, 10, 31, 58, 3, 45, 35, 27, 43, 5, 49,
33, 9, 42, 19, 29, 28, 14, 39, 12, 38, 41, 13, 37, 48, 7, 16, 24, 55, 40,
61, 26, 17, 0, 1, 60, 51, 30, 4, 22, 25, 54, 21, 56, 59, 6, 63, 57, 62, 11,
36, 20, 34, 44, 52
]
def get_salt(self) -> str:
"""
Get the salted (mixin) key derived from img_key and sub_key
:return:
"""
salt = ""
mixin_key = self.img_key + self.sub_key
for mt in self.map_table:
salt += mixin_key[mt]
return salt[:32]
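# Mixing sketch (illustrative, not real keys): mixin_key = img_key + sub_key is 64 chars long;
# the salt is built by picking mixin_key[46], mixin_key[47], mixin_key[18], ... in map_table
# order and keeping only the first 32 characters.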
def sign(self, req_data: Dict) -> Dict:
"""
Add the current timestamp to the request params and sort the params by key in lexicographic order,
then URL-encode them and MD5 the encoded string concatenated with the salt to produce the w_rid parameter
:param req_data:
:return:
"""
current_ts = utils.get_unix_timestamp()
req_data.update({"wts": current_ts})
req_data = dict(sorted(req_data.items()))
req_data = {
# filter the characters "!'()*" out of each value
k: ''.join(filter(lambda ch: ch not in "!'()*", str(v)))
for k, v
in req_data.items()
}
query = urllib.parse.urlencode(req_data)
salt = self.get_salt()
wbi_sign = md5((query + salt).encode()).hexdigest()  # compute w_rid
req_data['w_rid'] = wbi_sign
return req_data
if __name__ == '__main__':
_img_key = "7cd084941338484aae1ad9425b84077c"
_sub_key = "4932caff0ff746eab6f01bf08b70ac45"
_search_url = "__refresh__=true&_extra=&ad_resource=5654&category_id=&context=&dynamic_offset=0&from_source=&from_spmid=333.337&gaia_vtoken=&highlight=1&keyword=python&order=click&page=1&page_size=20&platform=pc&qv_id=OQ8f2qtgYdBV1UoEnqXUNUl8LEDAdzsD&search_type=video&single_column=0&source_tag=3&web_location=1430654"
_req_data = dict()
for params in _search_url.split("&"):
kvalues = params.split("=")
key = kvalues[0]
value = kvalues[1]
_req_data[key] = value
print("pre req_data", _req_data)
_req_data = BilibiliSign(img_key=_img_key, sub_key=_sub_key).sign(req_data={"aid":170001})
print(_req_data)

View file

@ -0,0 +1,107 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc : bilibili login implementation
import asyncio
import functools
import sys
from typing import Optional
from playwright.async_api import BrowserContext, Page
from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
wait_fixed)
import config
from base.base_crawler import AbstractLogin
from tools import utils
class BilibiliLogin(AbstractLogin):
def __init__(self,
login_type: str,
browser_context: BrowserContext,
context_page: Page,
login_phone: Optional[str] = "",
cookie_str: str = ""
):
config.LOGIN_TYPE = login_type
self.browser_context = browser_context
self.context_page = context_page
self.login_phone = login_phone
self.cookie_str = cookie_str
async def begin(self):
"""Start login bilibili"""
utils.logger.info("[BilibiliLogin.begin] Begin login Bilibili ...")
if config.LOGIN_TYPE == "qrcode":
await self.login_by_qrcode()
elif config.LOGIN_TYPE == "phone":
await self.login_by_mobile()
elif config.LOGIN_TYPE == "cookie":
await self.login_by_cookies()
else:
raise ValueError(
"[BilibiliLogin.begin] Invalid Login Type Currently only supported qrcode or phone or cookie ...")
@retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
async def check_login_state(self) -> bool:
"""
Check whether the current login state is successful; return True on success, otherwise False.
The retry decorator retries up to 600 times while the return value is False, with a 1-second interval.
If the maximum number of retries is reached, a RetryError is raised.
"""
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
if cookie_dict.get("SESSDATA", "") or cookie_dict.get("DedeUserID"):
return True
return False
async def login_by_qrcode(self):
"""login bilibili website and keep webdriver login state"""
utils.logger.info("[BilibiliLogin.login_by_qrcode] Begin login bilibili by qrcode ...")
# click login button
login_button_ele = self.context_page.locator(
"xpath=//div[@class='right-entry__outside go-login-btn']//div"
)
await login_button_ele.click()
# find login qrcode
qrcode_img_selector = "//div[@class='login-scan-box']//img"
base64_qrcode_img = await utils.find_login_qrcode(
self.context_page,
selector=qrcode_img_selector
)
if not base64_qrcode_img:
utils.logger.info("[BilibiliLogin.login_by_qrcode] login failed , have not found qrcode please check ....")
sys.exit()
# show login qrcode
partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
utils.logger.info(f"[BilibiliLogin.login_by_qrcode] Waiting for scan code login, remaining time is 20s")
try:
await self.check_login_state()
except RetryError:
utils.logger.info("[BilibiliLogin.login_by_qrcode] Login bilibili failed by qrcode login method ...")
sys.exit()
wait_redirect_seconds = 5
utils.logger.info(
f"[BilibiliLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
await asyncio.sleep(wait_redirect_seconds)
async def login_by_mobile(self):
pass
async def login_by_cookies(self):
utils.logger.info("[BilibiliLogin.login_by_qrcode] Begin login bilibili by cookie ...")
for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
await self.browser_context.add_cookies([{
'name': key,
'value': value,
'domain': ".bilibili.com",
'path': "/"
}])

View file

@ -0,0 +1 @@
from .core import DouYinCrawler

View file

@ -0,0 +1,280 @@
import asyncio
import copy
import json
import urllib.parse
from typing import Any, Callable, Dict, List, Optional
import execjs
import httpx
from playwright.async_api import BrowserContext, Page
from base.base_crawler import AbstractApiClient
from tools import utils
from var import request_keyword_var
from .exception import *
from .field import *
class DOUYINClient(AbstractApiClient):
def __init__(
self,
timeout=30,
proxies=None,
*,
headers: Dict,
playwright_page: Optional[Page],
cookie_dict: Dict
):
self.proxies = proxies
self.timeout = timeout
self.headers = headers
self._host = "https://www.douyin.com"
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
async def __process_req_params(self, params: Optional[Dict] = None, headers: Optional[Dict] = None):
if not params:
return
headers = headers or self.headers
local_storage: Dict = await self.playwright_page.evaluate("() => window.localStorage") # type: ignore
douyin_js_obj = execjs.compile(open('libs/douyin.js').read())
common_params = {
"device_platform": "webapp",
"aid": "6383",
"channel": "channel_pc_web",
"cookie_enabled": "true",
"browser_language": "zh-CN",
"browser_platform": "Win32",
"browser_name": "Firefox",
"browser_version": "110.0",
"browser_online": "true",
"engine_name": "Gecko",
"os_name": "Windows",
"os_version": "10",
"engine_version": "109.0",
"platform": "PC",
"screen_width": "1920",
"screen_height": "1200",
# " webid": douyin_js_obj.call("get_web_id"),
# "msToken": local_storage.get("xmst"),
# "msToken": "abL8SeUTPa9-EToD8qfC7toScSADxpg6yLh2dbNcpWHzE0bT04txM_4UwquIcRvkRb9IU8sifwgM1Kwf1Lsld81o9Irt2_yNyUbbQPSUO8EfVlZJ_78FckDFnwVBVUVK",
}
params.update(common_params)
query = '&'.join([f'{k}={v}' for k, v in params.items()])
x_bogus = douyin_js_obj.call('sign', query, headers["User-Agent"])
params["X-Bogus"] = x_bogus
# print(x_bogus, query)
async def request(self, method, url, **kwargs):
async with httpx.AsyncClient(proxies=self.proxies) as client:
response = await client.request(
method, url, timeout=self.timeout,
**kwargs
)
try:
return response.json()
except Exception as e:
raise DataFetchError(f"{e}, {response.text}")
async def get(self, uri: str, params: Optional[Dict] = None, headers: Optional[Dict] = None):
await self.__process_req_params(params, headers)
headers = headers or self.headers
return await self.request(method="GET", url=f"{self._host}{uri}", params=params, headers=headers)
async def post(self, uri: str, data: dict, headers: Optional[Dict] = None):
await self.__process_req_params(data, headers)
headers = headers or self.headers
return await self.request(method="POST", url=f"{self._host}{uri}", data=data, headers=headers)
async def pong(self, browser_context: BrowserContext) -> bool:
local_storage = await self.playwright_page.evaluate("() => window.localStorage")
if local_storage.get("HasUserLogin", "") == "1":
return True
_, cookie_dict = utils.convert_cookies(await browser_context.cookies())
return cookie_dict.get("LOGIN_STATUS") == "1"
async def update_cookies(self, browser_context: BrowserContext):
cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
self.headers["Cookie"] = cookie_str
self.cookie_dict = cookie_dict
async def search_info_by_keyword(
self,
keyword: str,
offset: int = 0,
search_channel: SearchChannelType = SearchChannelType.GENERAL,
sort_type: SearchSortType = SearchSortType.GENERAL,
publish_time: PublishTimeType = PublishTimeType.UNLIMITED
):
"""
DouYin Web Search API
:param keyword:
:param offset:
:param search_channel:
:param sort_type:
:param publish_time:
:return:
"""
params = {
"keyword": urllib.parse.quote(keyword),
"search_channel": search_channel.value,
"search_source": "normal_search",
"query_correct_type": 1,
"is_filter_search": 0,
"offset": offset,
"count": 10 # must be set to 10
}
if sort_type != SearchSortType.GENERAL or publish_time != PublishTimeType.UNLIMITED:
params["filter_selected"] = urllib.parse.quote(json.dumps({
"sort_type": str(sort_type.value),
"publish_time": str(publish_time.value)
}))
params["is_filter_search"] = 1
params["search_source"] = "tab_search"
referer_url = "https://www.douyin.com/search/" + keyword
referer_url += f"?publish_time={publish_time.value}&sort_type={sort_type.value}&type=general"
headers = copy.copy(self.headers)
headers["Referer"] = urllib.parse.quote(referer_url, safe=':/')
return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
async def get_video_by_id(self, aweme_id: str) -> Any:
"""
DouYin Video Detail API
:param aweme_id:
:return:
"""
params = {
"aweme_id": aweme_id
}
headers = copy.copy(self.headers)
# headers["Cookie"] = "s_v_web_id=verify_lol4a8dv_wpQ1QMyP_xemd_4wON_8Yzr_FJa8DN1vdY2m;"
del headers["Origin"]
res = await self.get("/aweme/v1/web/aweme/detail/", params, headers)
return res.get("aweme_detail", {})
async def get_aweme_comments(self, aweme_id: str, cursor: int = 0):
"""get note comments
"""
uri = "/aweme/v1/web/comment/list/"
params = {
"aweme_id": aweme_id,
"cursor": cursor,
"count": 20,
"item_type": 0
}
keywords = request_keyword_var.get()
referer_url = "https://www.douyin.com/search/" + keywords + '?aid=3a3cec5a-9e27-4040-b6aa-ef548c2c1138&publish_time=0&sort_type=0&source=search_history&type=general'
headers = copy.copy(self.headers)
headers["Referer"] = urllib.parse.quote(referer_url, safe=':/')
return await self.get(uri, params, headers=headers)
async def get_sub_comments(self, comment_id: str, cursor: int = 0):
"""
Get sub-comments (replies to a comment)
"""
uri = "/aweme/v1/web/comment/list/reply/"
params = {
'comment_id': comment_id,
"cursor": cursor,
"count": 20,
"item_type": 0,
}
keywords = request_keyword_var.get()
referer_url = "https://www.douyin.com/search/" + keywords + '?aid=3a3cec5a-9e27-4040-b6aa-ef548c2c1138&publish_time=0&sort_type=0&source=search_history&type=general'
headers = copy.copy(self.headers)
headers["Referer"] = urllib.parse.quote(referer_url, safe=':/')
return await self.get(uri, params, headers=headers)
async def get_aweme_all_comments(
self,
aweme_id: str,
crawl_interval: float = 1.0,
is_fetch_sub_comments=False,
callback: Optional[Callable] = None,
):
"""
Get all comments of a post, including sub-comments
:param aweme_id: post ID
:param crawl_interval: crawl interval in seconds
:param is_fetch_sub_comments: whether to fetch sub-comments
:param callback: callback used to process each batch of fetched comments
:return: list of comments
"""
result = []
comments_has_more = 1
comments_cursor = 0
while comments_has_more:
comments_res = await self.get_aweme_comments(aweme_id, comments_cursor)
comments_has_more = comments_res.get("has_more", 0)
comments_cursor = comments_res.get("cursor", 0)
comments = comments_res.get("comments", [])
if not comments:
continue
result.extend(comments)
if callback: # invoke the callback if one was provided
await callback(aweme_id, comments)
await asyncio.sleep(crawl_interval)
if not is_fetch_sub_comments:
continue
# fetch second-level (sub) comments
for comment in comments:
reply_comment_total = comment.get("reply_comment_total")
if reply_comment_total > 0:
comment_id = comment.get("cid")
sub_comments_has_more = 1
sub_comments_cursor = 0
while sub_comments_has_more:
sub_comments_res = await self.get_sub_comments(comment_id, sub_comments_cursor)
sub_comments_has_more = sub_comments_res.get("has_more", 0)
sub_comments_cursor = sub_comments_res.get("cursor", 0)
sub_comments = sub_comments_res.get("comments", [])
if not sub_comments:
continue
result.extend(sub_comments)
if callback: # invoke the callback if one was provided
await callback(aweme_id, sub_comments)
await asyncio.sleep(crawl_interval)
return result
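# Usage sketch (hypothetical caller; assumes an initialized client `dy_client` and an async
# callback `save_comments(aweme_id, comments)` defined elsewhere):
# comments = await dy_client.get_aweme_all_comments(
#     aweme_id="7123456789012345678",  # illustrative aweme id
#     crawl_interval=1.0,
#     is_fetch_sub_comments=True,
#     callback=save_comments,
# )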
async def get_user_info(self, sec_user_id: str):
uri = "/aweme/v1/web/user/profile/other/"
params = {
"sec_user_id": sec_user_id,
"publish_video_strategy_type": 2,
"personal_center_strategy": 1,
}
return await self.get(uri, params)
async def get_user_aweme_posts(self, sec_user_id: str, max_cursor: str = "") -> Dict:
uri = "/aweme/v1/web/aweme/post/"
params = {
"sec_user_id": sec_user_id,
"count": 18,
"max_cursor": max_cursor,
"locate_query": "false",
"publish_video_strategy_type": 2
}
return await self.get(uri, params)
async def get_all_user_aweme_posts(self, sec_user_id: str, callback: Optional[Callable] = None):
posts_has_more = 1
max_cursor = ""
result = []
while posts_has_more == 1:
aweme_post_res = await self.get_user_aweme_posts(sec_user_id, max_cursor)
posts_has_more = aweme_post_res.get("has_more", 0)
max_cursor = aweme_post_res.get("max_cursor")
aweme_list = aweme_post_res.get("aweme_list") if aweme_post_res.get("aweme_list") else []
utils.logger.info(
f"[DOUYINClient.get_all_user_aweme_posts] got sec_user_id:{sec_user_id} video len : {len(aweme_list)}")
if callback:
await callback(aweme_list)
result.extend(aweme_list)
return result

View file

@ -0,0 +1,271 @@
import asyncio
import os
import random
from asyncio import Task
from typing import Any, Dict, List, Optional, Tuple
from playwright.async_api import (BrowserContext, BrowserType, Page,
async_playwright)
import config
from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import douyin as douyin_store
from tools import utils
from var import crawler_type_var
from .client import DOUYINClient
from .exception import DataFetchError
from .field import PublishTimeType
from .login import DouYinLogin
class DouYinCrawler(AbstractCrawler):
context_page: Page
dy_client: DOUYINClient
browser_context: BrowserContext
def __init__(self) -> None:
self.user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" # fixed
self.index_url = "https://www.douyin.com"
async def start(self) -> None:
playwright_proxy_format, httpx_proxy_format = None, None
if config.ENABLE_IP_PROXY:
ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page()
await self.context_page.goto(self.index_url)
self.dy_client = await self.create_douyin_client(httpx_proxy_format)
if not await self.dy_client.pong(browser_context=self.browser_context):
login_obj = DouYinLogin(
login_type=config.LOGIN_TYPE,
login_phone="", # you phone number
browser_context=self.browser_context,
context_page=self.context_page,
cookie_str=config.COOKIES
)
await login_obj.begin()
await self.dy_client.update_cookies(browser_context=self.browser_context)
crawler_type_var.set(config.CRAWLER_TYPE)
if config.CRAWLER_TYPE == "search":
# Search for notes and retrieve their comment information.
await self.search()
elif config.CRAWLER_TYPE == "detail":
# Get the information and comments of the specified post
await self.get_specified_awemes()
elif config.CRAWLER_TYPE == "creator":
# Get the information and comments of the specified creator
await self.get_creators_and_videos()
utils.logger.info("[DouYinCrawler.start] Douyin Crawler finished ...")
async def search(self) -> None:
utils.logger.info("[DouYinCrawler.search] Begin search douyin keywords")
dy_limit_count = 10 # douyin fixed page size
if config.CRAWLER_MAX_NOTES_COUNT < dy_limit_count:
config.CRAWLER_MAX_NOTES_COUNT = dy_limit_count
start_page = config.START_PAGE # start page number
for keyword in config.KEYWORDS.split(","):
utils.logger.info(f"[DouYinCrawler.search] Current keyword: {keyword}")
aweme_list: List[str] = []
page = 0
while (page - start_page + 1) * dy_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
if page < start_page:
utils.logger.info(f"[DouYinCrawler.search] Skip {page}")
page += 1
continue
try:
utils.logger.info(f"[DouYinCrawler.search] search douyin keyword: {keyword}, page: {page}")
posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
offset=page * dy_limit_count - dy_limit_count,
publish_time=PublishTimeType(config.PUBLISH_TIME_TYPE)
)
except DataFetchError:
utils.logger.error(f"[DouYinCrawler.search] search douyin keyword: {keyword} failed")
break
page += 1
if "data" not in posts_res:
utils.logger.error(
f"[DouYinCrawler.search] search douyin keyword: {keyword} failed账号也许被风控了。")
break
for post_item in posts_res.get("data"):
try:
aweme_info: Dict = post_item.get("aweme_info") or \
post_item.get("aweme_mix_info", {}).get("mix_items")[0]
except TypeError:
continue
aweme_list.append(aweme_info.get("aweme_id", ""))
await douyin_store.update_douyin_aweme(aweme_item=aweme_info)
utils.logger.info(f"[DouYinCrawler.search] keyword:{keyword}, aweme_list:{aweme_list}")
await self.batch_get_note_comments(aweme_list)
async def get_specified_awemes(self):
"""Get the information and comments of the specified post"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_aweme_detail(aweme_id=aweme_id, semaphore=semaphore) for aweme_id in config.DY_SPECIFIED_ID_LIST
]
aweme_details = await asyncio.gather(*task_list)
for aweme_detail in aweme_details:
if aweme_detail is not None:
await douyin_store.update_douyin_aweme(aweme_detail)
await self.batch_get_note_comments(config.DY_SPECIFIED_ID_LIST)
async def get_aweme_detail(self, aweme_id: str, semaphore: asyncio.Semaphore) -> Any:
"""Get note detail"""
async with semaphore:
try:
return await self.dy_client.get_video_by_id(aweme_id)
except DataFetchError as ex:
utils.logger.error(f"[DouYinCrawler.get_aweme_detail] Get aweme detail error: {ex}")
return None
except KeyError as ex:
utils.logger.error(
f"[DouYinCrawler.get_aweme_detail] have not fund note detail aweme_id:{aweme_id}, err: {ex}")
return None
async def batch_get_note_comments(self, aweme_list: List[str]) -> None:
"""
Batch get note comments
"""
if not config.ENABLE_GET_COMMENTS:
utils.logger.info(f"[DouYinCrawler.batch_get_note_comments] Crawling comment mode is not enabled")
return
task_list: List[Task] = []
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
for aweme_id in aweme_list:
task = asyncio.create_task(
self.get_comments(aweme_id, semaphore), name=aweme_id)
task_list.append(task)
if len(task_list) > 0:
await asyncio.wait(task_list)
async def get_comments(self, aweme_id: str, semaphore: asyncio.Semaphore) -> None:
async with semaphore:
try:
# hand off to get_aweme_all_comments to fetch all comments for this aweme
await self.dy_client.get_aweme_all_comments(
aweme_id=aweme_id,
crawl_interval=random.random(),
is_fetch_sub_comments=config.ENABLE_GET_SUB_COMMENTS,
callback=douyin_store.batch_update_dy_aweme_comments
)
utils.logger.info(
f"[DouYinCrawler.get_comments] aweme_id: {aweme_id} comments have all been obtained and filtered ...")
except DataFetchError as e:
utils.logger.error(f"[DouYinCrawler.get_comments] aweme_id: {aweme_id} get comments failed, error: {e}")
async def get_creators_and_videos(self) -> None:
"""
Get the information and videos of the specified creator
"""
utils.logger.info("[DouYinCrawler.get_creators_and_videos] Begin get douyin creators")
for user_id in config.DY_CREATOR_ID_LIST:
creator_info: Dict = await self.dy_client.get_user_info(user_id)
if creator_info:
await douyin_store.save_creator(user_id, creator=creator_info)
# Get all video information of the creator
all_video_list = await self.dy_client.get_all_user_aweme_posts(
sec_user_id=user_id,
callback=self.fetch_creator_video_detail
)
video_ids = [video_item.get("aweme_id") for video_item in all_video_list]
await self.batch_get_note_comments(video_ids)
async def fetch_creator_video_detail(self, video_list: List[Dict]):
"""
Concurrently obtain the specified post list and save the data
"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_aweme_detail(post_item.get("aweme_id"), semaphore) for post_item in video_list
]
note_details = await asyncio.gather(*task_list)
for aweme_item in note_details:
if aweme_item is not None:
await douyin_store.update_douyin_aweme(aweme_item)
@staticmethod
def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
"""format proxy info for playwright and httpx"""
playwright_proxy = {
"server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
"username": ip_proxy_info.user,
"password": ip_proxy_info.password,
}
httpx_proxy = {
f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
}
return playwright_proxy, httpx_proxy
async def create_douyin_client(self, httpx_proxy: Optional[str]) -> DOUYINClient:
"""Create douyin client"""
cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies()) # type: ignore
douyin_client = DOUYINClient(
proxies=httpx_proxy,
headers={
"User-Agent": self.user_agent,
"Cookie": cookie_str,
"Host": "www.douyin.com",
"Origin": "https://www.douyin.com/",
"Referer": "https://www.douyin.com/",
"Content-Type": "application/json;charset=UTF-8"
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
)
return douyin_client
async def launch_browser(
self,
chromium: BrowserType,
playwright_proxy: Optional[Dict],
user_agent: Optional[str],
headless: bool = True
) -> BrowserContext:
"""Launch browser and create browser context"""
if config.SAVE_LOGIN_STATE:
user_data_dir = os.path.join(os.getcwd(), "browser_data",
config.USER_DATA_DIR % config.PLATFORM) # type: ignore
browser_context = await chromium.launch_persistent_context(
user_data_dir=user_data_dir,
accept_downloads=True,
headless=headless,
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
) # type: ignore
return browser_context
else:
browser = await chromium.launch(headless=headless, proxy=playwright_proxy) # type: ignore
browser_context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
async def close(self) -> None:
"""Close browser context"""
await self.browser_context.close()
utils.logger.info("[DouYinCrawler.close] Browser context closed ...")

View file

@ -0,0 +1,9 @@
from httpx import RequestError
class DataFetchError(RequestError):
"""something error when fetch"""
class IPBlockError(RequestError):
"""fetch so fast that the server block us ip"""

View file

@ -0,0 +1,23 @@
from enum import Enum
class SearchChannelType(Enum):
"""search channel type"""
GENERAL = "aweme_general" # 综合
VIDEO = "aweme_video_web" # 视频
USER = "aweme_user_web" # 用户
LIVE = "aweme_live" # 直播
class SearchSortType(Enum):
"""search sort type"""
GENERAL = 0 # comprehensive ranking
MOST_LIKE = 1 # most likes
LATEST = 2 # most recently published
class PublishTimeType(Enum):
"""publish time type"""
UNLIMITED = 0 # unlimited
ONE_DAY = 1 # within one day
ONE_WEEK = 7 # within one week
SIX_MONTH = 180 # within six months

View file

@ -0,0 +1,254 @@
import asyncio
import functools
import sys
from typing import Optional
from playwright.async_api import BrowserContext, Page
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
wait_fixed)
import config
from base.base_crawler import AbstractLogin
from cache.cache_factory import CacheFactory
from tools import utils
class DouYinLogin(AbstractLogin):
def __init__(self,
login_type: str,
browser_context: BrowserContext, # type: ignore
context_page: Page, # type: ignore
login_phone: Optional[str] = "",
cookie_str: Optional[str] = ""
):
config.LOGIN_TYPE = login_type
self.browser_context = browser_context
self.context_page = context_page
self.login_phone = login_phone
self.scan_qrcode_time = 60
self.cookie_str = cookie_str
async def begin(self):
"""
Start login douyin website
The slider verification on the intermediate captcha page is not handled very reliably... Unless you have special requirements, it is recommended not to use Douyin QR/phone login and to log in with cookies instead.
"""
# popup login dialog
await self.popup_login_dialog()
# select login type
if config.LOGIN_TYPE == "qrcode":
await self.login_by_qrcode()
elif config.LOGIN_TYPE == "phone":
await self.login_by_mobile()
elif config.LOGIN_TYPE == "cookie":
await self.login_by_cookies()
else:
raise ValueError("[DouYinLogin.begin] Invalid Login Type Currently only supported qrcode or phone or cookie ...")
# if the page redirects to a slider-captcha page, the slider needs to be solved again
await asyncio.sleep(6)
current_page_title = await self.context_page.title()
if "验证码中间页" in current_page_title:
await self.check_page_display_slider(move_step=3, slider_level="hard")
# check login state
utils.logger.info(f"[DouYinLogin.begin] login finished then check login state ...")
try:
await self.check_login_state()
except RetryError:
utils.logger.info("[DouYinLogin.begin] login failed please confirm ...")
sys.exit()
# wait for redirect
wait_redirect_seconds = 5
utils.logger.info(f"[DouYinLogin.begin] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
await asyncio.sleep(wait_redirect_seconds)
@retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
async def check_login_state(self):
"""Check if the current login status is successful and return True otherwise return False"""
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
for page in self.browser_context.pages:
try:
local_storage = await page.evaluate("() => window.localStorage")
if local_storage.get("HasUserLogin", "") == "1":
return True
except Exception as e:
# utils.logger.warn(f"[DouYinLogin] check_login_state waring: {e}")
await asyncio.sleep(0.1)
if cookie_dict.get("LOGIN_STATUS") == "1":
return True
return False
async def popup_login_dialog(self):
"""If the login dialog box does not pop up automatically, we will manually click the login button"""
dialog_selector = "xpath=//div[@id='login-pannel']"
try:
# check dialog box is auto popup and wait for 10 seconds
await self.context_page.wait_for_selector(dialog_selector, timeout=1000 * 10)
except Exception as e:
utils.logger.error(f"[DouYinLogin.popup_login_dialog] login dialog box does not pop up automatically, error: {e}")
utils.logger.info("[DouYinLogin.popup_login_dialog] login dialog box does not pop up automatically, we will manually click the login button")
login_button_ele = self.context_page.locator("xpath=//p[text() = '登录']")
await login_button_ele.click()
await asyncio.sleep(0.5)
async def login_by_qrcode(self):
utils.logger.info("[DouYinLogin.login_by_qrcode] Begin login douyin by qrcode...")
qrcode_img_selector = "xpath=//article[@class='web-login']//img"
base64_qrcode_img = await utils.find_login_qrcode(
self.context_page,
selector=qrcode_img_selector
)
if not base64_qrcode_img:
utils.logger.info("[DouYinLogin.login_by_qrcode] login qrcode not found please confirm ...")
sys.exit()
partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
await asyncio.sleep(2)
async def login_by_mobile(self):
utils.logger.info("[DouYinLogin.login_by_mobile] Begin login douyin by mobile ...")
mobile_tap_ele = self.context_page.locator("xpath=//li[text() = '验证码登录']")
await mobile_tap_ele.click()
await self.context_page.wait_for_selector("xpath=//article[@class='web-login-mobile-code']")
mobile_input_ele = self.context_page.locator("xpath=//input[@placeholder='手机号']")
await mobile_input_ele.fill(self.login_phone)
await asyncio.sleep(0.5)
send_sms_code_btn = self.context_page.locator("xpath=//span[text() = '获取验证码']")
await send_sms_code_btn.click()
# check whether a slider captcha appears
await self.check_page_display_slider(move_step=10, slider_level="easy")
cache_client = CacheFactory.create_cache(config.CACHE_TYPE_MEMORY)
max_get_sms_code_time = 60 * 2 # wait at most 2 minutes for the SMS code
while max_get_sms_code_time > 0:
utils.logger.info(f"[DouYinLogin.login_by_mobile] get douyin sms code from redis remaining time {max_get_sms_code_time}s ...")
await asyncio.sleep(1)
sms_code_key = f"dy_{self.login_phone}"
sms_code_value = cache_client.get(sms_code_key)
if not sms_code_value:
max_get_sms_code_time -= 1
continue
sms_code_input_ele = self.context_page.locator("xpath=//input[@placeholder='请输入验证码']")
await sms_code_input_ele.fill(value=sms_code_value.decode())
await asyncio.sleep(0.5)
submit_btn_ele = self.context_page.locator("xpath=//button[@class='web-login-button']")
await submit_btn_ele.click() # click the login button
# todo ... should also verify that the SMS code is correct; the entered code may be wrong
break
async def check_page_display_slider(self, move_step: int = 10, slider_level: str = "easy"):
"""
Check whether a slider captcha appears on the page
:return:
"""
# wait for the slider captcha to appear
back_selector = "#captcha-verify-image"
try:
await self.context_page.wait_for_selector(selector=back_selector, state="visible", timeout=30 * 1000)
except PlaywrightTimeoutError: # no slider captcha appeared, return directly
return
gap_selector = 'xpath=//*[@id="captcha_container"]/div/div[2]/img[2]'
max_slider_try_times = 20
slider_verify_success = False
while not slider_verify_success:
if max_slider_try_times <= 0:
utils.logger.error("[DouYinLogin.check_page_display_slider] slider verify failed ...")
sys.exit()
try:
await self.move_slider(back_selector, gap_selector, move_step, slider_level)
await asyncio.sleep(1)
# if the slide was too slow or verification failed, the page shows a "too slow" hint; click the refresh button here
page_content = await self.context_page.content()
if "操作过慢" in page_content or "提示重新操作" in page_content:
utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify failed, retry ...")
await self.context_page.click(selector="//a[contains(@class, 'secsdk_captcha_refresh')]")
continue
# after a successful slide, wait for the slider to disappear
await self.context_page.wait_for_selector(selector=back_selector, state="hidden", timeout=1000)
# if the slider disappeared, verification succeeded and we break out of the loop; if not, the wait above raises, the exception is caught, and the slider verification is retried
utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify success ...")
slider_verify_success = True
except Exception as e:
utils.logger.error(f"[DouYinLogin.check_page_display_slider] slider verify failed, error: {e}")
await asyncio.sleep(1)
max_slider_try_times -= 1
utils.logger.info(f"[DouYinLogin.check_page_display_slider] remaining slider try times: {max_slider_try_times}")
continue
async def move_slider(self, back_selector: str, gap_selector: str, move_step: int = 10, slider_level="easy"):
"""
Move the slider to the right to complete the verification
:param back_selector: selector of the captcha background image
:param gap_selector: selector of the captcha slider (gap) piece
:param move_step: controls the speed of a single mouse move; by default the whole distance is covered in about 0.1s, and larger values move more slowly
:param slider_level: slider difficulty, "easy" or "hard", corresponding to the SMS-login slider and the intermediate captcha slider respectively
:return:
"""
# get slider background image
slider_back_elements = await self.context_page.wait_for_selector(
selector=back_selector,
timeout=1000 * 10, # wait 10 seconds
)
slide_back = str(await slider_back_elements.get_property("src")) # type: ignore
# get slider gap image
gap_elements = await self.context_page.wait_for_selector(
selector=gap_selector,
timeout=1000 * 10, # wait 10 seconds
)
gap_src = str(await gap_elements.get_property("src")) # type: ignore
# detect the gap position in the background image
slide_app = utils.Slide(gap=gap_src, bg=slide_back)
distance = slide_app.discern()
# generate the movement tracks
tracks = utils.get_tracks(distance, slider_level)
new_1 = tracks[-1] - (sum(tracks) - distance)
tracks.pop()
tracks.append(new_1)
# drag the slider along the tracks to the target position
element = await self.context_page.query_selector(gap_selector)
bounding_box = await element.bounding_box() # type: ignore
await self.context_page.mouse.move(bounding_box["x"] + bounding_box["width"] / 2, # type: ignore
bounding_box["y"] + bounding_box["height"] / 2) # type: ignore
# x coordinate of the slider's center point
x = bounding_box["x"] + bounding_box["width"] / 2 # type: ignore
# simulate the drag operation
await element.hover() # type: ignore
await self.context_page.mouse.down()
for track in tracks:
# move the mouse along each track segment
# steps controls the per-move speed: the distance is split into `steps` intermediate moves, so larger values are slower
await self.context_page.mouse.move(x + track, 0, steps=move_step)
x += track
await self.context_page.mouse.up()
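# Worked example of the track correction above (illustrative numbers): if the detected gap
# distance is 100 px but utils.get_tracks returns segments summing to 103 px, the last segment
# is shrunk by 3 px (new_1 = tracks[-1] - (103 - 100)) so the cumulative movement lands exactly
# on the gap before the mouse button is released.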
async def login_by_cookies(self):
utils.logger.info("[DouYinLogin.login_by_cookies] Begin login douyin by cookie ...")
for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
await self.browser_context.add_cookies([{
'name': key,
'value': value,
'domain': ".douyin.com",
'path': "/"
}])

View file

@ -0,0 +1,2 @@
# -*- coding: utf-8 -*-
from .core import KuaishouCrawler

View file

@ -0,0 +1,307 @@
# -*- coding: utf-8 -*-
import asyncio
import json
from typing import Any, Callable, Dict, List, Optional
from urllib.parse import urlencode
import httpx
from playwright.async_api import BrowserContext, Page
import config
from base.base_crawler import AbstractApiClient
from tools import utils
from .exception import DataFetchError
from .graphql import KuaiShouGraphQL
class KuaiShouClient(AbstractApiClient):
def __init__(
self,
timeout=10,
proxies=None,
*,
headers: Dict[str, str],
playwright_page: Page,
cookie_dict: Dict[str, str],
):
self.proxies = proxies
self.timeout = timeout
self.headers = headers
self._host = "https://www.kuaishou.com/graphql"
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
self.graphql = KuaiShouGraphQL()
async def request(self, method, url, **kwargs) -> Any:
async with httpx.AsyncClient(proxies=self.proxies) as client:
response = await client.request(
method, url, timeout=self.timeout,
**kwargs
)
data: Dict = response.json()
if data.get("errors"):
raise DataFetchError(data.get("errors", "unkonw error"))
else:
return data.get("data", {})
async def get(self, uri: str, params=None) -> Dict:
final_uri = uri
if isinstance(params, dict):
final_uri = (f"{uri}?"
f"{urlencode(params)}")
return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=self.headers)
async def post(self, uri: str, data: dict) -> Dict:
json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
return await self.request(method="POST", url=f"{self._host}{uri}",
data=json_str, headers=self.headers)
async def pong(self) -> bool:
"""get a note to check if login state is ok"""
utils.logger.info("[KuaiShouClient.pong] Begin pong kuaishou...")
ping_flag = False
try:
post_data = {
"operationName": "visionProfileUserList",
"variables": {
"ftype": 1,
},
"query": self.graphql.get("vision_profile_user_list")
}
res = await self.post("", post_data)
if res.get("visionProfileUserList", {}).get("result") == 1:
ping_flag = True
except Exception as e:
utils.logger.error(f"[KuaiShouClient.pong] Pong kuaishou failed: {e}, and try to login again...")
ping_flag = False
return ping_flag
async def update_cookies(self, browser_context: BrowserContext):
cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
self.headers["Cookie"] = cookie_str
self.cookie_dict = cookie_dict
async def search_info_by_keyword(self, keyword: str, pcursor: str):
"""
KuaiShou web search api
:param keyword: search keyword
:param pcursor: page cursor from the previous response
:return:
"""
post_data = {
"operationName": "visionSearchPhoto",
"variables": {
"keyword": keyword,
"pcursor": pcursor,
"page": "search"
},
"query": self.graphql.get("search_query")
}
return await self.post("", post_data)
async def get_video_info(self, photo_id: str) -> Dict:
"""
Kuaishou web video detail api
:param photo_id:
:return:
"""
post_data = {
"operationName": "visionVideoDetail",
"variables": {
"photoId": photo_id,
"page": "search"
},
"query": self.graphql.get("video_detail")
}
return await self.post("", post_data)
async def get_video_comments(self, photo_id: str, pcursor: str = "") -> Dict:
"""get video comments
:param photo_id: photo id you want to fetch
:param pcursor: pcursor returned by the previous request, defaults to ""
:return:
"""
post_data = {
"operationName": "commentListQuery",
"variables": {
"photoId": photo_id,
"pcursor": pcursor
},
"query": self.graphql.get("comment_list")
}
return await self.post("", post_data)
async def get_video_sub_comments(
self, photo_id: str, rootCommentId: str, pcursor: str = ""
) -> Dict:
"""get video sub comments
:param photo_id: photo id you want to fetch
:param pcursor: pcursor returned by the previous request, defaults to ""
:return:
"""
post_data = {
"operationName": "visionSubCommentList",
"variables": {
"photoId": photo_id,
"pcursor": pcursor,
"rootCommentId": rootCommentId,
},
"query": self.graphql.get("vision_sub_comment_list"),
}
return await self.post("", post_data)
async def get_creator_profile(self, userId: str) -> Dict:
post_data = {
"operationName": "visionProfile",
"variables": {
"userId": userId
},
"query": self.graphql.get("vision_profile"),
}
return await self.post("", post_data)
async def get_video_by_creater(self, userId: str, pcursor: str = "") -> Dict:
post_data = {
"operationName": "visionProfilePhotoList",
"variables": {
"page": "profile",
"pcursor": pcursor,
"userId": userId
},
"query": self.graphql.get("vision_profile_photo_list"),
}
return await self.post("", post_data)
async def get_video_all_comments(
self,
photo_id: str,
crawl_interval: float = 1.0,
callback: Optional[Callable] = None,
):
"""
get video all comments include sub comments
:param photo_id:
:param crawl_interval:
:param callback:
:return:
"""
result = []
pcursor = ""
while pcursor != "no_more":
comments_res = await self.get_video_comments(photo_id, pcursor)
vision_comment_list = comments_res.get("visionCommentList", {})
pcursor = vision_comment_list.get("pcursor", "")
comments = vision_comment_list.get("rootComments", [])
if callback: # invoke the callback if one was provided
await callback(photo_id, comments)
result.extend(comments)
await asyncio.sleep(crawl_interval)
sub_comments = await self.get_comments_all_sub_comments(
comments, photo_id, crawl_interval, callback
)
result.extend(sub_comments)
return result
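# Usage sketch (hypothetical caller; assumes an initialized client `ks_client` and an async
# callback `save_comments(photo_id, comments)` defined elsewhere):
# all_comments = await ks_client.get_video_all_comments(
#     photo_id="3xabcdefg1234567",  # illustrative photo id
#     crawl_interval=1.0,
#     callback=save_comments,
# )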
async def get_comments_all_sub_comments(
self,
comments: List[Dict],
photo_id,
crawl_interval: float = 1.0,
callback: Optional[Callable] = None,
) -> List[Dict]:
"""
Get all second-level comments under the given first-level comments; this method keeps paging until all sub-comments have been fetched
Args:
comments: list of first-level comments
photo_id: video id
crawl_interval: delay between comment requests, in seconds
callback: invoked after each batch of comments has been fetched
Returns:
"""
if not config.ENABLE_GET_SUB_COMMENTS:
utils.logger.info(
f"[KuaiShouClient.get_comments_all_sub_comments] Crawling sub_comment mode is not enabled"
)
return []
result = []
for comment in comments:
sub_comments = comment.get("subComments")
if sub_comments and callback:
await callback(photo_id, sub_comments)
sub_comment_pcursor = comment.get("subCommentsPcursor")
if sub_comment_pcursor == "no_more":
continue
root_comment_id = comment.get("commentId")
sub_comment_pcursor = ""
while sub_comment_pcursor != "no_more":
comments_res = await self.get_video_sub_comments(
photo_id, root_comment_id, sub_comment_pcursor
)
vision_sub_comment_list = comments_res.get("visionSubCommentList",{})
sub_comment_pcursor = vision_sub_comment_list.get("pcursor", "no_more")
comments = vision_sub_comment_list.get("subComments", {})
if callback:
await callback(photo_id, comments)
await asyncio.sleep(crawl_interval)
result.extend(comments)
return result
async def get_creator_info(self, user_id: str) -> Dict:
"""
eg: https://www.kuaishou.com/profile/3x4jtnbfter525a
Kuaishou user profile page
"""
visionProfile = await self.get_creator_profile(user_id)
return visionProfile.get("userProfile")
async def get_all_videos_by_creator(
self,
user_id: str,
crawl_interval: float = 1.0,
callback: Optional[Callable] = None,
) -> List[Dict]:
"""
Get all posts published by the given user; this method keeps paging until every post has been fetched
Args:
user_id: user ID
crawl_interval: delay between requests, in seconds
callback: update callback invoked after each page has been fetched
Returns:
"""
result = []
pcursor = ""
while pcursor != "no_more":
videos_res = await self.get_video_by_creater(user_id, pcursor)
if not videos_res:
utils.logger.error(
f"[KuaiShouClient.get_all_videos_by_creator] The current creator may have been banned by ks, so they cannot access the data."
)
break
vision_profile_photo_list = videos_res.get("visionProfilePhotoList", {})
pcursor = vision_profile_photo_list.get("pcursor", "")
videos = vision_profile_photo_list.get("feeds", [])
utils.logger.info(
f"[KuaiShouClient.get_all_videos_by_creator] got user_id:{user_id} videos len : {len(videos)}"
)
if callback:
await callback(videos)
await asyncio.sleep(crawl_interval)
result.extend(videos)
return result

View file

@ -0,0 +1,288 @@
import asyncio
import os
import random
import time
from asyncio import Task
from typing import Dict, List, Optional, Tuple
from playwright.async_api import (BrowserContext, BrowserType, Page,
async_playwright)
import config
from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import kuaishou as kuaishou_store
from tools import utils
from var import comment_tasks_var, crawler_type_var
from .client import KuaiShouClient
from .exception import DataFetchError
from .login import KuaishouLogin
class KuaishouCrawler(AbstractCrawler):
context_page: Page
ks_client: KuaiShouClient
browser_context: BrowserContext
def __init__(self):
self.index_url = "https://www.kuaishou.com"
self.user_agent = utils.get_user_agent()
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
if config.ENABLE_IP_PROXY:
ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page()
await self.context_page.goto(f"{self.index_url}?isHome=1")
# Create a client to interact with the kuaishou website.
self.ks_client = await self.create_ks_client(httpx_proxy_format)
if not await self.ks_client.pong():
login_obj = KuaishouLogin(
login_type=config.LOGIN_TYPE,
login_phone=httpx_proxy_format,
browser_context=self.browser_context,
context_page=self.context_page,
cookie_str=config.COOKIES
)
await login_obj.begin()
await self.ks_client.update_cookies(browser_context=self.browser_context)
crawler_type_var.set(config.CRAWLER_TYPE)
if config.CRAWLER_TYPE == "search":
# Search for videos and retrieve their comment information.
await self.search()
elif config.CRAWLER_TYPE == "detail":
# Get the information and comments of the specified post
await self.get_specified_videos()
elif config.CRAWLER_TYPE == "creator":
# Get creator's information and their videos and comments
await self.get_creators_and_videos()
else:
pass
utils.logger.info("[KuaishouCrawler.start] Kuaishou Crawler finished ...")
async def search(self):
utils.logger.info("[KuaishouCrawler.search] Begin search kuaishou keywords")
ks_limit_count = 20 # kuaishou fixed page size
if config.CRAWLER_MAX_NOTES_COUNT < ks_limit_count:
config.CRAWLER_MAX_NOTES_COUNT = ks_limit_count
start_page = config.START_PAGE
for keyword in config.KEYWORDS.split(","):
utils.logger.info(f"[KuaishouCrawler.search] Current search keyword: {keyword}")
page = 1
while (page - start_page + 1) * ks_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
if page < start_page:
utils.logger.info(f"[KuaishouCrawler.search] Skip page: {page}")
page += 1
continue
utils.logger.info(f"[KuaishouCrawler.search] search kuaishou keyword: {keyword}, page: {page}")
video_id_list: List[str] = []
videos_res = await self.ks_client.search_info_by_keyword(
keyword=keyword,
pcursor=str(page),
)
if not videos_res:
utils.logger.error(f"[KuaishouCrawler.search] search info by keyword:{keyword} not found data")
continue
vision_search_photo: Dict = videos_res.get("visionSearchPhoto")
if vision_search_photo.get("result") != 1:
utils.logger.error(f"[KuaishouCrawler.search] search info by keyword:{keyword} not found data ")
continue
for video_detail in vision_search_photo.get("feeds"):
video_id_list.append(video_detail.get("photo", {}).get("id"))
await kuaishou_store.update_kuaishou_video(video_item=video_detail)
# batch fetch video comments
page += 1
await self.batch_get_video_comments(video_id_list)
async def get_specified_videos(self):
"""Get the information and comments of the specified post"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_video_info_task(video_id=video_id, semaphore=semaphore) for video_id in config.KS_SPECIFIED_ID_LIST
]
video_details = await asyncio.gather(*task_list)
for video_detail in video_details:
if video_detail is not None:
await kuaishou_store.update_kuaishou_video(video_detail)
await self.batch_get_video_comments(config.KS_SPECIFIED_ID_LIST)
async def get_video_info_task(self, video_id: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
"""Get video detail task"""
async with semaphore:
try:
result = await self.ks_client.get_video_info(video_id)
utils.logger.info(f"[KuaishouCrawler.get_video_info_task] Get video_id:{video_id} info result: {result} ...")
return result.get("visionVideoDetail")
except DataFetchError as ex:
utils.logger.error(f"[KuaishouCrawler.get_video_info_task] Get video detail error: {ex}")
return None
except KeyError as ex:
utils.logger.error(f"[KuaishouCrawler.get_video_info_task] have not fund video detail video_id:{video_id}, err: {ex}")
return None
async def batch_get_video_comments(self, video_id_list: List[str]):
"""
batch get video comments
:param video_id_list:
:return:
"""
if not config.ENABLE_GET_COMMENTS:
utils.logger.info(f"[KuaishouCrawler.batch_get_video_comments] Crawling comment mode is not enabled")
return
utils.logger.info(f"[KuaishouCrawler.batch_get_video_comments] video ids:{video_id_list}")
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list: List[Task] = []
for video_id in video_id_list:
task = asyncio.create_task(self.get_comments(video_id, semaphore), name=video_id)
task_list.append(task)
comment_tasks_var.set(task_list)
await asyncio.gather(*task_list)
async def get_comments(self, video_id: str, semaphore: asyncio.Semaphore):
"""
get comment for video id
:param video_id:
:param semaphore:
:return:
"""
async with semaphore:
try:
utils.logger.info(f"[KuaishouCrawler.get_comments] begin get video_id: {video_id} comments ...")
await self.ks_client.get_video_all_comments(
photo_id=video_id,
crawl_interval=random.random(),
callback=kuaishou_store.batch_update_ks_video_comments
)
except DataFetchError as ex:
utils.logger.error(f"[KuaishouCrawler.get_comments] get video_id: {video_id} comment error: {ex}")
except Exception as e:
utils.logger.error(f"[KuaishouCrawler.get_comments] may be been blocked, err:{e}")
# use time.sleep to block the main coroutine instead of asyncio.sleep, and cancel the running comment tasks
# kuaishou may have blocked our requests, so take a nap and then refresh the cookies
current_running_tasks = comment_tasks_var.get()
for task in current_running_tasks:
task.cancel()
time.sleep(20)
await self.context_page.goto(f"{self.index_url}?isHome=1")
await self.ks_client.update_cookies(browser_context=self.browser_context)
@staticmethod
def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
"""format proxy info for playwright and httpx"""
playwright_proxy = {
"server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
"username": ip_proxy_info.user,
"password": ip_proxy_info.password,
}
httpx_proxy = {
f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
}
return playwright_proxy, httpx_proxy
async def create_ks_client(self, httpx_proxy: Optional[str]) -> KuaiShouClient:
"""Create ks client"""
utils.logger.info("[KuaishouCrawler.create_ks_client] Begin create kuaishou API client ...")
cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
ks_client_obj = KuaiShouClient(
proxies=httpx_proxy,
headers={
"User-Agent": self.user_agent,
"Cookie": cookie_str,
"Origin": self.index_url,
"Referer": self.index_url,
"Content-Type": "application/json;charset=UTF-8"
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
)
return ks_client_obj
async def launch_browser(
self,
chromium: BrowserType,
playwright_proxy: Optional[Dict],
user_agent: Optional[str],
headless: bool = True
) -> BrowserContext:
"""Launch browser and create browser context"""
utils.logger.info("[KuaishouCrawler.launch_browser] Begin create browser context ...")
if config.SAVE_LOGIN_STATE:
user_data_dir = os.path.join(os.getcwd(), "browser_data",
config.USER_DATA_DIR % config.PLATFORM) # type: ignore
browser_context = await chromium.launch_persistent_context(
user_data_dir=user_data_dir,
accept_downloads=True,
headless=headless,
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
else:
browser = await chromium.launch(headless=headless, proxy=playwright_proxy) # type: ignore
browser_context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
async def get_creators_and_videos(self) -> None:
"""Get creator's videos and retrieve their comment information."""
utils.logger.info("[KuaiShouCrawler.get_creators_and_videos] Begin get kuaishou creators")
for user_id in config.KS_CREATOR_ID_LIST:
# get creator detail info from web html content
creator_info: Dict = await self.ks_client.get_creator_info(user_id=user_id)
if creator_info:
await kuaishou_store.save_creator(user_id, creator=creator_info)
# Get all video information of the creator
all_video_list = await self.ks_client.get_all_videos_by_creator(
user_id=user_id,
crawl_interval=random.random(),
callback=self.fetch_creator_video_detail
)
video_ids = [video_item.get("photo", {}).get("id") for video_item in all_video_list]
await self.batch_get_video_comments(video_ids)
async def fetch_creator_video_detail(self, video_list: List[Dict]):
"""
Concurrently obtain the specified post list and save the data
"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_video_info_task(post_item.get("photo", {}).get("id"), semaphore) for post_item in video_list
]
video_details = await asyncio.gather(*task_list)
for video_detail in video_details:
if video_detail is not None:
await kuaishou_store.update_kuaishou_video(video_detail)
async def close(self):
"""Close browser context"""
await self.browser_context.close()
utils.logger.info("[KuaishouCrawler.close] Browser context closed ...")

View file

@ -0,0 +1,9 @@
from httpx import RequestError
class DataFetchError(RequestError):
"""something error when fetch"""
class IPBlockError(RequestError):
"""fetch so fast that the server block us ip"""

View file

@ -0,0 +1 @@
# -*- coding: utf-8 -*-

View file

@ -0,0 +1,22 @@
# Kuaishou's data transport is implemented with GraphQL
# This class is responsible for loading the GraphQL query schemas
from typing import Dict
class KuaiShouGraphQL:
graphql_queries: Dict[str, str]= {}
def __init__(self):
self.graphql_dir = "media_platform/kuaishou/graphql/"
self.load_graphql_queries()
def load_graphql_queries(self):
graphql_files = ["search_query.graphql", "video_detail.graphql", "comment_list.graphql", "vision_profile.graphql","vision_profile_photo_list.graphql","vision_profile_user_list.graphql","vision_sub_comment_list.graphql"]
for file in graphql_files:
with open(self.graphql_dir + file, mode="r") as f:
query_name = file.split(".")[0]
self.graphql_queries[query_name] = f.read()
def get(self, query_name: str) -> str:
return self.graphql_queries.get(query_name, "Query not found")
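# Usage sketch (illustrative): queries are loaded from media_platform/kuaishou/graphql/ at
# construction time and looked up by file name without the extension, e.g.
# graphql = KuaiShouGraphQL()
# search_query_text = graphql.get("search_query")  # contents of search_query.graphql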

View file

@ -0,0 +1,39 @@
query commentListQuery($photoId: String, $pcursor: String) {
visionCommentList(photoId: $photoId, pcursor: $pcursor) {
commentCount
pcursor
rootComments {
commentId
authorId
authorName
content
headurl
timestamp
likedCount
realLikedCount
liked
status
authorLiked
subCommentCount
subCommentsPcursor
subComments {
commentId
authorId
authorName
content
headurl
timestamp
likedCount
realLikedCount
liked
status
authorLiked
replyToUserName
replyTo
__typename
}
__typename
}
__typename
}
}

View file

@ -0,0 +1,111 @@
fragment photoContent on PhotoEntity {
__typename
id
duration
caption
originCaption
likeCount
viewCount
commentCount
realLikeCount
coverUrl
photoUrl
photoH265Url
manifest
manifestH265
videoResource
coverUrls {
url
__typename
}
timestamp
expTag
animatedCoverUrl
distance
videoRatio
liked
stereoType
profileUserTopPhoto
musicBlocked
}
fragment recoPhotoFragment on recoPhotoEntity {
__typename
id
duration
caption
originCaption
likeCount
viewCount
commentCount
realLikeCount
coverUrl
photoUrl
photoH265Url
manifest
manifestH265
videoResource
coverUrls {
url
__typename
}
timestamp
expTag
animatedCoverUrl
distance
videoRatio
liked
stereoType
profileUserTopPhoto
musicBlocked
}
fragment feedContent on Feed {
type
author {
id
name
headerUrl
following
headerUrls {
url
__typename
}
__typename
}
photo {
...photoContent
...recoPhotoFragment
__typename
}
canAddComment
llsid
status
currentPcursor
tags {
type
name
__typename
}
__typename
}
query visionSearchPhoto($keyword: String, $pcursor: String, $searchSessionId: String, $page: String, $webPageArea: String) {
visionSearchPhoto(keyword: $keyword, pcursor: $pcursor, searchSessionId: $searchSessionId, page: $page, webPageArea: $webPageArea) {
result
llsid
webPageArea
feeds {
...feedContent
__typename
}
searchSessionId
pcursor
aladdinBanner {
imgUrl
link
__typename
}
__typename
}
}

View file

@ -0,0 +1,80 @@
query visionVideoDetail($photoId: String, $type: String, $page: String, $webPageArea: String) {
visionVideoDetail(photoId: $photoId, type: $type, page: $page, webPageArea: $webPageArea) {
status
type
author {
id
name
following
headerUrl
__typename
}
photo {
id
duration
caption
likeCount
realLikeCount
coverUrl
photoUrl
liked
timestamp
expTag
llsid
viewCount
videoRatio
stereoType
musicBlocked
manifest {
mediaType
businessType
version
adaptationSet {
id
duration
representation {
id
defaultSelect
backupUrl
codecs
url
height
width
avgBitrate
maxBitrate
m3u8Slice
qualityType
qualityLabel
frameRate
featureP2sp
hidden
disableAdaptive
__typename
}
__typename
}
__typename
}
manifestH265
photoH265Url
coronaCropManifest
coronaCropManifestH265
croppedPhotoH265Url
croppedPhotoUrl
videoResource
__typename
}
tags {
type
name
__typename
}
commentLimit {
canAddComment
__typename
}
llsid
danmakuSwitch
__typename
}
}

View file

@ -0,0 +1,27 @@
query visionProfile($userId: String) {
visionProfile(userId: $userId) {
result
hostName
userProfile {
ownerCount {
fan
photo
follow
photo_public
__typename
}
profile {
gender
user_name
user_id
headurl
user_text
user_profile_bg_url
__typename
}
isFollowing
__typename
}
__typename
}
}

View file

@ -0,0 +1,110 @@
fragment photoContent on PhotoEntity {
__typename
id
duration
caption
originCaption
likeCount
viewCount
commentCount
realLikeCount
coverUrl
photoUrl
photoH265Url
manifest
manifestH265
videoResource
coverUrls {
url
__typename
}
timestamp
expTag
animatedCoverUrl
distance
videoRatio
liked
stereoType
profileUserTopPhoto
musicBlocked
riskTagContent
riskTagUrl
}
fragment recoPhotoFragment on recoPhotoEntity {
__typename
id
duration
caption
originCaption
likeCount
viewCount
commentCount
realLikeCount
coverUrl
photoUrl
photoH265Url
manifest
manifestH265
videoResource
coverUrls {
url
__typename
}
timestamp
expTag
animatedCoverUrl
distance
videoRatio
liked
stereoType
profileUserTopPhoto
musicBlocked
riskTagContent
riskTagUrl
}
fragment feedContent on Feed {
type
author {
id
name
headerUrl
following
headerUrls {
url
__typename
}
__typename
}
photo {
...photoContent
...recoPhotoFragment
__typename
}
canAddComment
llsid
status
currentPcursor
tags {
type
name
__typename
}
__typename
}
query visionProfilePhotoList($pcursor: String, $userId: String, $page: String, $webPageArea: String) {
visionProfilePhotoList(pcursor: $pcursor, userId: $userId, page: $page, webPageArea: $webPageArea) {
result
llsid
webPageArea
feeds {
...feedContent
__typename
}
hostName
pcursor
__typename
}
}

View file

@ -0,0 +1,16 @@
query visionProfileUserList($pcursor: String, $ftype: Int) {
visionProfileUserList(pcursor: $pcursor, ftype: $ftype) {
result
fols {
user_name
headurl
user_text
isFollowing
user_id
__typename
}
hostName
pcursor
__typename
}
}

View file

@ -0,0 +1,22 @@
mutation visionSubCommentList($photoId: String, $rootCommentId: String, $pcursor: String) {
visionSubCommentList(photoId: $photoId, rootCommentId: $rootCommentId, pcursor: $pcursor) {
pcursor
subComments {
commentId
authorId
authorName
content
headurl
timestamp
likedCount
realLikedCount
liked
status
authorLiked
replyToUserName
replyTo
__typename
}
__typename
}
}

View file

@ -0,0 +1,102 @@
import asyncio
import functools
import sys
from typing import Optional
from playwright.async_api import BrowserContext, Page
from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
wait_fixed)
import config
from base.base_crawler import AbstractLogin
from tools import utils
class KuaishouLogin(AbstractLogin):
def __init__(self,
login_type: str,
browser_context: BrowserContext,
context_page: Page,
login_phone: Optional[str] = "",
cookie_str: str = ""
):
config.LOGIN_TYPE = login_type
self.browser_context = browser_context
self.context_page = context_page
self.login_phone = login_phone
self.cookie_str = cookie_str
async def begin(self):
"""Start login xiaohongshu"""
utils.logger.info("[KuaishouLogin.begin] Begin login kuaishou ...")
if config.LOGIN_TYPE == "qrcode":
await self.login_by_qrcode()
elif config.LOGIN_TYPE == "phone":
await self.login_by_mobile()
elif config.LOGIN_TYPE == "cookie":
await self.login_by_cookies()
else:
raise ValueError("[KuaishouLogin.begin] Invalid Login Type Currently only supported qrcode or phone or cookie ...")
@retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
async def check_login_state(self) -> bool:
"""
Check whether the current login state is successful; return True if it is, otherwise False.
The retry decorator retries up to 600 times while the return value is False, waiting 1 second between attempts;
if the maximum number of retries is reached, a RetryError is raised.
"""
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
kuaishou_pass_token = cookie_dict.get("passToken")
if kuaishou_pass_token:
return True
return False
async def login_by_qrcode(self):
"""login kuaishou website and keep webdriver login state"""
utils.logger.info("[KuaishouLogin.login_by_qrcode] Begin login kuaishou by qrcode ...")
# click login button
login_button_ele = self.context_page.locator(
"xpath=//p[text()='登录']"
)
await login_button_ele.click()
# find login qrcode
qrcode_img_selector = "//div[@class='qrcode-img']//img"
base64_qrcode_img = await utils.find_login_qrcode(
self.context_page,
selector=qrcode_img_selector
)
if not base64_qrcode_img:
utils.logger.info("[KuaishouLogin.login_by_qrcode] login failed , have not found qrcode please check ....")
sys.exit()
# show login qrcode
partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
utils.logger.info(f"[KuaishouLogin.login_by_qrcode] waiting for scan code login, remaining time is 20s")
try:
await self.check_login_state()
except RetryError:
utils.logger.info("[KuaishouLogin.login_by_qrcode] Login kuaishou failed by qrcode login method ...")
sys.exit()
wait_redirect_seconds = 5
utils.logger.info(f"[KuaishouLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
await asyncio.sleep(wait_redirect_seconds)
async def login_by_mobile(self):
pass
async def login_by_cookies(self):
utils.logger.info("[KuaishouLogin.login_by_cookies] Begin login kuaishou by cookie ...")
for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
await self.browser_context.add_cookies([{
'name': key,
'value': value,
'domain': ".kuaishou.com",
'path': "/"
}])
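A standalone sketch (not part of the commit) of the cookie-string parsing that login_by_cookies relies on; utils.convert_str_cookie_to_dict is assumed here to split a browser-style "k1=v1; k2=v2" string into a dict, and the cookie values are made up:
def convert_str_cookie_to_dict(cookie_str: str) -> dict:
    # assumed behaviour of the project helper: split "k1=v1; k2=v2" pairs
    cookies = {}
    for item in cookie_str.split(";"):
        if "=" in item:
            name, value = item.strip().split("=", 1)
            cookies[name] = value
    return cookies

print(convert_str_cookie_to_dict("passToken=abc123; userId=42"))
# {'passToken': 'abc123', 'userId': '42'}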

View file

@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:40
# @Desc :
from .client import WeiboClient
from .core import WeiboCrawler
from .login import WeiboLogin

View file

@ -0,0 +1,206 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:40
# @Desc : Weibo crawler API request client
import asyncio
import copy
import json
import re
from typing import Any, Callable, Dict, List, Optional
from urllib.parse import urlencode
import httpx
from playwright.async_api import BrowserContext, Page
from tools import utils
from .exception import DataFetchError
from .field import SearchType
class WeiboClient:
def __init__(
self,
timeout=10,
proxies=None,
*,
headers: Dict[str, str],
playwright_page: Page,
cookie_dict: Dict[str, str],
):
self.proxies = proxies
self.timeout = timeout
self.headers = headers
self._host = "https://m.weibo.cn"
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
self._image_agent_host = "https://i1.wp.com/"
async def request(self, method, url, **kwargs) -> Any:
async with httpx.AsyncClient(proxies=self.proxies) as client:
response = await client.request(
method, url, timeout=self.timeout,
**kwargs
)
data: Dict = response.json()
if data.get("ok") != 1:
utils.logger.error(f"[WeiboClient.request] request {method}:{url} err, res:{data}")
raise DataFetchError(data.get("msg", "unknown error"))
else:
return data.get("data", {})
async def get(self, uri: str, params=None, headers=None) -> Dict:
final_uri = uri
if isinstance(params, dict):
final_uri = (f"{uri}?"
f"{urlencode(params)}")
if headers is None:
headers = self.headers
return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=headers)
async def post(self, uri: str, data: dict) -> Dict:
json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
return await self.request(method="POST", url=f"{self._host}{uri}",
data=json_str, headers=self.headers)
async def pong(self) -> bool:
"""get a note to check if login state is ok"""
utils.logger.info("[WeiboClient.pong] Begin pong weibo...")
ping_flag = False
try:
uri = "/api/config"
resp_data: Dict = await self.request(method="GET", url=f"{self._host}{uri}", headers=self.headers)
if resp_data.get("login"):
ping_flag = True
else:
utils.logger.error(f"[WeiboClient.pong] cookie may be invalid and again login...")
except Exception as e:
utils.logger.error(f"[WeiboClient.pong] Pong weibo failed: {e}, and try to login again...")
ping_flag = False
return ping_flag
async def update_cookies(self, browser_context: BrowserContext):
cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
self.headers["Cookie"] = cookie_str
self.cookie_dict = cookie_dict
async def get_note_by_keyword(
self,
keyword: str,
page: int = 1,
search_type: SearchType = SearchType.DEFAULT
) -> Dict:
"""
search note by keyword
:param keyword: weibo search keyword
:param page: pagination parameter - current page number
:param search_type: search type, one of the SearchType enum values in weibo/field.py
:return:
"""
uri = "/api/container/getIndex"
containerid = f"100103type={search_type.value}&q={keyword}"
params = {
"containerid": containerid,
"page_type": "searchall",
"page": page,
}
return await self.get(uri, params)
async def get_note_comments(self, mid_id: str, max_id: int) -> Dict:
"""get notes comments
:param mid_id: weibo note id
:param max_id: pagination parameter id
:return:
"""
uri = "/comments/hotflow"
params = {
"id": mid_id,
"mid": mid_id,
"max_id_type": 0,
}
if max_id > 0:
params.update({"max_id": max_id})
referer_url = f"https://m.weibo.cn/detail/{mid_id}"
headers = copy.copy(self.headers)
headers["Referer"] = referer_url
return await self.get(uri, params, headers=headers)
async def get_note_all_comments(self, note_id: str, crawl_interval: float = 1.0, is_fetch_sub_comments=False,
callback: Optional[Callable] = None, ):
"""
get note all comments include sub comments
:param note_id:
:param crawl_interval:
:param is_fetch_sub_comments:
:param callback:
:return:
"""
result = []
is_end = False
max_id = -1
while not is_end:
comments_res = await self.get_note_comments(note_id, max_id)
max_id: int = comments_res.get("max_id")
comment_list: List[Dict] = comments_res.get("data", [])
is_end = max_id == 0
if callback: # invoke the callback if one was provided
await callback(note_id, comment_list)
await asyncio.sleep(crawl_interval)
if not is_fetch_sub_comments:
result.extend(comment_list)
continue
# todo handle get sub comments
return result
async def get_note_info_by_id(self, note_id: str) -> Dict:
"""
Get note detail by note id
:param note_id:
:return:
"""
url = f"{self._host}/detail/{note_id}"
async with httpx.AsyncClient(proxies=self.proxies) as client:
response = await client.request(
"GET", url, timeout=self.timeout, headers=self.headers
)
if response.status_code != 200:
raise DataFetchError(f"get weibo detail err: {response.text}")
match = re.search(r'var \$render_data = (\[.*?\])\[0\]', response.text, re.DOTALL)
if match:
render_data_json = match.group(1)
render_data_dict = json.loads(render_data_json)
note_detail = render_data_dict[0].get("status")
note_item = {
"mblog": note_detail
}
return note_item
else:
utils.logger.info(f"[WeiboClient.get_note_info_by_id] 未找到$render_data的值")
return dict()
async def get_note_image(self, image_url: str) -> bytes:
image_url = image_url[8:] # strip the "https://" scheme
sub_url = image_url.split("/")
image_url = ""
for i in range(len(sub_url)):
if i == 1:
image_url += "large/" #都获取高清大图
elif i == len(sub_url) - 1:
image_url += sub_url[i]
else:
image_url += sub_url[i] + "/"
# Weibo's image host uses hotlink protection, so images have to be fetched through a proxy
# Weibo images are served via i1.wp.com, so the proxy host is prepended to the rebuilt path
final_uri = (f"{self._image_agent_host}" f"{image_url}")
async with httpx.AsyncClient(proxies=self.proxies) as client:
response = await client.request("GET", final_uri, timeout=self.timeout)
if response.reason_phrase != "OK":
utils.logger.error(f"[WeiboClient.get_note_image] request {final_uri} err, res:{response.text}")
return None
else:
return response.content
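The two comments in get_note_image above describe the URL rewrite; this standalone sketch (with a made-up image URL) shows the same transformation in isolation: drop the scheme, force the large/ size segment, then prefix the i1.wp.com image proxy:
def to_proxied_large_url(image_url: str, agent_host: str = "https://i1.wp.com/") -> str:
    image_url = image_url[8:]            # drop "https://"
    parts = image_url.split("/")
    rebuilt = ""
    for i in range(len(parts)):
        if i == 1:
            rebuilt += "large/"          # always request the large variant
        elif i == len(parts) - 1:
            rebuilt += parts[i]
        else:
            rebuilt += parts[i] + "/"
    return f"{agent_host}{rebuilt}"

print(to_proxied_large_url("https://wx1.sinaimg.cn/orj360/abc123.jpg"))
# https://i1.wp.com/wx1.sinaimg.cn/large/abc123.jpg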

View file

@ -0,0 +1,283 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:41
# @Desc : Weibo crawler main workflow
import asyncio
import os
import random
from asyncio import Task
from typing import Dict, List, Optional, Tuple
from playwright.async_api import (BrowserContext, BrowserType, Page,
async_playwright)
import config
from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import weibo as weibo_store
from tools import utils
from var import crawler_type_var
from .client import WeiboClient
from .exception import DataFetchError
from .field import SearchType
from .help import filter_search_result_card
from .login import WeiboLogin
class WeiboCrawler(AbstractCrawler):
context_page: Page
wb_client: WeiboClient
browser_context: BrowserContext
def __init__(self):
self.index_url = "https://www.weibo.com"
self.mobile_index_url = "https://m.weibo.cn"
self.user_agent = utils.get_user_agent()
self.mobile_user_agent = utils.get_mobile_user_agent()
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
if config.ENABLE_IP_PROXY:
ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.mobile_user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page()
await self.context_page.goto(self.mobile_index_url)
# Create a client to interact with the weibo website.
self.wb_client = await self.create_weibo_client(httpx_proxy_format)
if not await self.wb_client.pong():
login_obj = WeiboLogin(
login_type=config.LOGIN_TYPE,
login_phone="", # your phone number
browser_context=self.browser_context,
context_page=self.context_page,
cookie_str=config.COOKIES
)
await self.context_page.goto(self.index_url)
await asyncio.sleep(1)
await login_obj.begin()
# After a successful login, redirect to the mobile site and refresh the cookies obtained there
utils.logger.info("[WeiboCrawler.start] redirect weibo mobile homepage and update cookies on mobile platform")
await self.context_page.goto(self.mobile_index_url)
await asyncio.sleep(2)
await self.wb_client.update_cookies(browser_context=self.browser_context)
crawler_type_var.set(config.CRAWLER_TYPE)
if config.CRAWLER_TYPE == "search":
# Search for video and retrieve their comment information.
await self.search()
elif config.CRAWLER_TYPE == "detail":
# Get the information and comments of the specified post
await self.get_specified_notes()
else:
pass
utils.logger.info("[WeiboCrawler.start] Weibo Crawler finished ...")
async def search(self):
"""
search weibo note with keywords
:return:
"""
utils.logger.info("[WeiboCrawler.search] Begin search weibo keywords")
weibo_limit_count = 10 # weibo limit page fixed value
if config.CRAWLER_MAX_NOTES_COUNT < weibo_limit_count:
config.CRAWLER_MAX_NOTES_COUNT = weibo_limit_count
start_page = config.START_PAGE
for keyword in config.KEYWORDS.split(","):
utils.logger.info(f"[WeiboCrawler.search] Current search keyword: {keyword}")
page = 1
while (page - start_page + 1) * weibo_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
if page < start_page:
utils.logger.info(f"[WeiboCrawler.search] Skip page: {page}")
page += 1
continue
utils.logger.info(f"[WeiboCrawler.search] search weibo keyword: {keyword}, page: {page}")
search_res = await self.wb_client.get_note_by_keyword(
keyword=keyword,
page=page,
search_type=SearchType.DEFAULT
)
note_id_list: List[str] = []
note_list = filter_search_result_card(search_res.get("cards"))
for note_item in note_list:
if note_item:
mblog: Dict = note_item.get("mblog")
if mblog:
note_id_list.append(mblog.get("id"))
await weibo_store.update_weibo_note(note_item)
await self.get_note_images(mblog)
page += 1
await self.batch_get_notes_comments(note_id_list)
async def get_specified_notes(self):
"""
get specified notes info
:return:
"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_note_info_task(note_id=note_id, semaphore=semaphore) for note_id in
config.WEIBO_SPECIFIED_ID_LIST
]
video_details = await asyncio.gather(*task_list)
for note_item in video_details:
if note_item:
await weibo_store.update_weibo_note(note_item)
await self.batch_get_notes_comments(config.WEIBO_SPECIFIED_ID_LIST)
async def get_note_info_task(self, note_id: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
"""
Get note detail task
:param note_id:
:param semaphore:
:return:
"""
async with semaphore:
try:
result = await self.wb_client.get_note_info_by_id(note_id)
return result
except DataFetchError as ex:
utils.logger.error(f"[WeiboCrawler.get_note_info_task] Get note detail error: {ex}")
return None
except KeyError as ex:
utils.logger.error(
f"[WeiboCrawler.get_note_info_task] have not fund note detail note_id:{note_id}, err: {ex}")
return None
async def batch_get_notes_comments(self, note_id_list: List[str]):
"""
batch get notes comments
:param note_id_list:
:return:
"""
if not config.ENABLE_GET_COMMENTS:
utils.logger.info(f"[WeiboCrawler.batch_get_note_comments] Crawling comment mode is not enabled")
return
utils.logger.info(f"[WeiboCrawler.batch_get_notes_comments] note ids:{note_id_list}")
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list: List[Task] = []
for note_id in note_id_list:
task = asyncio.create_task(self.get_note_comments(note_id, semaphore), name=note_id)
task_list.append(task)
await asyncio.gather(*task_list)
async def get_note_comments(self, note_id: str, semaphore: asyncio.Semaphore):
"""
get comment for note id
:param note_id:
:param semaphore:
:return:
"""
async with semaphore:
try:
utils.logger.info(f"[WeiboCrawler.get_note_comments] begin get note_id: {note_id} comments ...")
await self.wb_client.get_note_all_comments(
note_id=note_id,
crawl_interval=random.randint(1, 10), # Weibo rate-limits its API aggressively, so use a longer delay
callback=weibo_store.batch_update_weibo_note_comments
)
except DataFetchError as ex:
utils.logger.error(f"[WeiboCrawler.get_note_comments] get note_id: {note_id} comment error: {ex}")
except Exception as e:
utils.logger.error(f"[WeiboCrawler.get_note_comments] may be been blocked, err:{e}")
async def get_note_images(self, mblog: Dict):
"""
get note images
:param mblog:
:return:
"""
if not config.ENABLE_GET_IMAGES:
utils.logger.info(f"[WeiboCrawler.get_note_images] Crawling image mode is not enabled")
return
pics: Dict = mblog.get("pics")
if not pics:
return
for pic in pics:
url = pic.get("url")
if not url:
continue
content = await self.wb_client.get_note_image(url)
if content is not None:
extension_file_name = url.split(".")[-1]
await weibo_store.update_weibo_note_image(pic["pid"], content, extension_file_name)
async def create_weibo_client(self, httpx_proxy: Optional[str]) -> WeiboClient:
"""Create xhs client"""
utils.logger.info("[WeiboCrawler.create_weibo_client] Begin create weibo API client ...")
cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
weibo_client_obj = WeiboClient(
proxies=httpx_proxy,
headers={
"User-Agent": utils.get_mobile_user_agent(),
"Cookie": cookie_str,
"Origin": "https://m.weibo.cn",
"Referer": "https://m.weibo.cn",
"Content-Type": "application/json;charset=UTF-8"
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
)
return weibo_client_obj
@staticmethod
def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
"""format proxy info for playwright and httpx"""
playwright_proxy = {
"server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
"username": ip_proxy_info.user,
"password": ip_proxy_info.password,
}
httpx_proxy = {
f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
}
return playwright_proxy, httpx_proxy
async def launch_browser(
self,
chromium: BrowserType,
playwright_proxy: Optional[Dict],
user_agent: Optional[str],
headless: bool = True
) -> BrowserContext:
"""Launch browser and create browser context"""
utils.logger.info("[WeiboCrawler.launch_browser] Begin create browser context ...")
if config.SAVE_LOGIN_STATE:
user_data_dir = os.path.join(os.getcwd(), "browser_data",
config.USER_DATA_DIR % config.PLATFORM) # type: ignore
browser_context = await chromium.launch_persistent_context(
user_data_dir=user_data_dir,
accept_downloads=True,
headless=headless,
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
else:
browser = await chromium.launch(headless=headless, proxy=playwright_proxy) # type: ignore
browser_context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
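A worked sketch of the paging bound used in WeiboCrawler.search above, with made-up stand-ins for the config values; with a fixed page size of 10, a cap of 35 notes and START_PAGE = 2, pages 2, 3 and 4 are crawled (the real loop starts at page 1 and skips pages below START_PAGE, which yields the same set):
weibo_limit_count = 10           # fixed page size of the m.weibo.cn search API
crawler_max_notes_count = 35     # stand-in for config.CRAWLER_MAX_NOTES_COUNT
start_page = 2                   # stand-in for config.START_PAGE

page = start_page
crawled_pages = []
while (page - start_page + 1) * weibo_limit_count <= crawler_max_notes_count:
    crawled_pages.append(page)
    page += 1
print(crawled_pages)             # [2, 3, 4] -> at most 30 of the requested 35 notes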

View file

@ -0,0 +1,14 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc :
from httpx import RequestError
class DataFetchError(RequestError):
"""something error when fetch"""
class IPBlockError(RequestError):
"""fetch so fast that the server block us ip"""

View file

@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:41
# @Desc :
from enum import Enum
class SearchType(Enum):
# composite (default)
DEFAULT = "1"
# real-time
REAL_TIME = "61"
# popular
POPULAR = "60"
# video
VIDEO = "64"

View file

@ -0,0 +1,25 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/24 17:37
# @Desc :
from typing import Dict, List
def filter_search_result_card(card_list: List[Dict]) -> List[Dict]:
"""
Filter weibo search results, keeping only cards whose card_type is 9
:param card_list:
:return:
"""
note_list: List[Dict] = []
for card_item in card_list:
if card_item.get("card_type") == 9:
note_list.append(card_item)
if len(card_item.get("card_group", [])) > 0:
card_group = card_item.get("card_group")
for card_group_item in card_group:
if card_group_item.get("card_type") == 9:
note_list.append(card_group_item)
return note_list
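A tiny usage sketch of the filter above with made-up cards: entries whose card_type is 9 are kept both at the top level and inside card_group, everything else is dropped:
cards = [
    {"card_type": 9, "mblog": {"id": "1"}},
    {"card_type": 11, "card_group": [{"card_type": 9, "mblog": {"id": "2"}},
                                     {"card_type": 4}]},
]
print([c["mblog"]["id"] for c in filter_search_result_card(cards)])   # ['1', '2']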

View file

@ -0,0 +1,137 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:42
# @Desc : Weibo login implementation
import asyncio
import functools
import sys
from typing import Optional
from playwright.async_api import BrowserContext, Page
from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
wait_fixed)
import config
from base.base_crawler import AbstractLogin
from tools import utils
class WeiboLogin(AbstractLogin):
def __init__(self,
login_type: str,
browser_context: BrowserContext,
context_page: Page,
login_phone: Optional[str] = "",
cookie_str: str = ""
):
config.LOGIN_TYPE = login_type
self.browser_context = browser_context
self.context_page = context_page
self.login_phone = login_phone
self.cookie_str = cookie_str
async def begin(self):
"""Start login weibo"""
utils.logger.info("[WeiboLogin.begin] Begin login weibo ...")
if config.LOGIN_TYPE == "qrcode":
await self.login_by_qrcode()
elif config.LOGIN_TYPE == "phone":
await self.login_by_mobile()
elif config.LOGIN_TYPE == "cookie":
await self.login_by_cookies()
else:
raise ValueError(
"[WeiboLogin.begin] Invalid login type; currently only qrcode, phone, or cookie are supported ...")
@retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
async def check_login_state(self, no_logged_in_session: str) -> bool:
"""
Check whether the current login state is successful; return True if it is, otherwise False.
The retry decorator retries up to 600 times while the return value is False, waiting 1 second between attempts;
if the maximum number of retries is reached, a RetryError is raised.
"""
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
current_web_session = cookie_dict.get("WBPSESS")
if current_web_session != no_logged_in_session:
return True
return False
async def popup_login_dialog(self):
"""If the login dialog box does not pop up automatically, we will manually click the login button"""
dialog_selector = "xpath=//div[@class='woo-modal-main']"
try:
# check dialog box is auto popup and wait for 4 seconds
await self.context_page.wait_for_selector(dialog_selector, timeout=1000 * 4)
except Exception as e:
utils.logger.error(
f"[WeiboLogin.popup_login_dialog] login dialog box does not pop up automatically, error: {e}")
utils.logger.info(
"[WeiboLogin.popup_login_dialog] login dialog box does not pop up automatically, we will manually click the login button")
# scroll down 500 pixels
await self.context_page.mouse.wheel(0,500)
await asyncio.sleep(0.5)
try:
# click login button
login_button_ele = self.context_page.locator(
"xpath=//a[text()='登录']",
)
await login_button_ele.click()
await asyncio.sleep(0.5)
except Exception as e:
utils.logger.info(f"[WeiboLogin.popup_login_dialog] manually click the login button faield maybe login dialog Appear{e}")
async def login_by_qrcode(self):
"""login weibo website and keep webdriver login state"""
utils.logger.info("[WeiboLogin.login_by_qrcode] Begin login weibo by qrcode ...")
await self.popup_login_dialog()
# find login qrcode
qrcode_img_selector = "//div[@class='woo-modal-main']//img"
base64_qrcode_img = await utils.find_login_qrcode(
self.context_page,
selector=qrcode_img_selector
)
if not base64_qrcode_img:
utils.logger.info("[WeiboLogin.login_by_qrcode] login failed , have not found qrcode please check ....")
sys.exit()
# show login qrcode
partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
utils.logger.info(f"[WeiboLogin.login_by_qrcode] Waiting for scan code login, remaining time is 20s")
# get not logged session
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
no_logged_in_session = cookie_dict.get("WBPSESS")
try:
await self.check_login_state(no_logged_in_session)
except RetryError:
utils.logger.info("[WeiboLogin.login_by_qrcode] Login weibo failed by qrcode login method ...")
sys.exit()
wait_redirect_seconds = 5
utils.logger.info(
f"[WeiboLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
await asyncio.sleep(wait_redirect_seconds)
async def login_by_mobile(self):
pass
async def login_by_cookies(self):
utils.logger.info("[WeiboLogin.login_by_qrcode] Begin login weibo by cookie ...")
for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
await self.browser_context.add_cookies([{
'name': key,
'value': value,
'domain': ".weibo.cn",
'path': "/"
}])
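check_login_state above is polled through tenacity until the WBPSESS cookie changes; the standalone sketch below shows the same retry_if_result polling pattern with a toy predicate standing in for the cookie check:
import asyncio
import itertools

from tenacity import retry, retry_if_result, stop_after_attempt, wait_fixed

_attempts = itertools.count(1)

@retry(stop=stop_after_attempt(5), wait=wait_fixed(0.1),
       retry=retry_if_result(lambda value: value is False))
async def wait_until_logged_in() -> bool:
    # toy stand-in for "has the login cookie changed yet?"
    return next(_attempts) >= 3

print(asyncio.run(wait_until_logged_in()))   # True, returned on the third attempt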

View file

@ -0,0 +1,2 @@
from .core import XiaoHongShuCrawler
from .field import *

View file

@ -0,0 +1,419 @@
import asyncio
import json
import re
from typing import Any, Callable, Dict, List, Optional, Union
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import httpx
from playwright.async_api import BrowserContext, Page
import config
from base.base_crawler import AbstractApiClient
from tools import utils
from .exception import DataFetchError, IPBlockError
from .field import SearchNoteType, SearchSortType
from .help import get_search_id, sign
class XiaoHongShuClient(AbstractApiClient):
def __init__(
self,
timeout=10,
proxies=None,
*,
headers: Dict[str, str],
playwright_page: Page,
cookie_dict: Dict[str, str],
):
self.proxies = proxies
self.timeout = timeout
self.headers = headers
self._host = "https://edith.xiaohongshu.com"
self._domain = "https://www.xiaohongshu.com"
self.IP_ERROR_STR = "网络连接异常,请检查网络设置或重启试试"
self.IP_ERROR_CODE = 300012
self.NOTE_ABNORMAL_STR = "笔记状态异常,请稍后查看"
self.NOTE_ABNORMAL_CODE = -510001
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
async def _pre_headers(self, url: str, data=None) -> Dict:
"""
Sign the request header parameters
Args:
url:
data:
Returns:
"""
encrypt_params = await self.playwright_page.evaluate("([url, data]) => window._webmsxyw(url,data)", [url, data])
local_storage = await self.playwright_page.evaluate("() => window.localStorage")
signs = sign(
a1=self.cookie_dict.get("a1", ""),
b1=local_storage.get("b1", ""),
x_s=encrypt_params.get("X-s", ""),
x_t=str(encrypt_params.get("X-t", ""))
)
headers = {
"X-S": signs["x-s"],
"X-T": signs["x-t"],
"x-S-Common": signs["x-s-common"],
"X-B3-Traceid": signs["x-b3-traceid"]
}
self.headers.update(headers)
return self.headers
async def request(self, method, url, **kwargs) -> Union[str, Any]:
"""
Wrapper around httpx's request method that post-processes the response
Args:
method: HTTP method
url: request URL
**kwargs: extra request arguments, e.g. headers and request body
Returns:
"""
# return response.text
return_response = kwargs.pop('return_response', False)
async with httpx.AsyncClient(proxies=self.proxies) as client:
response = await client.request(
method, url, timeout=self.timeout,
**kwargs
)
if return_response:
return response.text
data: Dict = response.json()
if data["success"]:
return data.get("data", data.get("success", {}))
elif data["code"] == self.IP_ERROR_CODE:
raise IPBlockError(self.IP_ERROR_STR)
else:
raise DataFetchError(data.get("msg", None))
async def get(self, uri: str, params=None) -> Dict:
"""
GET request with signed request headers
Args:
uri: request path
params: query parameters
Returns:
"""
final_uri = uri
if isinstance(params, dict):
final_uri = (f"{uri}?"
f"{urlencode(params)}")
headers = await self._pre_headers(final_uri)
return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=headers)
async def post(self, uri: str, data: dict) -> Dict:
"""
POST request with signed request headers
Args:
uri: request path
data: request body parameters
Returns:
"""
headers = await self._pre_headers(uri, data)
json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
return await self.request(method="POST", url=f"{self._host}{uri}",
data=json_str, headers=headers)
async def pong(self) -> bool:
"""
Check whether the login state has expired
Returns:
"""
"""get a note to check if login state is ok"""
utils.logger.info("[XiaoHongShuClient.pong] Begin to pong xhs...")
ping_flag = False
try:
note_card: Dict = await self.get_note_by_keyword(keyword="小红书")
if note_card.get("items"):
ping_flag = True
except Exception as e:
utils.logger.error(f"[XiaoHongShuClient.pong] Ping xhs failed: {e}, and try to login again...")
ping_flag = False
return ping_flag
async def update_cookies(self, browser_context: BrowserContext):
"""
Update the client's cookies; usually called right after a successful login
Args:
browser_context: browser context object
Returns:
"""
cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
self.headers["Cookie"] = cookie_str
self.cookie_dict = cookie_dict
async def get_note_by_keyword(
self, keyword: str,
page: int = 1, page_size: int = 20,
sort: SearchSortType = SearchSortType.GENERAL,
note_type: SearchNoteType = SearchNoteType.ALL
) -> Dict:
"""
Search notes by keyword
Args:
keyword: search keyword
page: page number
page_size: page size
sort: sort order of the search results
note_type: type of notes to search for
Returns:
"""
uri = "/api/sns/web/v1/search/notes"
data = {
"keyword": keyword,
"page": page,
"page_size": page_size,
"search_id": get_search_id(),
"sort": sort.value,
"note_type": note_type.value
}
return await self.post(uri, data)
async def get_note_by_id(self, note_id: str) -> Dict:
"""
Get note detail
Args:
note_id: note id
Returns:
"""
data = {"source_note_id": note_id}
uri = "/api/sns/web/v1/feed"
res = await self.post(uri, data)
if res and res.get("items"):
res_dict: Dict = res["items"][0]["note_card"]
return res_dict
utils.logger.error(f"[XiaoHongShuClient.get_note_by_id] get note empty and res:{res}")
return dict()
async def get_note_comments(self, note_id: str, cursor: str = "") -> Dict:
"""
Get first-level comments of a note
Args:
note_id: note id
cursor: pagination cursor
Returns:
"""
uri = "/api/sns/web/v2/comment/page"
params = {
"note_id": note_id,
"cursor": cursor,
"top_comment_id": "",
"image_formats": "jpg,webp,avif"
}
return await self.get(uri, params)
async def get_note_sub_comments(self, note_id: str, root_comment_id: str, num: int = 10, cursor: str = ""):
"""
Get sub comments under a given root comment
Args:
note_id: id of the note the comments belong to
root_comment_id: root comment id
num: page size
cursor: pagination cursor
Returns:
"""
uri = "/api/sns/web/v2/comment/sub/page"
params = {
"note_id": note_id,
"root_comment_id": root_comment_id,
"num": num,
"cursor": cursor,
}
return await self.get(uri, params)
async def get_note_all_comments(self, note_id: str, crawl_interval: float = 1.0,
callback: Optional[Callable] = None) -> List[Dict]:
"""
Get all first-level comments of a note; keeps paging until every comment has been fetched
Args:
note_id: note id
crawl_interval: delay between requests, in seconds
callback: invoked after each page of comments is fetched
Returns:
"""
result = []
comments_has_more = True
comments_cursor = ""
while comments_has_more:
comments_res = await self.get_note_comments(note_id, comments_cursor)
comments_has_more = comments_res.get("has_more", False)
comments_cursor = comments_res.get("cursor", "")
if "comments" not in comments_res:
utils.logger.info(
f"[XiaoHongShuClient.get_note_all_comments] No 'comments' key found in response: {comments_res}")
break
comments = comments_res["comments"]
if callback:
await callback(note_id, comments)
await asyncio.sleep(crawl_interval)
result.extend(comments)
sub_comments = await self.get_comments_all_sub_comments(comments, crawl_interval, callback)
result.extend(sub_comments)
return result
async def get_comments_all_sub_comments(self, comments: List[Dict], crawl_interval: float = 1.0,
callback: Optional[Callable] = None) -> List[Dict]:
"""
Get all second-level comments under the given first-level comments; keeps paging until every sub comment has been fetched
Args:
comments: list of first-level comments
crawl_interval: delay between requests, in seconds
callback: invoked after each page of sub comments is fetched
Returns:
"""
if not config.ENABLE_GET_SUB_COMMENTS:
utils.logger.info(f"[XiaoHongShuCrawler.get_comments_all_sub_comments] Crawling sub_comment mode is not enabled")
return []
result = []
for comment in comments:
note_id = comment.get("note_id")
sub_comments = comment.get("sub_comments")
if sub_comments and callback:
await callback(note_id, sub_comments)
sub_comment_has_more = comment.get("sub_comment_has_more")
if not sub_comment_has_more:
continue
root_comment_id = comment.get("id")
sub_comment_cursor = comment.get("sub_comment_cursor")
while sub_comment_has_more:
comments_res = await self.get_note_sub_comments(note_id, root_comment_id, 10, sub_comment_cursor)
sub_comment_has_more = comments_res.get("has_more", False)
sub_comment_cursor = comments_res.get("cursor", "")
if "comments" not in comments_res:
utils.logger.info(
f"[XiaoHongShuClient.get_comments_all_sub_comments] No 'comments' key found in response: {comments_res}")
break
comments = comments_res["comments"]
if callback:
await callback(note_id, comments)
await asyncio.sleep(crawl_interval)
result.extend(comments)
return result
async def get_explore_id(self) -> list:
uri = f"/explore"
html_content = await self.request("GET", self._domain + uri, return_response=True, headers=self.headers)
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find('div', class_='feeds-container')
section_list = div.find_all('section')
explore_id = []
for s in section_list:
a = s.find('a')
id_url = a['href']
tmp_list = id_url.split('/')
assert len(tmp_list) == 3
id = tmp_list[2]
explore_id.append(id)
return explore_id
async def get_creator_info(self, user_id: str) -> Dict:
"""
Get brief creator info by parsing the web user profile HTML
The PC profile page embeds the data in the window.__INITIAL_STATE__ variable, which is parsed here
eg: https://www.xiaohongshu.com/user/profile/59d8cb33de5fb4696bf17217
"""
uri = f"/user/profile/{user_id}"
html_content = await self.request("GET", self._domain + uri, return_response=True, headers=self.headers)
match = re.search(r'<script>window.__INITIAL_STATE__=(.+)<\/script>', html_content, re.M)
if match is None:
return {}
info = json.loads(match.group(1).replace(':undefined', ':null'), strict=False)
if info is None:
return {}
return info.get('user').get('userPageData')
async def get_notes_by_creator(
self, creator: str,
cursor: str,
page_size: int = 30
) -> Dict:
"""
Get the creator's notes
Args:
creator: creator user id
cursor: id of the last note on the previous page
page_size: page size
Returns:
"""
uri = "/api/sns/web/v1/user_posted"
data = {
"user_id": creator,
"cursor": cursor,
"num": page_size,
"image_formats": "jpg,webp,avif"
}
return await self.get(uri, data)
async def get_all_notes_by_creator(self, user_id: str, crawl_interval: float = 1.0,
callback: Optional[Callable] = None) -> List[Dict]:
"""
Get all notes published by the given user; keeps paging until every note has been fetched
Args:
user_id: user id
crawl_interval: delay between requests, in seconds
callback: invoked after each page of notes is fetched
Returns:
"""
result = []
notes_has_more = True
notes_cursor = ""
while notes_has_more:
notes_res = await self.get_notes_by_creator(user_id, notes_cursor)
if not notes_res:
utils.logger.error(f"[XiaoHongShuClient.get_notes_by_creator] The current creator may have been banned by xhs, so they cannot access the data.")
break
notes_has_more = notes_res.get("has_more", False)
notes_cursor = notes_res.get("cursor", "")
if "notes" not in notes_res:
utils.logger.info(f"[XiaoHongShuClient.get_all_notes_by_creator] No 'notes' key found in response: {notes_res}")
break
notes = notes_res["notes"]
utils.logger.info(f"[XiaoHongShuClient.get_all_notes_by_creator] got user_id:{user_id} notes len : {len(notes)}")
if callback:
await callback(notes)
await asyncio.sleep(crawl_interval)
result.extend(notes)
return result
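get_note_all_comments and get_all_notes_by_creator above both follow the same has_more/cursor paging loop; a standalone sketch of that loop against a fake two-page fetcher:
import asyncio
from typing import Dict

FAKE_PAGES = {
    "": {"comments": ["c1", "c2"], "has_more": True, "cursor": "p2"},
    "p2": {"comments": ["c3"], "has_more": False, "cursor": ""},
}

async def fake_get_note_comments(note_id: str, cursor: str) -> Dict:
    return FAKE_PAGES[cursor]          # stand-in for the signed HTTP request

async def get_all(note_id: str):
    result, has_more, cursor = [], True, ""
    while has_more:
        page = await fake_get_note_comments(note_id, cursor)
        has_more = page.get("has_more", False)
        cursor = page.get("cursor", "")
        result.extend(page.get("comments", []))
    return result

print(asyncio.run(get_all("demo-note")))   # ['c1', 'c2', 'c3']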

300
media_platform/xhs/core.py Normal file
View file

@ -0,0 +1,300 @@
import asyncio
import os
import random
from asyncio import Task
import time
from typing import Dict, List, Optional, Tuple
from playwright.async_api import (BrowserContext, BrowserType, Page,
async_playwright)
import config
from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import xhs as xhs_store
from tools import utils
from var import crawler_type_var
from .client import XiaoHongShuClient
from .exception import DataFetchError
from .field import SearchSortType
from .login import XiaoHongShuLogin
class XiaoHongShuCrawler(AbstractCrawler):
context_page: Page
xhs_client: XiaoHongShuClient
browser_context: BrowserContext
def __init__(self) -> None:
self.index_url = "https://www.xiaohongshu.com"
self.user_agent = utils.get_user_agent()
async def start(self) -> None:
playwright_proxy_format, httpx_proxy_format = None, None
if config.ENABLE_IP_PROXY:
ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
# add a cookie attribute webId to avoid the appearance of a sliding captcha on the webpage
await self.browser_context.add_cookies([{
'name': "webId",
'value': "xxx123", # any value
'domain': ".xiaohongshu.com",
'path': "/"
}])
self.context_page = await self.browser_context.new_page()
await self.context_page.goto(self.index_url)
# Create a client to interact with the xiaohongshu website.
self.xhs_client = await self.create_xhs_client(httpx_proxy_format)
if not await self.xhs_client.pong():
login_obj = XiaoHongShuLogin(
login_type=config.LOGIN_TYPE,
login_phone="", # input your phone number
browser_context=self.browser_context,
context_page=self.context_page,
cookie_str=config.COOKIES
)
await login_obj.begin()
await self.xhs_client.update_cookies(browser_context=self.browser_context)
crawler_type_var.set(config.CRAWLER_TYPE)
if config.CRAWLER_TYPE == "search":
# Search for notes and retrieve their comment information.
await self.search()
elif config.CRAWLER_TYPE == "detail":
# Get the information and comments of the specified post
await self.get_specified_notes()
elif config.CRAWLER_TYPE == "creator":
# Get creator's information and their notes and comments
await self.get_creators_and_notes()
elif config.CRAWLER_TYPE == "explore":
await self.get_explore()
else:
pass
utils.logger.info("[XiaoHongShuCrawler.start] Xhs Crawler finished ...")
async def search(self) -> None:
"""Search for notes and retrieve their comment information."""
utils.logger.info("[XiaoHongShuCrawler.search] Begin search xiaohongshu keywords")
xhs_limit_count = 20 # xhs limit page fixed value
if config.CRAWLER_MAX_NOTES_COUNT < xhs_limit_count:
config.CRAWLER_MAX_NOTES_COUNT = xhs_limit_count
start_page = config.START_PAGE
for keyword in config.KEYWORDS.split(","):
utils.logger.info(f"[XiaoHongShuCrawler.search] Current search keyword: {keyword}")
page = 1
while (page - start_page + 1) * xhs_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
if page < start_page:
utils.logger.info(f"[XiaoHongShuCrawler.search] Skip page {page}")
page += 1
continue
try:
utils.logger.info(f"[XiaoHongShuCrawler.search] search xhs keyword: {keyword}, page: {page}")
note_id_list: List[str] = []
notes_res = await self.xhs_client.get_note_by_keyword(
keyword=keyword,
page=page,
sort=SearchSortType(config.SORT_TYPE) if config.SORT_TYPE != '' else SearchSortType.GENERAL,
)
utils.logger.info(f"[XiaoHongShuCrawler.search] Search notes res:{notes_res}")
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_note_detail(post_item.get("id"), semaphore)
for post_item in notes_res.get("items", {})
if post_item.get('model_type') not in ('rec_query', 'hot_query')
]
note_details = await asyncio.gather(*task_list)
for note_detail in note_details:
if note_detail is not None:
await xhs_store.update_xhs_note(note_detail)
note_id_list.append(note_detail.get("note_id"))
page += 1
utils.logger.info(f"[XiaoHongShuCrawler.search] Note details: {note_details}")
await self.batch_get_note_comments(note_id_list)
except DataFetchError:
utils.logger.error("[XiaoHongShuCrawler.search] Get note detail error")
break
async def get_explore(self) -> None:
explore_id = await self.xhs_client.get_explore_id()
print("[+]GET explore content:")
for id in explore_id:
note_info = await self.xhs_client.get_note_by_id(id)
ip_location = note_info['ip_location']
last_update_time = str(note_info['last_update_time'])
user_name = note_info['user']['nickname']
user_id = note_info['user']['user_id']
content = note_info['desc']
timeArray = time.localtime(int(last_update_time[:-3]))
otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
if len(content) <= 40:
show = content.replace("\n","\\n")
else:
show = content[:40].replace("\n","\\n") + "..."
print(f"[*]IP:{ip_location},Update Time:{otherStyleTime},User Name:{user_name},Content:{show}")
async def get_creators_and_notes(self) -> None:
"""Get creator's notes and retrieve their comment information."""
utils.logger.info("[XiaoHongShuCrawler.get_creators_and_notes] Begin get xiaohongshu creators")
for user_id in config.XHS_CREATOR_ID_LIST:
# get creator detail info from web html content
creator_info: Dict = await self.xhs_client.get_creator_info(user_id=user_id)
if creator_info:
await xhs_store.save_creator(user_id, creator=creator_info)
# Get all note information of the creator
all_notes_list = await self.xhs_client.get_all_notes_by_creator(
user_id=user_id,
crawl_interval=random.random(),
callback=self.fetch_creator_notes_detail
)
note_ids = [note_item.get("note_id") for note_item in all_notes_list]
# print("note_ids:",note_ids)
await self.batch_get_note_comments(note_ids)
async def fetch_creator_notes_detail(self, note_list: List[Dict]):
"""
Concurrently obtain the specified post list and save the data
"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_note_detail(post_item.get("note_id"), semaphore) for post_item in note_list
]
note_details = await asyncio.gather(*task_list)
for note_detail in note_details:
if note_detail is not None:
await xhs_store.update_xhs_note(note_detail)
async def get_specified_notes(self):
"""Get the information and comments of the specified post"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_note_detail(note_id=note_id, semaphore=semaphore) for note_id in config.XHS_SPECIFIED_ID_LIST
]
note_details = await asyncio.gather(*task_list)
for note_detail in note_details:
if note_detail is not None:
await xhs_store.update_xhs_note(note_detail)
await self.batch_get_note_comments(config.XHS_SPECIFIED_ID_LIST)
async def get_note_detail(self, note_id: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
"""Get note detail"""
async with semaphore:
try:
return await self.xhs_client.get_note_by_id(note_id)
except DataFetchError as ex:
utils.logger.error(f"[XiaoHongShuCrawler.get_note_detail] Get note detail error: {ex}")
return None
except KeyError as ex:
utils.logger.error(
f"[XiaoHongShuCrawler.get_note_detail] have not fund note detail note_id:{note_id}, err: {ex}")
return None
async def batch_get_note_comments(self, note_list: List[str]):
"""Batch get note comments"""
if not config.ENABLE_GET_COMMENTS:
utils.logger.info(f"[XiaoHongShuCrawler.batch_get_note_comments] Crawling comment mode is not enabled")
return
utils.logger.info(
f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list: List[Task] = []
for note_id in note_list:
task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
task_list.append(task)
await asyncio.gather(*task_list)
async def get_comments(self, note_id: str, semaphore: asyncio.Semaphore):
"""Get note comments with keyword filtering and quantity limitation"""
async with semaphore:
utils.logger.info(f"[XiaoHongShuCrawler.get_comments] Begin get note id comments {note_id}")
await self.xhs_client.get_note_all_comments(
note_id=note_id,
crawl_interval=random.random(),
callback=xhs_store.batch_update_xhs_note_comments
)
@staticmethod
def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
"""format proxy info for playwright and httpx"""
playwright_proxy = {
"server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
"username": ip_proxy_info.user,
"password": ip_proxy_info.password,
}
httpx_proxy = {
f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
}
return playwright_proxy, httpx_proxy
async def create_xhs_client(self, httpx_proxy: Optional[str]) -> XiaoHongShuClient:
"""Create xhs client"""
utils.logger.info("[XiaoHongShuCrawler.create_xhs_client] Begin create xiaohongshu API client ...")
cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
xhs_client_obj = XiaoHongShuClient(
proxies=httpx_proxy,
headers={
"User-Agent": self.user_agent,
"Cookie": cookie_str,
"Origin": "https://www.xiaohongshu.com",
"Referer": "https://www.xiaohongshu.com",
"Content-Type": "application/json;charset=UTF-8"
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
)
return xhs_client_obj
async def launch_browser(
self,
chromium: BrowserType,
playwright_proxy: Optional[Dict],
user_agent: Optional[str],
headless: bool = True
) -> BrowserContext:
"""Launch browser and create browser context"""
utils.logger.info("[XiaoHongShuCrawler.launch_browser] Begin create browser context ...")
if config.SAVE_LOGIN_STATE:
# feat issue #14
# we will save login state to avoid login every time
user_data_dir = os.path.join(os.getcwd(), "browser_data",
config.USER_DATA_DIR % config.PLATFORM) # type: ignore
browser_context = await chromium.launch_persistent_context(
user_data_dir=user_data_dir,
accept_downloads=True,
headless=headless,
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
else:
browser = await chromium.launch(headless=headless, proxy=playwright_proxy) # type: ignore
browser_context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=user_agent
)
return browser_context
async def close(self):
"""Close browser context"""
await self.browser_context.close()
utils.logger.info("[XiaoHongShuCrawler.close] Browser context closed ...")()

View file

@ -0,0 +1,9 @@
from httpx import RequestError
class DataFetchError(RequestError):
"""something error when fetch"""
class IPBlockError(RequestError):
"""fetch so fast that the server block us ip"""

View file

@ -0,0 +1,72 @@
from enum import Enum
from typing import NamedTuple
class FeedType(Enum):
# recommended
RECOMMEND = "homefeed_recommend"
# fashion
FASION = "homefeed.fashion_v3"
# food
FOOD = "homefeed.food_v3"
# cosmetics
COSMETICS = "homefeed.cosmetics_v3"
# movies & tv
MOVIE = "homefeed.movie_and_tv_v3"
# career
CAREER = "homefeed.career_v3"
# relationships
EMOTION = "homefeed.love_v3"
# home & living
HOURSE = "homefeed.household_product_v3"
# gaming
GAME = "homefeed.gaming_v3"
# travel
TRAVEL = "homefeed.travel_v3"
# fitness
FITNESS = "homefeed.fitness_v3"
class NoteType(Enum):
NORMAL = "normal"
VIDEO = "video"
class SearchSortType(Enum):
"""search sort type"""
# default
GENERAL = "general"
# most popular
MOST_POPULAR = "popularity_descending"
# Latest
LATEST = "time_descending"
class SearchNoteType(Enum):
"""search note type
"""
# default
ALL = 0
# only video
VIDEO = 1
# only image
IMAGE = 2
class Note(NamedTuple):
"""note tuple"""
note_id: str
title: str
desc: str
type: str
user: dict
img_urls: list
video_url: str
tag_list: list
at_user_list: list
collected_count: str
comment_count: str
liked_count: str
share_count: str
time: int
last_update_time: int

287
media_platform/xhs/help.py Normal file
View file

@ -0,0 +1,287 @@
import ctypes
import json
import random
import time
import urllib.parse
def sign(a1="", b1="", x_s="", x_t=""):
"""
takes in a URI (uniform resource identifier), an optional data dictionary, and an optional ctime parameter. It returns a dictionary containing two keys: "x-s" and "x-t".
"""
common = {
"s0": 5, # getPlatformCode
"s1": "",
"x0": "1", # localStorage.getItem("b1b1")
"x1": "3.3.0", # version
"x2": "Windows",
"x3": "xhs-pc-web",
"x4": "1.4.4",
"x5": a1, # cookie of a1
"x6": x_t,
"x7": x_s,
"x8": b1, # localStorage.getItem("b1")
"x9": mrc(x_t + x_s + b1),
"x10": 1, # getSigCount
}
encode_str = encodeUtf8(json.dumps(common, separators=(',', ':')))
x_s_common = b64Encode(encode_str)
x_b3_traceid = get_b3_trace_id()
return {
"x-s": x_s,
"x-t": x_t,
"x-s-common": x_s_common,
"x-b3-traceid": x_b3_traceid
}
def get_b3_trace_id():
re = "abcdef0123456789"
je = 16
e = ""
for t in range(16):
e += re[random.randint(0, je - 1)]
return e
def mrc(e):
ie = [
0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685,
2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995,
2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648,
2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990,
1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755,
2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145,
1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206,
2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980,
1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705,
3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527,
1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772,
4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290,
251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719,
3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925,
453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202,
4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960,
984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733,
3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467,
855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048,
3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054,
702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443,
3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945,
2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430,
2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580,
2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225,
1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143,
2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732,
1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850,
2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135,
1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109,
3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954,
1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920,
3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877,
83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603,
3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992,
534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934,
4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795,
376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105,
3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270,
936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108,
3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449,
601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471,
3272380065, 1510334235, 755167117,
]
o = -1
def right_without_sign(num: int, bit: int=0) -> int:
val = ctypes.c_uint32(num).value >> bit
MAX32INT = 4294967295
return (val + (MAX32INT + 1)) % (2 * (MAX32INT + 1)) - MAX32INT - 1
for n in range(57):
o = ie[(o & 255) ^ ord(e[n])] ^ right_without_sign(o, 8)
return o ^ -1 ^ 3988292384
lookup = [
"Z",
"m",
"s",
"e",
"r",
"b",
"B",
"o",
"H",
"Q",
"t",
"N",
"P",
"+",
"w",
"O",
"c",
"z",
"a",
"/",
"L",
"p",
"n",
"g",
"G",
"8",
"y",
"J",
"q",
"4",
"2",
"K",
"W",
"Y",
"j",
"0",
"D",
"S",
"f",
"d",
"i",
"k",
"x",
"3",
"V",
"T",
"1",
"6",
"I",
"l",
"U",
"A",
"F",
"M",
"9",
"7",
"h",
"E",
"C",
"v",
"u",
"R",
"X",
"5",
]
def tripletToBase64(e):
return (
lookup[63 & (e >> 18)] +
lookup[63 & (e >> 12)] +
lookup[(e >> 6) & 63] +
lookup[e & 63]
)
def encodeChunk(e, t, r):
m = []
for b in range(t, r, 3):
n = (16711680 & (e[b] << 16)) + \
((e[b + 1] << 8) & 65280) + (e[b + 2] & 255)
m.append(tripletToBase64(n))
return ''.join(m)
def b64Encode(e):
P = len(e)
W = P % 3
U = []
z = 16383
H = 0
Z = P - W
while H < Z:
U.append(encodeChunk(e, H, Z if H + z > Z else H + z))
H += z
if 1 == W:
F = e[P - 1]
U.append(lookup[F >> 2] + lookup[(F << 4) & 63] + "==")
elif 2 == W:
F = (e[P - 2] << 8) + e[P - 1]
U.append(lookup[F >> 10] + lookup[63 & (F >> 4)] +
lookup[(F << 2) & 63] + "=")
return "".join(U)
def encodeUtf8(e):
b = []
m = urllib.parse.quote(e, safe='~()*!.\'')
w = 0
while w < len(m):
T = m[w]
if T == "%":
E = m[w + 1] + m[w + 2]
S = int(E, 16)
b.append(S)
w += 2
else:
b.append(ord(T[0]))
w += 1
return b
def base36encode(number, alphabet='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
"""Converts an integer to a base36 string."""
if not isinstance(number, int):
raise TypeError('number must be an integer')
base36 = ''
sign = ''
if number < 0:
sign = '-'
number = -number
if 0 <= number < len(alphabet):
return sign + alphabet[number]
while number != 0:
number, i = divmod(number, len(alphabet))
base36 = alphabet[i] + base36
return sign + base36
def base36decode(number):
return int(number, 36)
def get_search_id():
e = int(time.time() * 1000) << 64
t = int(random.uniform(0, 2147483646))
return base36encode((e + t))
img_cdns = [
"https://sns-img-qc.xhscdn.com",
"https://sns-img-hw.xhscdn.com",
"https://sns-img-bd.xhscdn.com",
"https://sns-img-qn.xhscdn.com",
]
def get_img_url_by_trace_id(trace_id: str, format_type: str = "png"):
return f"{random.choice(img_cdns)}/{trace_id}?imageView2/format/{format_type}"
def get_img_urls_by_trace_id(trace_id: str, format_type: str = "png"):
return [f"{cdn}/{trace_id}?imageView2/format/{format_type}" for cdn in img_cdns]
def get_trace_id(img_url: str):
# images uploaded from the browser have an extra /spectrum/ path segment
return f"spectrum/{img_url.split('/')[-1]}" if img_url.find("spectrum") != -1 else img_url.split("/")[-1]
if __name__ == '__main__':
_img_url = "https://sns-img-bd.xhscdn.com/7a3abfaf-90c1-a828-5de7-022c80b92aa3"
# get the URLs of a single image across multiple CDNs
# final_img_urls = get_img_urls_by_trace_id(get_trace_id(_img_url))
final_img_url = get_img_url_by_trace_id(get_trace_id(_img_url))
print(final_img_url)
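For completeness, a small sketch of the search id that get_search_id() above produces: the current millisecond timestamp shifted left by 64 bits plus a random component, rendered in base36 (the output naturally varies per call):
ts_part = int(time.time() * 1000) << 64          # same construction as get_search_id()
rand_part = int(random.uniform(0, 2147483646))
print(base36encode(ts_part + rand_part))         # a base36 string of roughly 20 characters
print(get_search_id())                           # the value sent as "search_id" in the search payload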

186
media_platform/xhs/login.py Normal file
View file

@ -0,0 +1,186 @@
import asyncio
import functools
import sys
from typing import Optional
from playwright.async_api import BrowserContext, Page
from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
wait_fixed)
import config
from base.base_crawler import AbstractLogin
from cache.cache_factory import CacheFactory
from tools import utils
class XiaoHongShuLogin(AbstractLogin):
def __init__(self,
login_type: str,
browser_context: BrowserContext,
context_page: Page,
login_phone: Optional[str] = "",
cookie_str: str = ""
):
config.LOGIN_TYPE = login_type
self.browser_context = browser_context
self.context_page = context_page
self.login_phone = login_phone
self.cookie_str = cookie_str
@retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
async def check_login_state(self, no_logged_in_session: str) -> bool:
"""
Check if the current login status is successful; return True if it is, otherwise False.
The retry decorator re-invokes this method up to 600 times (once per second) while the return value is False;
if the maximum number of retries is reached, a RetryError is raised.
"""
if "请通过验证" in await self.context_page.content():
utils.logger.info("[XiaoHongShuLogin.check_login_state] 登录过程中出现验证码,请手动验证")
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
current_web_session = cookie_dict.get("web_session")
if current_web_session != no_logged_in_session:
return True
return False
async def begin(self):
"""Start login xiaohongshu"""
utils.logger.info("[XiaoHongShuLogin.begin] Begin login xiaohongshu ...")
if config.LOGIN_TYPE == "qrcode":
await self.login_by_qrcode()
elif config.LOGIN_TYPE == "phone":
await self.login_by_mobile()
elif config.LOGIN_TYPE == "cookie":
await self.login_by_cookies()
else:
raise ValueError("[XiaoHongShuLogin.begin] Invalid login type. Currently only qrcode, phone or cookie login is supported ...")
async def login_by_mobile(self):
"""Login xiaohongshu by mobile"""
utils.logger.info("[XiaoHongShuLogin.login_by_mobile] Begin login xiaohongshu by mobile ...")
await asyncio.sleep(1)
try:
# 小红书进入首页后,有可能不会自动弹出登录框,需要手动点击登录按钮
login_button_ele = await self.context_page.wait_for_selector(
selector="xpath=//*[@id='app']/div[1]/div[2]/div[1]/ul/div[1]/button",
timeout=5000
)
await login_button_ele.click()
# 弹窗的登录对话框也有两种形态,一种是直接可以看到手机号和验证码的
# 另一种是需要点击切换到手机登录的
element = await self.context_page.wait_for_selector(
selector='xpath=//div[@class="login-container"]//div[@class="other-method"]/div[1]',
timeout=5000
)
await element.click()
except Exception as e:
utils.logger.info("[XiaoHongShuLogin.login_by_mobile] have not found mobile button icon and keep going ...")
await asyncio.sleep(1)
login_container_ele = await self.context_page.wait_for_selector("div.login-container")
input_ele = await login_container_ele.query_selector("label.phone > input")
await input_ele.fill(self.login_phone)
await asyncio.sleep(0.5)
send_btn_ele = await login_container_ele.query_selector("label.auth-code > span")
await send_btn_ele.click() # 点击发送验证码
sms_code_input_ele = await login_container_ele.query_selector("label.auth-code > input")
submit_btn_ele = await login_container_ele.query_selector("div.input-container > button")
cache_client = CacheFactory.create_cache(config.CACHE_TYPE_MEMORY)
max_get_sms_code_time = 60 * 2 # 最长获取验证码的时间为2分钟
no_logged_in_session = ""
while max_get_sms_code_time > 0:
utils.logger.info(f"[XiaoHongShuLogin.login_by_mobile] get sms code from redis remaining time {max_get_sms_code_time}s ...")
await asyncio.sleep(1)
sms_code_key = f"xhs_{self.login_phone}"
sms_code_value = cache_client.get(sms_code_key)
if not sms_code_value:
max_get_sms_code_time -= 1
continue
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
no_logged_in_session = cookie_dict.get("web_session")
await sms_code_input_ele.fill(value=sms_code_value.decode()) # 输入短信验证码
await asyncio.sleep(0.5)
agree_privacy_ele = self.context_page.locator("xpath=//div[@class='agreements']//*[local-name()='svg']")
await agree_privacy_ele.click() # 点击同意隐私协议
await asyncio.sleep(0.5)
await submit_btn_ele.click() # 点击登录
# todo ... 应该还需要检查验证码的正确性有可能输入的验证码不正确
break
try:
await self.check_login_state(no_logged_in_session)
except RetryError:
utils.logger.info("[XiaoHongShuLogin.login_by_mobile] Login xiaohongshu failed by mobile login method ...")
sys.exit()
wait_redirect_seconds = 5
utils.logger.info(f"[XiaoHongShuLogin.login_by_mobile] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
await asyncio.sleep(wait_redirect_seconds)
async def login_by_qrcode(self):
"""login xiaohongshu website and keep webdriver login state"""
utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] Begin login xiaohongshu by qrcode ...")
# login_selector = "div.login-container > div.left > div.qrcode > img"
qrcode_img_selector = "xpath=//img[@class='qrcode-img']"
# find login qrcode
base64_qrcode_img = await utils.find_login_qrcode(
self.context_page,
selector=qrcode_img_selector
)
if not base64_qrcode_img:
utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] login failed , have not found qrcode please check ....")
# if this website does not automatically popup login dialog box, we will manual click login button
await asyncio.sleep(0.5)
login_button_ele = self.context_page.locator("xpath=//*[@id='app']/div[1]/div[2]/div[1]/ul/div[1]/button")
await login_button_ele.click()
base64_qrcode_img = await utils.find_login_qrcode(
self.context_page,
selector=qrcode_img_selector
)
if not base64_qrcode_img:
sys.exit()
# get not logged session
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
no_logged_in_session = cookie_dict.get("web_session")
# show login qrcode
# fix issue #12
# we need to use partial function to call show_qrcode function and run in executor
# then current asyncio event loop will not be blocked
partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
utils.logger.info(f"[XiaoHongShuLogin.login_by_qrcode] waiting for scan code login, remaining time is 120s")
try:
await self.check_login_state(no_logged_in_session)
except RetryError:
utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] Login xiaohongshu failed by qrcode login method ...")
sys.exit()
wait_redirect_seconds = 5
utils.logger.info(f"[XiaoHongShuLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
await asyncio.sleep(wait_redirect_seconds)
async def login_by_cookies(self):
"""login xiaohongshu website by cookies"""
utils.logger.info("[XiaoHongShuLogin.login_by_cookies] Begin login xiaohongshu by cookie ...")
for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
if key != "web_session": # only set web_session cookie attr
continue
await self.browser_context.add_cookies([{
'name': key,
'value': value,
'domain': ".xiaohongshu.com",
'path': "/"
}])
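The login flow above hinges on one polling idiom: check_login_state keeps returning False until the web_session cookie changes, and the tenacity decorator re-invokes it once per second for up to 600 attempts, raising RetryError only when every attempt failed, which login_by_mobile/login_by_qrcode catch and abort with sys.exit(). A minimal stand-alone sketch of that idiom (the names and the 5-attempt limit here are illustrative, not taken from the project):

import asyncio
from tenacity import RetryError, retry, retry_if_result, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(5), wait=wait_fixed(1),
       retry=retry_if_result(lambda ok: ok is False))
async def wait_until_ready(state: dict) -> bool:
    # keep returning False until some external condition flips (e.g. a cookie change)
    return state.get("ready", False)

async def demo() -> None:
    try:
        await wait_until_ready({"ready": True})
        print("logged in")
    except RetryError:
        print("gave up after 5 attempts")

asyncio.run(demo())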

9
mypy.ini Normal file
View file

@ -0,0 +1,9 @@
[mypy]
warn_return_any = True
warn_unused_configs = True
[mypy-cv2]
ignore_missing_imports = True
[mypy-execjs]
ignore_missing_imports = True

1
note_info.txt Normal file

File diff suppressed because one or more lines are too long

5
proxy/__init__.py Normal file
View file

@ -0,0 +1,5 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 14:37
# @Desc : IP代理池入口
from .base_proxy import *

63
proxy/base_proxy.py Normal file
View file

@ -0,0 +1,63 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 11:18
# @Desc : 爬虫 IP 获取实现
# @Url : 快代理HTTP实现官方文档https://www.kuaidaili.com/?ref=ldwkjqipvz6c
import json
from abc import ABC, abstractmethod
from typing import List
import config
from cache.abs_cache import AbstractCache
from cache.cache_factory import CacheFactory
from tools import utils
from .types import IpInfoModel
class IpGetError(Exception):
""" ip get error"""
class ProxyProvider(ABC):
@abstractmethod
async def get_proxies(self, num: int) -> List[IpInfoModel]:
"""
获取 IP 的抽象方法不同的 HTTP 代理商需要实现该方法
:param num: 提取的 IP 数量
:return:
"""
pass
class IpCache:
def __init__(self):
self.cache_client: AbstractCache = CacheFactory.create_cache(cache_type=config.CACHE_TYPE_MEMORY)
def set_ip(self, ip_key: str, ip_value_info: str, ex: int):
"""
设置IP并带有过期时间到期之后由 redis 负责删除
:param ip_key:
:param ip_value_info:
:param ex:
:return:
"""
self.cache_client.set(key=ip_key, value=ip_value_info, expire_time=ex)
def load_all_ip(self, proxy_brand_name: str) -> List[IpInfoModel]:
"""
redis 中加载所有还未过期的 IP 信息
:param proxy_brand_name: 代理商名称
:return:
"""
all_ip_list: List[IpInfoModel] = []
all_ip_keys: List[str] = self.cache_client.keys(pattern=f"{proxy_brand_name}_*")
try:
for ip_key in all_ip_keys:
ip_value = self.cache_client.get(ip_key)
if not ip_value:
continue
all_ip_list.append(IpInfoModel(**json.loads(ip_value)))
except Exception as e:
utils.logger.error("[IpCache.load_all_ip] get ip err from redis db", e)
return all_ip_list
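A usage sketch for IpCache (values are illustrative; the key prefix must match the proxy_brand_name that load_all_ip is queried with later, and the expiry is given in seconds):

from proxy.types import IpInfoModel

cache = IpCache()
ip_info = IpInfoModel(ip="1.2.3.4", port=8080, user="u", password="p", expired_time_ts=1700000000)
cache.set_ip("JISUHTTP_1.2.3.4_8080_u_p", ip_info.json(), ex=300)   # keep for 5 minutes
cached = cache.load_all_ip(proxy_brand_name="JISUHTTP")             # -> [IpInfoModel(...)]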

View file

@ -0,0 +1,6 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/5 10:13
# @Desc :
from .jishu_http_proxy import new_jisu_http_proxy
from .kuaidl_proxy import new_kuai_daili_proxy

View file

@ -0,0 +1,87 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/5 09:32
# @Desc : 已废弃倒闭了极速HTTP 代理IP实现. 请使用快代理实现proxy/providers/kuaidl_proxy.py
import os
from typing import Dict, List
from urllib.parse import urlencode
import httpx
from proxy import IpCache, IpGetError, ProxyProvider
from proxy.types import IpInfoModel
from tools import utils
class JiSuHttpProxy(ProxyProvider):
def __init__(self, key: str, crypto: str, time_validity_period: int):
"""
极速HTTP 代理IP实现
:param key: 提取key值 (去官网注册后获取)
:param crypto: 加密签名 (去官网注册后获取)
"""
self.proxy_brand_name = "JISUHTTP"
self.api_path = "https://api.jisuhttp.com"
self.params = {
"key": key,
"crypto": crypto,
"time": time_validity_period, # IP使用时长支持3、5、10、15、30分钟时效
"type": "json", # 数据结果为json
"port": "2", # IP协议1:HTTP、2:HTTPS、3:SOCKS5
"pw": "1", # 是否使用账密验证, 10否表示白名单验证默认为0
"se": "1", # 返回JSON格式时是否显示IP过期时间 1显示0不显示默认为0
}
self.ip_cache = IpCache()
async def get_proxies(self, num: int) -> List[IpInfoModel]:
"""
:param num:
:return:
"""
# 优先从缓存中拿 IP
ip_cache_list = self.ip_cache.load_all_ip(proxy_brand_name=self.proxy_brand_name)
if len(ip_cache_list) >= num:
return ip_cache_list[:num]
# 如果缓存中的数量不够从IP代理商获取补上再存入缓存中
need_get_count = num - len(ip_cache_list)
self.params.update({"num": need_get_count})
ip_infos = []
async with httpx.AsyncClient() as client:
url = self.api_path + "/fetchips" + '?' + urlencode(self.params)
utils.logger.info(f"[JiSuHttpProxy.get_proxies] get ip proxy url:{url}")
response = await client.get(url, headers={
"User-Agent": "MediaCrawler https://github.com/NanmiCoder/MediaCrawler"})
res_dict: Dict = response.json()
if res_dict.get("code") == 0:
data: List[Dict] = res_dict.get("data")
current_ts = utils.get_unix_timestamp()
for ip_item in data:
ip_info_model = IpInfoModel(
ip=ip_item.get("ip"),
port=ip_item.get("port"),
user=ip_item.get("user"),
password=ip_item.get("pass"),
expired_time_ts=utils.get_unix_time_from_time_str(ip_item.get("expire"))
)
ip_key = f"JISUHTTP_{ip_info_model.ip}_{ip_info_model.port}_{ip_info_model.user}_{ip_info_model.password}"
ip_value = ip_info_model.json()
ip_infos.append(ip_info_model)
self.ip_cache.set_ip(ip_key, ip_value, ex=ip_info_model.expired_time_ts - current_ts)
else:
raise IpGetError(res_dict.get("msg", "unknown err"))
return ip_cache_list + ip_infos
def new_jisu_http_proxy() -> JiSuHttpProxy:
"""
构造极速HTTP实例
Returns:
"""
return JiSuHttpProxy(
key=os.getenv("jisu_key", ""), # 通过环境变量的方式获取极速HTTPIP提取key值
crypto=os.getenv("jisu_crypto", ""), # 通过环境变量的方式获取极速HTTPIP提取加密签名
time_validity_period=30 # 30分钟最长时效
)

View file

@ -0,0 +1,134 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/5 09:43
# @Desc : 快代理HTTP实现官方文档https://www.kuaidaili.com/?ref=ldwkjqipvz6c
import os
import re
from typing import Dict, List
import httpx
from pydantic import BaseModel, Field
from proxy import IpCache, IpInfoModel, ProxyProvider
from proxy.types import ProviderNameEnum
from tools import utils
class KuaidailiProxyModel(BaseModel):
ip: str = Field(title="ip")
port: int = Field(title="端口")
expire_ts: int = Field(title="过期时间")
def parse_kuaidaili_proxy(proxy_info: str) -> KuaidailiProxyModel:
"""
解析快代理的IP信息
Args:
proxy_info:
Returns:
"""
proxies: List[str] = proxy_info.split(":")
if len(proxies) != 2:
raise Exception("not invalid kuaidaili proxy info")
pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d{1,5}),(\d+)'
match = re.search(pattern, proxy_info)
if not match:
raise Exception("kuaidaili proxy info did not match the expected pattern")
return KuaidailiProxyModel(
ip=match.groups()[0],
port=int(match.groups()[1]),
expire_ts=int(match.groups()[2])
)
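# Example (sketch; the input format follows the regex above, "ip:port,expire_seconds"):
#
#     parse_kuaidaili_proxy("1.2.3.4:8080,300")
#     # -> KuaidailiProxyModel(ip="1.2.3.4", port=8080, expire_ts=300)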
class KuaiDaiLiProxy(ProxyProvider):
def __init__(self, kdl_user_name: str, kdl_user_pwd: str, kdl_secret_id: str, kdl_signature: str):
"""
Args:
kdl_user_name:
kdl_user_pwd:
"""
self.kdl_user_name = kdl_user_name
self.kdl_user_pwd = kdl_user_pwd
self.api_base = "https://dps.kdlapi.com/"
self.secret_id = kdl_secret_id
self.signature = kdl_signature
self.ip_cache = IpCache()
self.proxy_brand_name = ProviderNameEnum.KUAI_DAILI_PROVIDER.value
self.params = {
"secret_id": self.secret_id,
"signature": self.signature,
"pt": 1,
"format": "json",
"sep": 1,
"f_et": 1,
}
async def get_proxies(self, num: int) -> List[IpInfoModel]:
"""
快代理实现
Args:
num:
Returns:
"""
uri = "/api/getdps/"
# 优先从缓存中拿 IP
ip_cache_list = self.ip_cache.load_all_ip(proxy_brand_name=self.proxy_brand_name)
if len(ip_cache_list) >= num:
return ip_cache_list[:num]
# 如果缓存中的数量不够从IP代理商获取补上再存入缓存中
need_get_count = num - len(ip_cache_list)
self.params.update({"num": need_get_count})
ip_infos: List[IpInfoModel] = []
async with httpx.AsyncClient() as client:
response = await client.get(self.api_base + uri, params=self.params)
if response.status_code != 200:
utils.logger.error(f"[KuaiDaiLiProxy.get_proxies] statuc code not 200 and response.txt:{response.text}")
raise Exception("get ip error from proxy provider and status code not 200 ...")
ip_response: Dict = response.json()
if ip_response.get("code") != 0:
utils.logger.error(f"[KuaiDaiLiProxy.get_proxies] code not 0 and msg:{ip_response.get('msg')}")
raise Exception("get ip error from proxy provider and code not 0 ...")
proxy_list: List[str] = ip_response.get("data", {}).get("proxy_list")
for proxy in proxy_list:
proxy_model = parse_kuaidaili_proxy(proxy)
ip_info_model = IpInfoModel(
ip=proxy_model.ip,
port=proxy_model.port,
user=self.kdl_user_name,
password=self.kdl_user_pwd,
expired_time_ts=proxy_model.expire_ts,
)
ip_key = f"{self.proxy_brand_name}_{ip_info_model.ip}_{ip_info_model.port}"
self.ip_cache.set_ip(ip_key, ip_info_model.model_dump_json(), ex=ip_info_model.expired_time_ts)
ip_infos.append(ip_info_model)
return ip_cache_list + ip_infos
def new_kuai_daili_proxy() -> KuaiDaiLiProxy:
"""
构造快代理HTTP实例
Returns:
"""
return KuaiDaiLiProxy(
kdl_secret_id=os.getenv("kdl_secret_id", "你的快代理secert_id"),
kdl_signature=os.getenv("kdl_signature", "你的快代理签名"),
kdl_user_name=os.getenv("kdl_user_name", "你的快代理用户名"),
kdl_user_pwd=os.getenv("kdl_user_pwd", "你的快代理密码"),
)

110
proxy/proxy_ip_pool.py Normal file
View file

@ -0,0 +1,110 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 13:45
# @Desc : ip代理池实现
import random
from typing import Dict, List
import httpx
from tenacity import retry, stop_after_attempt, wait_fixed
import config
from proxy.providers import new_jisu_http_proxy, new_kuai_daili_proxy
from tools import utils
from .base_proxy import ProxyProvider
from .types import IpInfoModel, ProviderNameEnum
class ProxyIpPool:
def __init__(self, ip_pool_count: int, enable_validate_ip: bool, ip_provider: ProxyProvider) -> None:
"""
Args:
ip_pool_count:
enable_validate_ip:
ip_provider:
"""
self.valid_ip_url = "https://httpbin.org/ip" # 验证 IP 是否有效的地址
self.ip_pool_count = ip_pool_count
self.enable_validate_ip = enable_validate_ip
self.proxy_list: List[IpInfoModel] = []
self.ip_provider: ProxyProvider = ip_provider
async def load_proxies(self) -> None:
"""
加载IP代理
Returns:
"""
self.proxy_list = await self.ip_provider.get_proxies(self.ip_pool_count)
async def _is_valid_proxy(self, proxy: IpInfoModel) -> bool:
"""
验证代理IP是否有效
:param proxy:
:return:
"""
utils.logger.info(f"[ProxyIpPool._is_valid_proxy] testing {proxy.ip} is it valid ")
try:
httpx_proxy = {
f"{proxy.protocol}": f"http://{proxy.user}:{proxy.password}@{proxy.ip}:{proxy.port}"
}
async with httpx.AsyncClient(proxies=httpx_proxy) as client:
response = await client.get(self.valid_ip_url)
if response.status_code == 200:
return True
else:
return False
except Exception as e:
utils.logger.info(f"[ProxyIpPool._is_valid_proxy] testing {proxy.ip} err: {e}")
raise e
@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def get_proxy(self) -> IpInfoModel:
"""
从代理池中随机提取一个代理IP
:return:
"""
if len(self.proxy_list) == 0:
await self._reload_proxies()
proxy = random.choice(self.proxy_list)
self.proxy_list.remove(proxy) # 取出来一个IP就应该移出掉
if self.enable_validate_ip:
if not await self._is_valid_proxy(proxy):
raise Exception("[ProxyIpPool.get_proxy] current ip invalid and again get it")
return proxy
async def _reload_proxies(self):
"""
# 重新加载代理池
:return:
"""
self.proxy_list = []
await self.load_proxies()
IpProxyProvider: Dict[str, ProxyProvider] = {
ProviderNameEnum.JISHU_HTTP_PROVIDER.value: new_jisu_http_proxy(),
ProviderNameEnum.KUAI_DAILI_PROVIDER.value: new_kuai_daili_proxy()
}
async def create_ip_pool(ip_pool_count: int, enable_validate_ip: bool) -> ProxyIpPool:
"""
创建 IP 代理池
:param ip_pool_count: ip池子的数量
:param enable_validate_ip: 是否开启验证IP代理
:return:
"""
pool = ProxyIpPool(ip_pool_count=ip_pool_count,
enable_validate_ip=enable_validate_ip,
ip_provider=IpProxyProvider.get(config.IP_PROXY_PROVIDER_NAME)
)
await pool.load_proxies()
return pool
if __name__ == '__main__':
pass
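A minimal usage sketch (assumes config.IP_PROXY_PROVIDER_NAME names one of the two providers registered in IpProxyProvider and that the matching credentials are available in the environment):

import asyncio

async def demo() -> None:
    pool = await create_ip_pool(ip_pool_count=2, enable_validate_ip=True)
    proxy = await pool.get_proxy()   # an IpInfoModel, removed from the pool once handed out
    print(proxy.ip, proxy.port)

# asyncio.run(demo())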

23
proxy/types.py Normal file
View file

@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/5 10:18
# @Desc : 基础类型
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field
class ProviderNameEnum(Enum):
JISHU_HTTP_PROVIDER: str = "jishuhttp"
KUAI_DAILI_PROVIDER: str = "kuaidaili"
class IpInfoModel(BaseModel):
"""Unified IP model"""
ip: str = Field(title="ip")
port: int = Field(title="端口")
user: str = Field(title="IP代理认证的用户名")
protocol: str = Field(default="https://", title="代理IP的协议")
password: str = Field(title="IP代理认证用户的密码")
expired_time_ts: Optional[int] = Field(title="IP 过期时间")

68
recv_sms.py Normal file
View file

@ -0,0 +1,68 @@
import re
from typing import List
import uvicorn
from fastapi import FastAPI, HTTPException, status
from pydantic import BaseModel
import config
from cache.abs_cache import AbstractCache
from cache.cache_factory import CacheFactory
from tools import utils
app = FastAPI()
cache_client : AbstractCache = CacheFactory.create_cache(cache_type=config.CACHE_TYPE_MEMORY)
class SmsNotification(BaseModel):
platform: str
current_number: str
from_number: str
sms_content: str
timestamp: str
def extract_verification_code(message: str) -> str:
"""
Extract verification code of 6 digits from the SMS.
"""
pattern = re.compile(r'\b[0-9]{6}\b')
codes: List[str] = pattern.findall(message)
return codes[0] if codes else ""
@app.post("/")
def receive_sms_notification(sms: SmsNotification):
"""
Receive SMS notification and send it to Redis.
Args:
sms:
{
"platform": "xhs",
"from_number": "1069421xxx134",
"sms_content": "【小红书】您的验证码是: 171959 3分钟内有效。请勿向他人泄漏。如非本人操作可忽略本消息。",
"timestamp": "1686720601614",
"current_number": "13152442222"
}
Returns:
"""
utils.logger.info(f"Received SMS notification: {sms.platform}, {sms.current_number}")
sms_code = extract_verification_code(sms.sms_content)
if sms_code:
# Save the verification code in Redis and set the expiration time to 3 minutes.
key = f"{sms.platform}_{sms.current_number}"
cache_client.set(key, sms_code, expire_time=60 * 3)
return {"status": "ok"}
@app.get("/", status_code=status.HTTP_404_NOT_FOUND)
async def not_found():
raise HTTPException(status_code=404, detail="Not Found")
if __name__ == '__main__':
uvicorn.run(app, port=8000, host='0.0.0.0')
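A quick way to exercise the endpoint from another process (sketch; the payload mirrors the docstring above and every value is a placeholder):

import httpx

payload = {
    "platform": "xhs",
    "current_number": "13152442222",
    "from_number": "1069421xxx134",
    "sms_content": "【小红书】您的验证码是: 171959 3分钟内有效。请勿向他人泄漏。",
    "timestamp": "1686720601614",
}
print(httpx.post("http://127.0.0.1:8000/", json=payload).json())   # -> {"status": "ok"}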

17
requirements.txt Normal file
View file

@ -0,0 +1,17 @@
httpx==0.24.0
Pillow==9.5.0
playwright==1.42.0
tenacity==8.2.2
PyExecJS==1.5.1
opencv-python
aiomysql==0.2.0
redis~=4.6.0
pydantic==2.5.2
aiofiles~=23.2.1
fastapi==0.110.2
uvicorn==0.29.0
python-dotenv==1.0.1
jieba==0.42.1
wordcloud==1.9.3
matplotlib==3.9.0
beautifulsoup4==4.12.3

317
schema/tables.sql Normal file
View file

@ -0,0 +1,317 @@
-- ----------------------------
-- Table structure for bilibili_video
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_video`;
CREATE TABLE `bilibili_video` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`video_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(500) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`video_play_count` varchar(16) DEFAULT NULL COMMENT '视频播放数量',
`video_danmaku` varchar(16) DEFAULT NULL COMMENT '视频弹幕数量',
`video_comment` varchar(16) DEFAULT NULL COMMENT '视频评论数量',
`video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
`video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_video_i_31c36e` (`video_id`),
KEY `idx_bilibili_vi_create__73e0ec` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B站视频';
-- ----------------------------
-- Table structure for bilibili_video_comment
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_video_comment`;
CREATE TABLE `bilibili_video_comment` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_comment_41c34e` (`comment_id`),
KEY `idx_bilibili_vi_video_i_f22873` (`video_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站视频评论';
-- ----------------------------
-- Table structure for douyin_aweme
-- ----------------------------
DROP TABLE IF EXISTS `douyin_aweme`;
CREATE TABLE `douyin_aweme` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
`short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
`user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
`aweme_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(500) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`comment_count` varchar(16) DEFAULT NULL COMMENT '视频评论数',
`share_count` varchar(16) DEFAULT NULL COMMENT '视频分享数',
`collected_count` varchar(16) DEFAULT NULL COMMENT '视频收藏数',
`aweme_url` varchar(255) DEFAULT NULL COMMENT '视频详情页URL',
PRIMARY KEY (`id`),
KEY `idx_douyin_awem_aweme_i_6f7bc6` (`aweme_id`),
KEY `idx_douyin_awem_create__299dfe` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频';
-- ----------------------------
-- Table structure for douyin_aweme_comment
-- ----------------------------
DROP TABLE IF EXISTS `douyin_aweme_comment`;
CREATE TABLE `douyin_aweme_comment` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
`short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
`user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_douyin_awem_comment_fcd7e4` (`comment_id`),
KEY `idx_douyin_awem_aweme_i_c50049` (`aweme_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频评论';
-- ----------------------------
-- Table structure for dy_creator
-- ----------------------------
DROP TABLE IF EXISTS `dy_creator`;
CREATE TABLE `dy_creator` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(128) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(1) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`interaction` varchar(16) DEFAULT NULL COMMENT '获赞数',
`videos_count` varchar(16) DEFAULT NULL COMMENT '作品数',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音博主信息';
-- ----------------------------
-- Table structure for kuaishou_video
-- ----------------------------
DROP TABLE IF EXISTS `kuaishou_video`;
CREATE TABLE `kuaishou_video` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`video_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(500) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`viewd_count` varchar(16) DEFAULT NULL COMMENT '视频浏览数量',
`video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
`video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
`video_play_url` varchar(512) DEFAULT NULL COMMENT '视频播放 URL',
PRIMARY KEY (`id`),
KEY `idx_kuaishou_vi_video_i_c5c6a6` (`video_id`),
KEY `idx_kuaishou_vi_create__a10dee` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频';
-- ----------------------------
-- Table structure for kuaishou_video_comment
-- ----------------------------
DROP TABLE IF EXISTS `kuaishou_video_comment`;
CREATE TABLE `kuaishou_video_comment` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_kuaishou_vi_comment_ed48fa` (`comment_id`),
KEY `idx_kuaishou_vi_video_i_e50914` (`video_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频评论';
-- ----------------------------
-- Table structure for weibo_note
-- ----------------------------
DROP TABLE IF EXISTS `weibo_note`;
CREATE TABLE `weibo_note` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`gender` varchar(12) DEFAULT NULL COMMENT '用户性别',
`profile_url` varchar(255) DEFAULT NULL COMMENT '用户主页地址',
`ip_location` varchar(32) DEFAULT NULL COMMENT '发布微博的地理信息',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`note_id` varchar(64) NOT NULL COMMENT '帖子ID',
`content` longtext COMMENT '帖子正文内容',
`create_time` bigint NOT NULL COMMENT '帖子发布时间戳',
`create_date_time` varchar(32) NOT NULL COMMENT '帖子发布日期时间',
`liked_count` varchar(16) DEFAULT NULL COMMENT '帖子点赞数',
`comments_count` varchar(16) DEFAULT NULL COMMENT '帖子评论数量',
`shared_count` varchar(16) DEFAULT NULL COMMENT '帖子转发数量',
`note_url` varchar(512) DEFAULT NULL COMMENT '帖子详情URL',
PRIMARY KEY (`id`),
KEY `idx_weibo_note_note_id_f95b1a` (`note_id`),
KEY `idx_weibo_note_create__692709` (`create_time`),
KEY `idx_weibo_note_create__d05ed2` (`create_date_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博帖子';
-- ----------------------------
-- Table structure for weibo_note_comment
-- ----------------------------
DROP TABLE IF EXISTS `weibo_note_comment`;
CREATE TABLE `weibo_note_comment` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`gender` varchar(12) DEFAULT NULL COMMENT '用户性别',
`profile_url` varchar(255) DEFAULT NULL COMMENT '用户主页地址',
`ip_location` varchar(32) DEFAULT NULL COMMENT '发布微博的地理信息',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`note_id` varchar(64) NOT NULL COMMENT '帖子ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`create_date_time` varchar(32) NOT NULL COMMENT '评论日期时间',
`comment_like_count` varchar(16) NOT NULL COMMENT '评论点赞数量',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_weibo_note__comment_c7611c` (`comment_id`),
KEY `idx_weibo_note__note_id_24f108` (`note_id`),
KEY `idx_weibo_note__create__667fe3` (`create_date_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博帖子评论';
-- ----------------------------
-- Table structure for xhs_creator
-- ----------------------------
DROP TABLE IF EXISTS `xhs_creator`;
CREATE TABLE `xhs_creator` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(1) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`interaction` varchar(16) DEFAULT NULL COMMENT '获赞和收藏数',
`tag_list` longtext COMMENT '标签列表',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书博主';
-- ----------------------------
-- Table structure for xhs_note
-- ----------------------------
DROP TABLE IF EXISTS `xhs_note`;
CREATE TABLE `xhs_note` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`note_id` varchar(64) NOT NULL COMMENT '笔记ID',
`type` varchar(16) DEFAULT NULL COMMENT '笔记类型(normal | video)',
`title` varchar(255) DEFAULT NULL COMMENT '笔记标题',
`desc` longtext COMMENT '笔记描述',
`video_url` longtext COMMENT '视频地址',
`time` bigint NOT NULL COMMENT '笔记发布时间戳',
`last_update_time` bigint NOT NULL COMMENT '笔记最后更新时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '笔记点赞数',
`collected_count` varchar(16) DEFAULT NULL COMMENT '笔记收藏数',
`comment_count` varchar(16) DEFAULT NULL COMMENT '笔记评论数',
`share_count` varchar(16) DEFAULT NULL COMMENT '笔记分享数',
`image_list` longtext COMMENT '笔记封面图片列表',
`tag_list` longtext COMMENT '标签列表',
`note_url` varchar(255) DEFAULT NULL COMMENT '笔记详情页的URL',
PRIMARY KEY (`id`),
KEY `idx_xhs_note_note_id_209457` (`note_id`),
KEY `idx_xhs_note_time_eaa910` (`time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书笔记';
-- ----------------------------
-- Table structure for xhs_note_comment
-- ----------------------------
DROP TABLE IF EXISTS `xhs_note_comment`;
CREATE TABLE `xhs_note_comment` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`note_id` varchar(64) NOT NULL COMMENT '笔记ID',
`content` longtext NOT NULL COMMENT '评论内容',
`sub_comment_count` int NOT NULL COMMENT '子评论数量',
`pictures` varchar(512) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_xhs_note_co_comment_8e8349` (`comment_id`),
KEY `idx_xhs_note_co_create__204f8d` (`create_time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书笔记评论';
-- ----------------------------
-- alter table xhs_note_comment to support parent_comment_id
-- ----------------------------
ALTER TABLE `xhs_note_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `douyin_aweme_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `bilibili_video_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
SET FOREIGN_KEY_CHECKS = 1;

13 binary image files added, shown in this diff only as size placeholders: static/images/img.png (254 KiB), static/images/img_1.png (200 KiB), static/images/img_2.png (72 KiB), static/images/img_3.png (140 KiB), static/images/img_4.png (137 KiB), static/images/xingqiu.jpg (241 KiB), static/images/zfb_pay.png (484 KiB), and six more whose file names are not visible in this diff (171 KiB, 189 KiB, 223 KiB, 118 KiB, 96 KiB, 261 KiB).

4
store/__init__.py Normal file
View file

@ -0,0 +1,4 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 17:29
# @Desc :

View file

@ -0,0 +1,82 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 19:34
# @Desc :
from typing import List
import config
from .bilibili_store_impl import *
class BiliStoreFactory:
STORES = {
"csv": BiliCsvStoreImplement,
"db": BiliDbStoreImplement,
"json": BiliJsonStoreImplement
}
@staticmethod
def create_store() -> AbstractStore:
store_class = BiliStoreFactory.STORES.get(config.SAVE_DATA_OPTION)
if not store_class:
raise ValueError(
"[BiliStoreFactory.create_store] Invalid save option only supported csv or db or json ...")
return store_class()
async def update_bilibili_video(video_item: Dict):
video_item_view: Dict = video_item.get("View")
video_user_info: Dict = video_item_view.get("owner")
video_item_stat: Dict = video_item_view.get("stat")
video_id = str(video_item_view.get("aid"))
save_content_item = {
"video_id": video_id,
"video_type": "video",
"title": video_item_view.get("title", "")[:500],
"desc": video_item_view.get("desc", "")[:500],
"create_time": video_item_view.get("pubdate"),
"user_id": str(video_user_info.get("mid")),
"nickname": video_user_info.get("name"),
"avatar": video_user_info.get("face", ""),
"liked_count": str(video_item_stat.get("like", "")),
"video_play_count": str(video_item_stat.get("view", "")),
"video_danmaku": str(video_item_stat.get("danmaku", "")),
"video_comment": str(video_item_stat.get("reply", "")),
"last_modify_ts": utils.get_current_timestamp(),
"video_url": f"https://www.bilibili.com/video/av{video_id}",
"video_cover_url": video_item_view.get("pic", ""),
}
utils.logger.info(
f"[store.bilibili.update_bilibili_video] bilibili video id:{video_id}, title:{save_content_item.get('title')}")
await BiliStoreFactory.create_store().store_content(content_item=save_content_item)
async def batch_update_bilibili_video_comments(video_id: str, comments: List[Dict]):
if not comments:
return
for comment_item in comments:
await update_bilibili_video_comment(video_id, comment_item)
async def update_bilibili_video_comment(video_id: str, comment_item: Dict):
comment_id = str(comment_item.get("rpid"))
parent_comment_id = str(comment_item.get("parent", 0))
content: Dict = comment_item.get("content")
user_info: Dict = comment_item.get("member")
save_comment_item = {
"comment_id": comment_id,
"parent_comment_id": parent_comment_id,
"create_time": comment_item.get("ctime"),
"video_id": str(video_id),
"content": content.get("message"),
"user_id": user_info.get("mid"),
"nickname": user_info.get("uname"),
"avatar": user_info.get("avatar"),
"sub_comment_count": str(comment_item.get("rcount", 0)),
"last_modify_ts": utils.get_current_timestamp(),
}
utils.logger.info(
f"[store.bilibili.update_bilibili_video_comment] Bilibili video comment: {comment_id}, content: {save_comment_item.get('content')}")
await BiliStoreFactory.create_store().store_comment(comment_item=save_comment_item)

View file

@ -0,0 +1,206 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 19:34
# @Desc : B站存储实现类
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict, Tuple
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
Args:
file_store_path;
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0])for file_name in os.listdir(file_store_path)])+1
except ValueError:
return 1
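# Worked example (sketch): if data/bilibili already contains "1_search_contents_2024-01-14.csv"
# and "2_search_comments_2024-01-14.csv", the largest leading number is 2, so this returns 3 and
# the next run writes files prefixed "3_"; a missing or empty directory starts back at 1.
#
#     calculate_number_of_files("data/bilibili")   # -> 3 in the scenario above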
class BiliCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/bilibili"
file_count:int=calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/bilibili/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: Save type contains content and comments (contents | comments)
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Bilibili content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Bilibili comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
class BiliDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Bilibili content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .bilibili_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Bilibili content DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .bilibili_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
class BiliJsonStoreImplement(AbstractStore):
json_store_path: str = "data/bilibili/json"
words_store_path: str = "data/bilibili/words"
lock = asyncio.Lock()
file_count:int=calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> Tuple[str, str]:
"""
make save file name by store type
Args:
store_type: Save type contains content and comments (contents | comments)
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: Save type contains content and comments (contents | comments)
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
pass  # word-cloud generation is best-effort; ignore failures here
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")

Some files were not shown because too many files have changed in this diff.