first commit

2024-07-15 16:33:05 +08:00 · 2024-07-15 16:33:05 +08:00 · 76bd37dd11
commit 76bd37dd11
128 changed files with 11672 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@ -0,0 +1,3 @@
 *.js linguist-language=python
 *.css linguist-language=python
 *.html linguist-language=python
--- a/.github/workflows/main.yaml
+++ b/.github/workflows/main.yaml
@ -0,0 +1,17 @@
 on:
    push:
        branches:
            - main
 jobs:
    contrib-readme-job:
        runs-on: ubuntu-latest
        name: A job to automate contrib in readme
        permissions:
          contents: write
          pull-requests: write
        steps:
            - name: Contribute List
              uses: akhilmhdh/contributors-readme-action@v2.3.10
              env:
                  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,175 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
 *$py.class
 # C extensions
 *.so
 # Distribution / packaging
 .Python
 build/
 develop-eggs/
 dist/
 downloads/
 eggs/
 .eggs/
 lib/
 lib64/
 parts/
 sdist/
 var/
 wheels/
 share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 # PyInstaller
 #  Usually these files are written by a python script from a template
 #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 *.manifest
 *.spec
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
 .nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
 *.py,cover
 .hypothesis/
 .pytest_cache/
 cover/
 # Translations
 *.mo
 *.pot
 # Django stuff:
 *.log
 local_settings.py
 db.sqlite3
 db.sqlite3-journal
 # Flask stuff:
 instance/
 .webassets-cache
 # Scrapy stuff:
 .scrapy
 # Sphinx documentation
 docs/_build/
 # PyBuilder
 .pybuilder/
 target/
 # Jupyter Notebook
 .ipynb_checkpoints
 # IPython
 profile_default/
 ipython_config.py
 # pyenv
 #   For a library or package, you might want to ignore these files since the code is
 #   intended to run in multiple environments; otherwise, check them in:
 # .python-version
 # pipenv
 #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 #   install all needed dependencies.
 #Pipfile.lock
 # poetry
 #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
 #poetry.lock
 # pdm
 #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
 #pdm.lock
 #   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
 #   in version control.
 #   https://pdm.fming.dev/#use-with-ide
 .pdm.toml
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
 # Celery stuff
 celerybeat-schedule
 celerybeat.pid
 # SageMath parsed files
 *.sage.py
 # Environments
 .env
 .venv
 env/
 venv/
 ENV/
 env.bak/
 venv.bak/
 # Spyder project settings
 .spyderproject
 .spyproject
 # Rope project settings
 .ropeproject
 # mkdocs documentation
 /site
 # mypy
 .mypy_cache/
 .dmypy.json
 dmypy.json
 # Pyre type checker
 .pyre/
 # pytype static type analyzer
 .pytype/
 # Cython debug symbols
 cython_debug/
 # PyCharm
 #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
 #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
 *.xml
 *.iml
 .idea
 /temp_image/
 /browser_data/
 /data/
 */.DS_Store
 .vscode
 #New add
 test_parse.py
 test_soup.py
 test.htmlcov
--- a/28
+++ b/28
@ -0,0 +1,28 @@
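 Non-Commercial Use License 1.0
 Copyright (c) [2024] [relakkes@gmail.com]
 Whereas:
 1. The copyright holder owns and controls the copyright in this software and its associated documentation files (hereinafter the "Software");
 2. The user wishes to use the Software;
 3. The copyright holder is willing to license the Software to the user under the conditions set out in this license;
 Now, therefore, both parties, in accordance with applicable laws and regulations, agree to the following terms:
 Scope of license:
 1. The copyright holder hereby grants, free of charge, to any natural or legal person who accepts this license (hereinafter the "User") a non-exclusive, non-transferable right to use, copy, modify, and merge the Software for non-commercial purposes, subject to the following conditions.
 Conditions:
 1. The User must include the above copyright notice and this license notice in all reasonably prominent places in the Software and its copies.
 2. The Software may not be used for any commercial purpose, including but not limited to sale, profit-making, or commercial competition.
 3. The Software may not be used for any commercial purpose without the written consent of the copyright holder.
 Disclaimer:
 1. The Software is provided "as is", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, and non-infringement.
 2. In no event shall the copyright holder be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including but not limited to procurement of substitute goods or services; loss of use, data, or profits; or business interruption) arising from the use of the Software or in any way connected with it, however caused and whether arising in contract, strict liability, or tort (including negligence or otherwise), even if advised of the possibility of such damages.
 Governing law:
 1. The interpretation and enforcement of this license shall follow local laws and regulations.
 2. Any dispute arising from or relating to this license shall be resolved by the parties through friendly negotiation; if negotiation fails, either party may bring the dispute before the people's court at the copyright holder's place of residence.
 This license constitutes the entire agreement between the parties concerning the Software and supersedes and merges all prior discussions, communications, and agreements, whether oral or written.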
--- a/README.md
+++ b/README.md
@ -0,0 +1,405 @@
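 > **Disclaimer:**
 > 
 > Please use this repository for learning purposes only. Cases of illegal or non-compliant crawling: https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China  <br>
 >
 >All content in this repository is for learning and reference only, and commercial use is prohibited. No person or organization may use the content of this repository for illegal purposes or to infringe on the legitimate rights and interests of others. The crawler techniques covered in this repository are for learning and research only and must not be used for large-scale crawling of other platforms or for any other illegal activity. This repository accepts no liability for any legal responsibility arising from the use of its content. By using the content of this repository, you agree to all terms and conditions of this disclaimer.
 > For the more detailed disclaimer, [click here](#disclaimer).
 # Repository description
 **Xiaohongshu crawler**, **Douyin crawler**, **Kuaishou crawler**, **Bilibili crawler**, **Weibo crawler**...  
 It can currently crawl videos, images, comments, likes, reposts, and other information from Xiaohongshu, Douyin, Kuaishou, Bilibili, and Weibo.
 How it works: [playwright](https://playwright.dev/) serves as a bridge. The browser context is preserved after a successful login, and some encrypted parameters are obtained by evaluating JS expressions in that context.
 This approach avoids having to reimplement the core encryption JS code, which greatly reduces the reverse-engineering difficulty.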
 ## Features
 | Platform | Keyword search | Crawl by post ID | Second-level comments | Creator homepage | Login state cache | IP proxy pool | Comment word cloud |
 |-----|-------|----------|-----|--------|-------|-------|-------|
 | Xiaohongshu | ✅     | ✅        | ✅   | ✅      | ✅     | ✅     | ✅    |
 | Douyin  | ✅     | ✅        | ✅    | ✅       | ✅     | ✅     | ✅    |
 | Kuaishou  | ✅     | ✅        | ✅   | ✅      | ✅     | ✅     | ✅    |
 | Bilibili | ✅     | ✅        | ✅   | ✅      | ✅     | ✅     | ✅    |
 | Weibo  | ✅     | ✅        | ❌   | ❌      | ✅     | ✅     | ✅    |
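 ## Usage
 ### Create and activate a Python virtual environment
   ```shell   
   # Enter the project root directory
   cd MediaCrawler
   # Create the virtual environment
   # Note: Python 3.7 - 3.9 is required; newer versions may run into dependency compatibility issues
   python -m venv venv
   # Activate the virtual environment on macOS & Linux
   source venv/bin/activate
   # Activate the virtual environment on Windows
   venv\Scripts\activate
   ```
 ### Install dependencies
   ```shell
   pip install -r requirements.txt
   ```
 ### Install the playwright browser drivers
   ```shell
   playwright install
   ```
 ### Run the crawler
   ```shell
   ### Comment crawling is disabled by default; to enable it, change the ENABLE_GET_COMMENTS variable in config/base_config.py
   ### Other supported options are also listed in config/base_config.py, with Chinese comments
   # Read keywords from the config file, search for related posts, and crawl post information and comments
   python main.py --platform xhs --lt qrcode --type search
   # Read the list of post IDs from the config file and crawl the information and comments of those posts
   python main.py --platform xhs --lt qrcode --type detail
   # Open the corresponding app and scan the QR code to log in
   # For usage examples on other platforms, run the following command
   python main.py --help    
   ```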
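 ### Data storage
 - Supports saving to relational databases (MySQL, PgSQL, etc.)
    - Run `python db.py` to initialize the database table structure (first run only)
 - Supports saving to CSV (under the data/ directory)
 - Supports saving to JSON (under the data/ directory)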
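 ## Developer services
 - Zhishi Xingqiu (Knowledge Planet): a paid community that collects high-quality FAQs, best-practice documents, and years of programming and crawler experience. Ask questions there and the author answers them regularly (a subscription costs about 1 yuan per day).
  <p>
  <img alt="xingqiu" src="https://nm.zizhi1.com/static/img/8e1312d1f52f2e0ff436ea7196b4e27b.15555424244122T1.webp" style="width: auto;height: 400px" >
  </p>
  Featured articles: 
  - [(Original) Obtaining the a_bogus parameter of 某音 with Playwright, including an analysis of the encrypted parameters](https://articles.zsxq.com/id_u89al50jk9x0.html)
  - [(Original) A low-cost way to obtain the X-s parameter of 某书 with Playwright (a look back)](https://articles.zsxq.com/id_u4lcrvqakuc7.html)
  - [MediaCrawler: refactoring the project cache around abstract classes](https://articles.zsxq.com/id_4ju73oxewt9j.html)
  - [Building your own IP proxy pool, step by step](https://articles.zsxq.com/id_38fza371ladm.html) 
 - MediaCrawler video course:
  > If you want to get started with this project quickly, or want to understand how it is implemented, I recommend this video course. It starts from the design and walks you through usage step by step, which lowers the barrier considerably. It is also a way to support my open-source work, and I would be very happy if you supported the course.<br>
  > The course is very cheap, about the price of a few cups of coffee.<br>
  > Course introduction (Feishu document): https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh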
 ## Thanks to the following sponsors of this repository
 - Thanks to [JetBrains](https://www.jetbrains.com/?from=gaowei-space/markdown-blog) for supporting this project!
 <a href="https://www.jetbrains.com/?from=NanmiCoder/MediaCrawler" target="_blank">
    <img src="https://resources.jetbrains.com/storage/products/company/brand/logos/jb_beam.png" width="100" height="100">
 </a>
 <br>
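 - <a href="https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk">Sign up for this free GPT assistant to help me earn GPT-4 credits as a form of support. It is also the Chrome AI assistant extension I use every day.</a>
 To become a sponsor and showcase your product here, contact the author: relakkes@gmail.com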
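 ## MediaCrawler project discussion group:
 > Scan my personal WeChat QR code below with the note "github" and you will be added to the MediaCrawler discussion group (be sure to include the note "github"; a WeChat assistant bot adds people to the group automatically)
 > 
 > If the image does not load, add my WeChat ID directly: yzglan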
 <div style="max-width: 200px">  
 <p><img alt="relakkes_wechat" src="static/images/relakkes_weichat.JPG" style="width: 200px;height: 100%" ></p>
 </div>
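 ## Common runtime errors Q&A
 > If you run into a problem, try searching for a solution yourself first. AI tools are popular now, and ChatGPT can solve most problems [Free ChatGPT](https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk)  
 ➡️➡️➡️ [FAQ](docs/常见问题.md)
 Logging in to dy and xhs through Playwright now triggers slider verification plus SMS verification; complete them manually.
 ## Project code structure
 ➡️➡️➡️ [Project code structure](docs/项目代码结构.md)
 ## Proxy IP usage
 ➡️➡️➡️ [Proxy IP usage](docs/代理使用.md)
 ## Word cloud operations
 ➡️➡️➡️ [Word cloud notes](docs/关于词云图相关操作.md)
 ## Phone-number login
 ➡️➡️➡️ [Phone-number login notes](docs/手机号登录说明.md)
 ## Donations
 Keeping a project free and open source is not easy. If it has helped you, feel free to tip me; your support is my greatest motivation!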
 <div style="display: flex;justify-content: space-between;width: 100%">
    <p><img alt="Donate-WeChat" src="static/images/wechat_pay.jpeg" style="width: 200px;height: 100%" ></p>
    <p><img alt="Donate-Alipay" src="static/images/zfb_pay.png"   style="width: 200px;height: 100%" ></p>
 </div>
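 ## Introductory crawler course
 My new crawler tutorial GitHub repository, [CrawlerTutorial](https://github.com/NanmiCoder/CrawlerTutorial), is continuously updated and completely free. Follow it if you are interested.
 ## Contributors
 > Thank you for your contributions, which make the project better! (Frequent contributors can add me on WeChat; I will add you to my Knowledge Planet for free, and there will be other perks later.)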
 <!-- readme: contributors -start -->
 <table>
 	<tbody>
 		<tr>
            <td align="center">
                <a href="https://github.com/NanmiCoder">
                    <img src="https://avatars.githubusercontent.com/u/47178017?v=4" width="100;" alt="NanmiCoder"/>
                    <br />
                    <sub><b>程序员阿江-Relakkes</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/leantli">
                    <img src="https://avatars.githubusercontent.com/u/117699758?v=4" width="100;" alt="leantli"/>
                    <br />
                    <sub><b>leantli</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Rosyrain">
                    <img src="https://avatars.githubusercontent.com/u/116946548?v=4" width="100;" alt="Rosyrain"/>
                    <br />
                    <sub><b>Rosyrain</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/BaoZhuhan">
                    <img src="https://avatars.githubusercontent.com/u/140676370?v=4" width="100;" alt="BaoZhuhan"/>
                    <br />
                    <sub><b>Bao Zhuhan</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/nelzomal">
                    <img src="https://avatars.githubusercontent.com/u/8512926?v=4" width="100;" alt="nelzomal"/>
                    <br />
                    <sub><b>zhounan</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Hiro-Lin">
                    <img src="https://avatars.githubusercontent.com/u/40111864?v=4" width="100;" alt="Hiro-Lin"/>
                    <br />
                    <sub><b>HIRO</b></sub>
                </a>
            </td>
 		</tr>
 		<tr>
            <td align="center">
                <a href="https://github.com/PeanutSplash">
                    <img src="https://avatars.githubusercontent.com/u/98582625?v=4" width="100;" alt="PeanutSplash"/>
                    <br />
                    <sub><b>PeanutSplash</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Ermeng98">
                    <img src="https://avatars.githubusercontent.com/u/55784769?v=4" width="100;" alt="Ermeng98"/>
                    <br />
                    <sub><b>Ermeng</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/henryhyn">
                    <img src="https://avatars.githubusercontent.com/u/5162443?v=4" width="100;" alt="henryhyn"/>
                    <br />
                    <sub><b>Henry He</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Akiqqqqqqq">
                    <img src="https://avatars.githubusercontent.com/u/51102894?v=4" width="100;" alt="Akiqqqqqqq"/>
                    <br />
                    <sub><b>leonardoqiuyu</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/jayeeliu">
                    <img src="https://avatars.githubusercontent.com/u/77389?v=4" width="100;" alt="jayeeliu"/>
                    <br />
                    <sub><b>jayeeliu</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/ZuWard">
                    <img src="https://avatars.githubusercontent.com/u/38209256?v=4" width="100;" alt="ZuWard"/>
                    <br />
                    <sub><b>ZuWard</b></sub>
                </a>
            </td>
 		</tr>
 		<tr>
            <td align="center">
                <a href="https://github.com/Zzendrix">
                    <img src="https://avatars.githubusercontent.com/u/154900254?v=4" width="100;" alt="Zzendrix"/>
                    <br />
                    <sub><b>Zendrix</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/chunpat">
                    <img src="https://avatars.githubusercontent.com/u/19848304?v=4" width="100;" alt="chunpat"/>
                    <br />
                    <sub><b>zhangzhenpeng</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/tanpenggood">
                    <img src="https://avatars.githubusercontent.com/u/37927946?v=4" width="100;" alt="tanpenggood"/>
                    <br />
                    <sub><b>Sam Tan</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/xbsheng">
                    <img src="https://avatars.githubusercontent.com/u/56357338?v=4" width="100;" alt="xbsheng"/>
                    <br />
                    <sub><b>xbsheng</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/yangrq1018">
                    <img src="https://avatars.githubusercontent.com/u/25074163?v=4" width="100;" alt="yangrq1018"/>
                    <br />
                    <sub><b>Martin</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/zhihuiio">
                    <img src="https://avatars.githubusercontent.com/u/165655688?v=4" width="100;" alt="zhihuiio"/>
                    <br />
                    <sub><b>zhihuiio</b></sub>
                </a>
            </td>
 		</tr>
 		<tr>
            <td align="center">
                <a href="https://github.com/renaissancezyc">
                    <img src="https://avatars.githubusercontent.com/u/118403818?v=4" width="100;" alt="renaissancezyc"/>
                    <br />
                    <sub><b>Ren</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Tianci-King">
                    <img src="https://avatars.githubusercontent.com/u/109196852?v=4" width="100;" alt="Tianci-King"/>
                    <br />
                    <sub><b>Wang Tianci</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Styunlen">
                    <img src="https://avatars.githubusercontent.com/u/30810222?v=4" width="100;" alt="Styunlen"/>
                    <br />
                    <sub><b>Styunlen</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Schofi">
                    <img src="https://avatars.githubusercontent.com/u/33537727?v=4" width="100;" alt="Schofi"/>
                    <br />
                    <sub><b>Schofi</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/Klu5ure">
                    <img src="https://avatars.githubusercontent.com/u/166240879?v=4" width="100;" alt="Klu5ure"/>
                    <br />
                    <sub><b>Klu5ure</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/keeper-jie">
                    <img src="https://avatars.githubusercontent.com/u/33612777?v=4" width="100;" alt="keeper-jie"/>
                    <br />
                    <sub><b>Kermit</b></sub>
                </a>
            </td>
 		</tr>
 		<tr>
            <td align="center">
                <a href="https://github.com/kexinoh">
                    <img src="https://avatars.githubusercontent.com/u/91727108?v=4" width="100;" alt="kexinoh"/>
                    <br />
                    <sub><b>KEXNA</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/aa65535">
                    <img src="https://avatars.githubusercontent.com/u/5417786?v=4" width="100;" alt="aa65535"/>
                    <br />
                    <sub><b>Jian Chang</b></sub>
                </a>
            </td>
            <td align="center">
                <a href="https://github.com/522109452">
                    <img src="https://avatars.githubusercontent.com/u/16929874?v=4" width="100;" alt="522109452"/>
                    <br />
                    <sub><b>tianqing</b></sub>
                </a>
            </td>
 		</tr>
 	</tbody>
 </table>
 <!-- readme: contributors -end -->
 ## Star history
 - If this project helps you, please give it a star ❤️❤️❤️
 [![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
 ## References
 - xhs client: [ReaJason's xhs repository](https://github.com/ReaJason/xhs)
 - SMS forwarding: [reference repository](https://github.com/pppscn/SmsForwarder)
 - Intranet tunneling tool: [ngrok](https://ngrok.com/docs/)
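 ## Disclaimer
 <div id="disclaimer"> 
 ### 1. Purpose and nature of the project
 This project (hereinafter "the Project") was created as a tool for technical research and learning, intended to explore and study web data collection techniques. It focuses on research into crawling data from social media platforms and is provided to learners and researchers for technical exchange.
 ### 2. Legal compliance statement
 The developer of the Project (hereinafter "the Developer") solemnly reminds users to strictly comply with the relevant laws and regulations of the People's Republic of China when downloading, installing, and using the Project, including but not limited to the Cybersecurity Law of the People's Republic of China, the Counter-Espionage Law of the People's Republic of China, and all other applicable national laws and policies. Users bear sole responsibility for any legal liability that may arise from using the Project.
 ### 3. Restrictions on use
 The Project must not be used for any illegal purpose or for any commercial activity other than learning and research. It must not be used for any form of unauthorized intrusion into other people's computer systems, nor for any act that infringes on others' intellectual property or other legitimate rights and interests. Users must ensure that they use the Project purely for personal learning and technical research, and not for any form of illegal activity.
 ### 4. Disclaimer of liability
 The Developer has made every effort to ensure the legitimacy and safety of the Project but accepts no liability for any direct or indirect loss that may result from its use, including but not limited to data loss, equipment damage, or legal proceedings.
 ### 5. Intellectual property statement
 Intellectual property in the Project belongs to the Developer. The Project is protected by copyright law, international copyright treaties, and other intellectual property laws and treaties. Users may download and use the Project provided they comply with this statement and the relevant laws and regulations.
 ### 6. Right of final interpretation
 The right of final interpretation of the Project belongs to the Developer. The Developer reserves the right to change or update this disclaimer at any time without prior notice.
 </div>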
--- a/README.txt
+++ b/README.txt
@ -0,0 +1,4 @@
 Core Xiaohongshu functionality: media_platform/xhs/core.py
 Added support for crawling recommendations:
 python main.py --platform xhs --lt qrcode --type explore
 The implementation is in the get_explore function in core.py
--- a/async_db.py
+++ b/async_db.py
@ -0,0 +1,96 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/4/6 14:21
 # @Desc    : Async aiomysql wrapper for create/read/update/delete operations
 from typing import Any, Dict, List, Union
 import aiomysql
 class AsyncMysqlDB:
    def __init__(self, pool: aiomysql.Pool) -> None:
        self.__pool = pool
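    async def query(self, sql: str, *args: Union[str, int]) -> List[Dict[str, Any]]:
        """
        Run the given SQL query and return the matching records as a list
        :param sql: the SQL query to run
        :param args: dynamic parameters passed to the SQL statement
        :return: list of row dicts (empty if nothing matches)
        """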
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, args)
                data = await cur.fetchall()
                return data or []
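    async def get_first(self, sql: str, *args: Union[str, int]) -> Union[Dict[str, Any], None]:
        """
        Run the given SQL query and return the first matching record
        :param sql: the SQL query to run
        :param args: dynamic parameters passed to the SQL statement
        :return: the first matching row as a dict, or None
        """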
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, args)
                data = await cur.fetchone()
                return data
    async def item_to_table(self, table_name: str, item: Dict[str, Any]) -> int:
        """
        Insert one record into a table
        :param table_name: table name
        :param item: dict holding the fields of one record
        :return: id of the inserted row
        """
        fields = list(item.keys())
        values = list(item.values())
        fields = [f'`{field}`' for field in fields]
        fieldstr = ','.join(fields)
        valstr = ','.join(['%s'] * len(item))
        sql = "INSERT INTO %s (%s) VALUES(%s)" % (table_name, fieldstr, valstr)
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, values)
                lastrowid = cur.lastrowid
                return lastrowid
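    async def update_table(self, table_name: str, updates: Dict[str, Any], field_where: str,
                           value_where: Union[str, int, float]) -> int:
        """
        Update records in the given table
        :param table_name: table name
        :param updates: key-value mapping of the fields to update and their new values
        :param field_where: field name used in the WHERE clause of the UPDATE statement
        :param value_where: field value used in the WHERE clause of the UPDATE statement
        :return: number of affected rows
        """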
        upsets = []
        values = []
        for k, v in updates.items():
            s = '`%s`=%%s' % k
            upsets.append(s)
            values.append(v)
        upsets = ','.join(upsets)
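        # Bind the WHERE value as a query parameter instead of formatting it into the SQL string
        sql = 'UPDATE %s SET %s WHERE %s=%%s' % (
            table_name,
            upsets,
            field_where,
        )
        values.append(value_where)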
        async with self.__pool.acquire() as conn:
            async with conn.cursor() as cur:
                rows = await cur.execute(sql, values)
                return rows
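    async def execute(self, sql: str, *args: Union[str, int]) -> int:
        """
        Execute a statement that writes or updates data
        :param sql: the SQL statement to execute
        :param args: dynamic parameters passed to the SQL statement
        :return: number of affected rows
        """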
        async with self.__pool.acquire() as conn:
            async with conn.cursor() as cur:
                rows = await cur.execute(sql, args)
                return rows
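As a rough illustration of how `item_to_table` assembles its statement, the helper below builds the same kind of parameterized INSERT from a dict without touching a database. `build_insert_sql` and the table and column names are hypothetical and used only for this sketch; they are not part of the repository.

```python
from typing import Any, Dict, List, Tuple


def build_insert_sql(table_name: str, item: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Return an INSERT statement with %s placeholders plus the values to bind."""
    # Backtick-quote each column name, as item_to_table does.
    fields = ",".join(f"`{field}`" for field in item.keys())
    # One %s placeholder per value; the driver fills them in at execute time.
    placeholders = ",".join(["%s"] * len(item))
    sql = "INSERT INTO %s (%s) VALUES(%s)" % (table_name, fields, placeholders)
    return sql, list(item.values())


sql, values = build_insert_sql("demo_table", {"note_id": "abc", "liked_count": 3})
print(sql)     # INSERT INTO demo_table (`note_id`,`liked_count`) VALUES(%s,%s)
print(values)  # ['abc', 3]
```

Only the values travel as bound parameters. The table and column names are still formatted into the string, so they should come from trusted code rather than user input.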
--- a/base/init.py
+++ b/base/init.py
--- a/base/base_crawler.py
+++ b/base/base_crawler.py
@ -0,0 +1,71 @@
 from abc import ABC, abstractmethod
 from typing import Dict, Optional
 from playwright.async_api import BrowserContext, BrowserType
 class AbstractCrawler(ABC):
    @abstractmethod
    async def start(self):
        pass
    @abstractmethod
    async def search(self):
        pass
    @abstractmethod
    async def launch_browser(self, chromium: BrowserType, playwright_proxy: Optional[Dict], user_agent: Optional[str],
                             headless: bool = True) -> BrowserContext:
        pass
 class AbstractLogin(ABC):
    @abstractmethod
    async def begin(self):
        pass
    @abstractmethod
    async def login_by_qrcode(self):
        pass
    @abstractmethod
    async def login_by_mobile(self):
        pass
    @abstractmethod
    async def login_by_cookies(self):
        pass
 class AbstractStore(ABC):
    @abstractmethod
    async def store_content(self, content_item: Dict):
        pass
    @abstractmethod
    async def store_comment(self, comment_item: Dict):
        pass
    # TODO support all platform
    # only xhs is supported, so @abstractmethod is commented
    # @abstractmethod
    async def store_creator(self, creator: Dict):
        pass
 class AbstractStoreImage(ABC):
    # TODO: support all platform
    # only weibo is supported
    # @abstractmethod
    async def store_image(self, image_content_item: Dict):
        pass
 class AbstractApiClient(ABC):
    @abstractmethod
    async def request(self, method, url, **kwargs):
        pass
    @abstractmethod
    async def update_cookies(self, browser_context: BrowserContext):
        pass
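A minimal sketch of how a concrete store might plug into the `AbstractStore` interface. The interface is re-declared in trimmed form so the example runs on its own, and `InMemoryStore` is a hypothetical class invented for illustration, not one of the project's stores.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict, List


class AbstractStore(ABC):
    """Trimmed copy of the interface so the sketch has no project imports."""

    @abstractmethod
    async def store_content(self, content_item: Dict):
        pass

    @abstractmethod
    async def store_comment(self, comment_item: Dict):
        pass


class InMemoryStore(AbstractStore):
    """Hypothetical store that keeps items in lists instead of a file or database."""

    def __init__(self) -> None:
        self.contents: List[Dict] = []
        self.comments: List[Dict] = []

    async def store_content(self, content_item: Dict):
        self.contents.append(content_item)

    async def store_comment(self, comment_item: Dict):
        self.comments.append(comment_item)


async def demo() -> InMemoryStore:
    store = InMemoryStore()
    await store.store_content({"note_id": "1", "title": "hello"})
    await store.store_comment({"note_id": "1", "content": "nice"})
    return store


store = asyncio.run(demo())
print(len(store.contents), len(store.comments))  # 1 1
```

Because both methods are marked `@abstractmethod`, instantiating `AbstractStore` directly raises `TypeError`, which is what forces each platform store to supply its own implementation.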
--- a/cache/init.py
+++ b/cache/init.py
--- a/cache/abs_cache.py
+++ b/cache/abs_cache.py
@ -0,0 +1,42 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Name    : 程序员阿江-Relakkes
 # @Time    : 2024/6/2 11:06
 # @Desc    : Abstract cache class
 from abc import ABC, abstractmethod
 from typing import Any, List, Optional
 class AbstractCache(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[Any]:
        """
        Get the value of a key from the cache.
        This is an abstract method; subclasses must implement it.
        :param key: key
        :return:
        """
        raise NotImplementedError
    @abstractmethod
    def set(self, key: str, value: Any, expire_time: int) -> None:
        """
        Set the value of a key in the cache.
        This is an abstract method; subclasses must implement it.
        :param key: key
        :param value: value
        :param expire_time: expiration time
        :return:
        """
        raise NotImplementedError
    @abstractmethod
    def keys(self, pattern: str) -> List[str]:
        """
        Get all keys that match the pattern
        :param pattern: match pattern
        :return:
        """
        raise NotImplementedError
--- a/cache/cache_factory.py
+++ b/cache/cache_factory.py
@ -0,0 +1,29 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Name    : 程序员阿江-Relakkes
 # @Time    : 2024/6/2 11:23
 # @Desc    :
 class CacheFactory:
    """
    Cache factory class
    """
    @staticmethod
    def create_cache(cache_type: str, *args, **kwargs):
        """
        Create a cache object
        :param cache_type: cache type ('memory' or 'redis')
        :param args: positional arguments
        :param kwargs: keyword arguments
        :return:
        """
        if cache_type == 'memory':
            from .local_cache import ExpiringLocalCache
            return ExpiringLocalCache(*args, **kwargs)
        elif cache_type == 'redis':
            from .redis_cache import RedisCache
            return RedisCache()
        else:
            raise ValueError(f'Unknown cache type: {cache_type}')
--- a/cache/local_cache.py
+++ b/cache/local_cache.py
@ -0,0 +1,120 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Name    : 程序员阿江-Relakkes
 # @Time    : 2024/6/2 11:05
 # @Desc    : Local cache
 import asyncio
 import time
 from typing import Any, Dict, List, Optional, Tuple
 from cache.abs_cache import AbstractCache
 class ExpiringLocalCache(AbstractCache):
    def __init__(self, cron_interval: int = 10):
        """
        Initialize the local cache
        :param cron_interval: interval between periodic cache cleanups
        :return:
        """
        self._cron_interval = cron_interval
        self._cache_container: Dict[str, Tuple[Any, float]] = {}
        self._cron_task: Optional[asyncio.Task] = None
        # Start the periodic cleanup task
        self._schedule_clear()
    def __del__(self):
        """
        Destructor; cancels the periodic cleanup task
        :return:
        """
        if self._cron_task is not None:
            self._cron_task.cancel()
    def get(self, key: str) -> Optional[Any]:
        """
        Get the value of a key from the cache
        :param key:
        :return:
        """
        value, expire_time = self._cache_container.get(key, (None, 0))
        if value is None:
            return None
        # If the key has expired, delete it and return None
        if expire_time < time.time():
            del self._cache_container[key]
            return None
        return value
    def set(self, key: str, value: Any, expire_time: int) -> None:
        """
        Set the value of a key in the cache
        :param key:
        :param value:
        :param expire_time: time to live, in seconds
        :return:
        """
        self._cache_container[key] = (value, time.time() + expire_time)
    def keys(self, pattern: str) -> List[str]:
        """
        Get all keys that match the pattern
        :param pattern: match pattern
        :return:
        """
        if pattern == '*':
            return list(self._cache_container.keys())
        # For now the local cache handles wildcards by stripping '*' and doing a substring match
        if '*' in pattern:
            pattern = pattern.replace('*', '')
        return [key for key in self._cache_container.keys() if pattern in key]
    def _schedule_clear(self):
        """
        Start the periodic cleanup task
        :return:
        """
        try:
            loop = asyncio.get_event_loop()
        except RuntimeError:
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
        self._cron_task = loop.create_task(self._start_clear_cron())
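    def _clear(self):
        """
        Remove expired entries from the cache
        :return:
        """
        # Iterate over a snapshot; deleting from the dict while iterating it raises RuntimeError
        for key, (value, expire_time) in list(self._cache_container.items()):
            if expire_time < time.time():
                del self._cache_container[key]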
    async def _start_clear_cron(self):
        """
        Run the periodic cleanup loop
        :return:
        """
        while True:
            self._clear()
            await asyncio.sleep(self._cron_interval)
 if __name__ == '__main__':
    cache = ExpiringLocalCache(cron_interval=2)
    cache.set('name', '程序员阿江-Relakkes', 3)
    print(cache.get('name'))
    print(cache.keys("*"))
    time.sleep(4)
    print(cache.get('name'))
    del cache
    time.sleep(1)
    print("done")
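The expiry logic of the local cache can be sketched in isolation. The class below is a hypothetical, synchronous stand-in (`TinyExpiringCache` and `clear_expired` are names made up for this example) that leaves out the asyncio cleanup task and shows only the two eviction paths: lazy eviction on read and a sweep over all entries.

```python
import time
from typing import Any, Dict, Optional, Tuple


class TinyExpiringCache:
    """Hypothetical, synchronous stand-in for ExpiringLocalCache (no asyncio task)."""

    def __init__(self) -> None:
        # key -> (value, absolute expiry timestamp)
        self._container: Dict[str, Tuple[Any, float]] = {}

    def set(self, key: str, value: Any, expire_time: float) -> None:
        self._container[key] = (value, time.time() + expire_time)

    def get(self, key: str) -> Optional[Any]:
        entry = self._container.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at < time.time():
            # Lazy eviction: drop the entry the first time it is read after expiry.
            del self._container[key]
            return None
        return value

    def clear_expired(self) -> int:
        now = time.time()
        # Collect keys first; deleting while iterating over the dict raises RuntimeError.
        expired = [k for k, (_, expires_at) in self._container.items() if expires_at < now]
        for k in expired:
            del self._container[k]
        return len(expired)


cache = TinyExpiringCache()
cache.set("short", "gone soon", 0.05)
cache.set("long", "still here", 60)
time.sleep(0.1)
removed = cache.clear_expired()
print(removed, cache.get("short"), cache.get("long"))  # 1 None still here
```

Storing an absolute expiry timestamp rather than a remaining time-to-live means each check is a single comparison against `time.time()`.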
--- a/cache/redis_cache.py
+++ b/cache/redis_cache.py
@ -0,0 +1,76 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Name    : 程序员阿江-Relakkes
 # @Time    : 2024/5/29 22:57
 # @Desc    : RedisCache implementation
 import pickle
 import time
 from typing import Any, List
 from redis import Redis
 from cache.abs_cache import AbstractCache
 from config import db_config
 class RedisCache(AbstractCache):
    def __init__(self) -> None:
        # Connect to redis and keep the client
        self._redis_client = self._connect_redis()
    @staticmethod
    def _connect_redis() -> Redis:
        """
        Connect to redis and return the client; adjust the connection settings here as needed
        :return:
        """
        return Redis(
            host=db_config.REDIS_DB_HOST,
            port=db_config.REDIS_DB_PORT,
            db=db_config.REDIS_DB_NUM,
            password=db_config.REDIS_DB_PWD,
        )
    def get(self, key: str) -> Any:
        """
        Get the value of a key from the cache and deserialize it
        :param key:
        :return:
        """
        value = self._redis_client.get(key)
        if value is None:
            return None
        return pickle.loads(value)
    def set(self, key: str, value: Any, expire_time: int) -> None:
        """
        Serialize the value and set it for the key in the cache
        :param key:
        :param value:
        :param expire_time: time to live, in seconds
        :return:
        """
        self._redis_client.set(key, pickle.dumps(value), ex=expire_time)
    def keys(self, pattern: str) -> List[str]:
        """
        Get all keys that match the pattern
        """
        return [key.decode() for key in self._redis_client.keys(pattern)]
 if __name__ == '__main__':
    redis_cache = RedisCache()
    # basic usage
    redis_cache.set("name", "程序员阿江-Relakkes", 1)
    print(redis_cache.get("name"))  # 程序员阿江-Relakkes
    print(redis_cache.keys("*"))  # ['name']
    time.sleep(2)
    print(redis_cache.get("name"))  # None
    # special python type usage
    # list
    redis_cache.set("list", [1, 2, 3], 10)
    _value = redis_cache.get("list")
    print(_value, f"value type:{type(_value)}")  # [1, 2, 3]
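`RedisCache` stores values as pickled bytes, which is why a list comes back as a list rather than a string. The round trip can be sketched without a Redis server; the sample dict below is made up for illustration.

```python
import pickle

original = {"ids": [1, 2, 3], "name": "Relakkes", "ratio": 0.5}

# What RedisCache.set hands to Redis: an opaque bytes payload.
payload = pickle.dumps(original)

# What RedisCache.get does with the bytes it reads back.
restored = pickle.loads(payload)

print(type(payload).__name__)          # bytes
print(restored == original)            # True
print(type(restored["ids"]).__name__)  # list
```

One design consequence worth keeping in mind: `pickle.loads` can execute arbitrary code while deserializing, so this scheme is only appropriate when the Redis instance holds data written by trusted code.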
--- a/cmd_arg/init.py
+++ b/cmd_arg/init.py
@ -0,0 +1 @@
 from .arg import *
--- a/cmd_arg/arg.py
+++ b/cmd_arg/arg.py
@ -0,0 +1,40 @@
 import argparse
 import config
 from tools.utils import str2bool
 async def parse_cmd():
    # Parse command-line arguments
    parser = argparse.ArgumentParser(description='Media crawler program.')
    parser.add_argument('--platform', type=str, help='Media platform select (xhs | dy | ks | bili | wb)',
                        choices=["xhs", "dy", "ks", "bili", "wb"], default=config.PLATFORM)
    parser.add_argument('--lt', type=str, help='Login type (qrcode | phone | cookie)',
                        choices=["qrcode", "phone", "cookie"], default=config.LOGIN_TYPE)
    parser.add_argument('--type', type=str, help='crawler type (search | detail | creator | explore)',
                        choices=["search", "detail", "creator", "explore"], default=config.CRAWLER_TYPE)
    parser.add_argument('--start', type=int,
                        help='number of start page', default=config.START_PAGE)
    parser.add_argument('--keywords', type=str,
                        help='please input keywords', default=config.KEYWORDS)
    parser.add_argument('--get_comment', type=str2bool,
                        help='''whether to crawl level one comment, supported values case insensitive ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_COMMENTS)
    parser.add_argument('--get_sub_comment', type=str2bool,
                        help='''whether to crawl level two comment, supported values case insensitive ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_SUB_COMMENTS)
    parser.add_argument('--save_data_option', type=str,
                        help='where to save the data (csv or db or json)', choices=['csv', 'db', 'json'], default=config.SAVE_DATA_OPTION)
    parser.add_argument('--cookies', type=str,
                        help='cookies used for cookie login type', default=config.COOKIES)
    args = parser.parse_args()
    # override config
    config.PLATFORM = args.platform
    config.LOGIN_TYPE = args.lt
    config.CRAWLER_TYPE = args.type
    config.START_PAGE = args.start
    config.KEYWORDS = args.keywords
    config.ENABLE_GET_COMMENTS = args.get_comment
    config.ENABLE_GET_SUB_COMMENTS = args.get_sub_comment
    config.SAVE_DATA_OPTION = args.save_data_option
    config.COOKIES = args.cookies
--- a/config/init.py
+++ b/config/init.py
@ -0,0 +1,2 @@
 from .base_config import *
 from .db_config import *
--- a/config/base_config.py
+++ b/config/base_config.py
@ -0,0 +1,131 @@
 # Basic configuration
 PLATFORM = "xhs"
 KEYWORDS = "python,golang"
 LOGIN_TYPE = "qrcode"  # qrcode or phone or cookie
 COOKIES = ""
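 # See the enum values under media_platform.xxx.field; currently only Xiaohongshu is supported
 SORT_TYPE = "popularity_descending"
 # See the enum values under media_platform.xxx.field; currently only Douyin is supported
 PUBLISH_TIME_TYPE = 0
 CRAWLER_TYPE = "search"  # Crawl type: search (keyword search) | detail (post detail) | creator (creator home page data)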
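 # Whether to enable the IP proxy
 ENABLE_IP_PROXY = False
 # Number of IPs in the proxy pool
 IP_PROXY_POOL_COUNT = 2
 # Name of the proxy IP provider
 IP_PROXY_PROVIDER_NAME = "kuaidaili"
 # True runs the browser headless (no window is opened)
 # False opens a browser window
 # If Xiaohongshu QR-code login keeps failing, open the browser and pass the slider captcha manually
 # If Douyin keeps reporting failure, open the browser and check whether a phone-number verification appeared after the QR-code login; if it did, pass it manually and try again.
 HEADLESS = False
 # Whether to save the login state
 SAVE_LOGIN_STATE = True
 # Data storage option; three types are supported: csv, db, json
 SAVE_DATA_OPTION = "json"  # csv or db or json
 # Directory for the browser's cached user data
 USER_DATA_DIR = "%s_user_data_dir"  # %s will be replaced by platform name
 # Page to start crawling from; defaults to the first page
 START_PAGE = 1
 # Maximum number of videos/posts to crawl
 CRAWLER_MAX_NOTES_COUNT = 20
 # Maximum number of concurrent crawlers
 MAX_CONCURRENCY_NUM = 4
 # Whether to crawl images; off by default
 ENABLE_GET_IMAGES = False
 # Whether to crawl comments; off by default
 ENABLE_GET_COMMENTS = False
 # Whether to crawl second-level comments; off by default, currently supported only for xhs and bilibili
 # If an older version of the project used db, add the table columns as described at schema/tables.sql line 287
 ENABLE_GET_SUB_COMMENTS = False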
 # List of Xiaohongshu note IDs to crawl
 # 667a0c27000000001e010d42
 XHS_SPECIFIED_ID_LIST = [
    "6422c2750000000027000d88",
    "64ca1b73000000000b028dd2",
    "630d5b85000000001203ab41",
    # ........................
 ]
 # List of Douyin IDs to crawl
 DY_SPECIFIED_ID_LIST = [
    "7280854932641664319",
    "7202432992642387233"
    # ........................
 ]
 # List of Kuaishou IDs to crawl
 KS_SPECIFIED_ID_LIST = [
    "3xf8enb8dbj6uig",
    "3x6zz972bchmvqe"
 ]
 # List of Bilibili video bvids to crawl
 BILI_SPECIFIED_ID_LIST = [
    "BV1d54y1g7db",
    "BV1Sz4y1U77N",
    "BV14Q4y1n7jz",
    # ........................
 ]
 # List of Weibo post IDs to crawl
 WEIBO_SPECIFIED_ID_LIST = [
    "4982041758140155",
    # ........................
 ]
 # List of Xiaohongshu creator IDs
 XHS_CREATOR_ID_LIST = [
    "5c4548d80000000006030727",
    # "63e36c9a000000002703502b",
    # ........................
 ]
 # List of Douyin creator IDs (sec_id)
 DY_CREATOR_ID_LIST = [
    "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
    # ........................
 ]
 # List of Bilibili creator IDs (sec_id)
 BILI_CREATOR_ID_LIST = [
    "20813884",
    # ........................
 ]
 # List of Kuaishou creator IDs
 KS_CREATOR_ID_LIST = [
    "3x4sm73aye7jq7i",
    # ........................
 ]
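 # Word cloud settings
 # Whether to generate a word cloud image from comments
 ENABLE_GET_WORDCLOUD = False
 # Custom words and their groups
 # Rule: xx:yy, where xx is the custom word and yy is the name of the group xx is assigned to.
 CUSTOM_WORDS = {
    '零几': '年份',  # treat "零几" as a single token
    '高频词': '专业术语'  # example custom word
 }
 # Path to the stop-word file
 STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
 # Path to the Chinese font file
 FONT_PATH = "./docs/STZHONGS.TTF"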
--- a/config/db_config.py
+++ b/config/db_config.py
@ -0,0 +1,20 @@
 import os
 # mysql config
 RELATION_DB_PWD = os.getenv("RELATION_DB_PWD", "123456")
 RELATION_DB_USER = os.getenv("RELATION_DB_USER", "root")
 RELATION_DB_HOST = os.getenv("RELATION_DB_HOST", "localhost")
 RELATION_DB_PORT = os.getenv("RELATION_DB_PORT", "3306")
 RELATION_DB_NAME = os.getenv("RELATION_DB_NAME", "media_crawler")
 RELATION_DB_URL = f"mysql://{RELATION_DB_USER}:{RELATION_DB_PWD}@{RELATION_DB_HOST}:{RELATION_DB_PORT}/{RELATION_DB_NAME}"
 # redis config
 REDIS_DB_HOST = "127.0.0.1"  # your redis host
 REDIS_DB_PWD = os.getenv("REDIS_DB_PWD", "123456")  # your redis password
 REDIS_DB_PORT = int(os.getenv("REDIS_DB_PORT", 6379))  # your redis port (env values arrive as strings)
 REDIS_DB_NUM = int(os.getenv("REDIS_DB_NUM", 0))  # your redis db num (env values arrive as strings)
 # cache type
 CACHE_TYPE_REDIS = "redis"
 CACHE_TYPE_MEMORY = "memory"
--- a/db.py
+++ b/db.py
@ -0,0 +1,96 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/4/6 14:54
 # @Desc    : mediacrawler db management
 import asyncio
 from typing import Dict
 from urllib.parse import urlparse
 import aiofiles
 import aiomysql
 import config
 from async_db import AsyncMysqlDB
 from tools import utils
 from var import db_conn_pool_var, media_crawler_db_var
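 def parse_mysql_url(mysql_url) -> Dict:
    """
    Parse the db connection URL from the config into keyword arguments for aiomysql,
    because aiomysql does not accept connection info as a URL.
    Args:
        mysql_url: mysql://root:{RELATION_DB_PWD}@localhost:3306/media_crawler
    Returns:
        A dict with the keys host, port, user, password and db.
    """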
    parsed_url = urlparse(mysql_url)
    db_params = {
        'host': parsed_url.hostname,
        'port': parsed_url.port or 3306,
        'user': parsed_url.username,
        'password': parsed_url.password,
        'db': parsed_url.path.lstrip('/')
    }
    return db_params
 async def init_mediacrawler_db():
    """
    Initialize the database connection pool and store it in the media_crawler_db_var context variable.
    Returns:
    """
    db_conn_params = parse_mysql_url(config.RELATION_DB_URL)
    pool = await aiomysql.create_pool(
        autocommit=True,
        **db_conn_params
    )
    async_db_obj = AsyncMysqlDB(pool)
    # Put the connection pool and the wrapped CRUD SQL interface object into context variables
    db_conn_pool_var.set(pool)
    media_crawler_db_var.set(async_db_obj)
 async def init_db():
    """
    Initialize the db connection pool.
    Returns:
    """
    utils.logger.info("[init_db] start init mediacrawler db connect object")
    await init_mediacrawler_db()
    utils.logger.info("[init_db] end init mediacrawler db connect object")
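 async def close():
    """
    Close the connection pool.
    Returns:
    """
    utils.logger.info("[close] close mediacrawler db pool")
    db_pool: aiomysql.Pool = db_conn_pool_var.get()
    if db_pool is not None:
        db_pool.close()
        # close() only marks the pool as closing; wait until the connections are actually released
        await db_pool.wait_closed()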
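 async def init_table_schema():
    """
    Initialize the database table schema. Use it only the first time the tables need to be created:
    running it again deletes all existing tables and their data.
    Returns:
    """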
    utils.logger.info("[init_table_schema] begin init mysql table schema ...")
    await init_mediacrawler_db()
    async_db_obj: AsyncMysqlDB = media_crawler_db_var.get()
    async with aiofiles.open("schema/tables.sql", mode="r") as f:
        schema_sql = await f.read()
        await async_db_obj.execute(schema_sql)
        utils.logger.info("[init_table_schema] mediacrawler table schema init successful")
        await close()
 if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(init_table_schema())
--- a/docs/STZHONGS.TTF
+++ b/docs/STZHONGS.TTF
--- a/docs/hit_stopwords.txt
+++ b/docs/hit_stopwords.txt
@ -0,0 +1,768 @@
 \n
 ———
 》），
 ）÷（１－
 ”，
 ）、
 ＝（
 :
 →
 ℃ 
 &
 *
 一一
 ~~~~
 ’
 . 
 『
 .一
 ./
 -- 
 』
 ＝″
 【
 ［＊］
 ｝＞
 ［⑤］］
 ［①Ｄ］
 ｃ］
 ｎｇ昉
 ＊
 //
 ［
 ］
 ［②ｅ］
 ［②ｇ］
 ＝｛
 }
 ，也 
 ‘
 Ａ
 ［①⑥］
 ［②Ｂ］ 
 ［①ａ］
 ［④ａ］
 ［①③］
 ［③ｈ］
 ③］
 １． 
 －－ 
 ［②ｂ］
 ’‘ 
 ××× 
 ［①⑧］
 ０：２ 
 ＝［
 ［⑤ｂ］
 ［②ｃ］ 
 ［④ｂ］
 ［②③］
 ［③ａ］
 ［④ｃ］
 ［①⑤］
 ［①⑦］
 ［①ｇ］
 ∈［ 
 ［①⑨］
 ［①④］
 ［①ｃ］
 ［②ｆ］
 ［②⑧］
 ［②①］
 ［①Ｃ］
 ［③ｃ］
 ［③ｇ］
 ［②⑤］
 ［②②］
 一.
 ［①ｈ］
 .数
 ［］
 ［①Ｂ］
 数/
 ［①ｉ］
 ［③ｅ］
 ［①①］
 ［④ｄ］
 ［④ｅ］
 ［③ｂ］
 ［⑤ａ］
 ［①Ａ］
 ［②⑧］
 ［②⑦］
 ［①ｄ］
 ［②ｊ］
 〕〔
 ］［
 ://
 ′∈
 ［②④
 ［⑤ｅ］
 １２％
 ｂ］
 ...
 ...................
 …………………………………………………③
 ＺＸＦＩＴＬ
 ［③Ｆ］
 」
 ［①ｏ］
 ］∧′＝［ 
 ∪φ∈
 ′｜
 ｛－
 ②ｃ
 ｝
 ［③①］
 Ｒ．Ｌ．
 ［①Ｅ］
 Ψ
 －［＊］－
 ↑
 .日 
 ［②ｄ］
 ［②
 ［②⑦］
 ［②②］
 ［③ｅ］
 ［①ｉ］
 ［①Ｂ］
 ［①ｈ］
 ［①ｄ］
 ［①ｇ］
 ［①②］
 ［②ａ］
 ｆ］
 ［⑩］
 ａ］
 ［①ｅ］
 ［②ｈ］
 ［②⑥］
 ［③ｄ］
 ［②⑩］
 ｅ］
 〉
 】
 元／吨
 ［②⑩］
 ２．３％
 ５：０  
 ［①］
 ::
 ［②］
 ［③］
 ［④］
 ［⑤］
 ［⑥］
 ［⑦］
 ［⑧］
 ［⑨］ 
 ……
 ——
 ?
 、
 。
 “
 ”
 《
 》
 ！
 ，
 ：
 ；
 ？
 ．
 ,
 ．
 '
 ? 
 ·
 ———
 ──
 ? 
 —
 <
 >
 （
 ）
 〔
 〕
 [
 ]
 (
 )
 -
 +
 ～
 ×
 ／
 /
 ①
 ②
 ③
 ④
 ⑤
 ⑥
 ⑦
 ⑧
 ⑨
 ⑩
 Ⅲ
 В
 "
 ;
 #
@
 γ
 μ
 φ
 φ．
 × 
 Δ
 ■
 ▲
 sub
 exp 
 sup
 sub
 Lex 
 ＃
 ％
 ＆
 ＇
 ＋
 ＋ξ
 ＋＋
 －
 －β
 ＜
 ＜±
 ＜Δ
 ＜λ
 ＜φ
 ＜＜
 =
 ＝
 ＝☆
 ＝－
 ＞
 ＞λ
 ＿
 ～±
 ～＋
 ［⑤ｆ］
 ［⑤ｄ］
 ［②ｉ］
 ≈ 
 ［②Ｇ］
 ［①ｆ］
 ＬＩ
 ㈧ 
 ［－
 ......
 〉
 ［③⑩］
 第二
 一番
 一直
 一个
 一些
 许多
 种
 有的是
 也就是说
 末##末
 啊
 阿
 哎
 哎呀
 哎哟
 唉
 俺
 俺们
 按
 按照
 吧
 吧哒
 把
 罢了
 被
 本
 本着
 比
 比方
 比如
 鄙人
 彼
 彼此
 边
 别
 别的
 别说
 并
 并且
 不比
 不成
 不单
 不但
 不独
 不管
 不光
 不过
 不仅
 不拘
 不论
 不怕
 不然
 不如
 不特
 不惟
 不问
 不只
 朝
 朝着
 趁
 趁着
 乘
 冲
 除
 除此之外
 除非
 除了
 此
 此间
 此外
 从
 从而
 打
 待
 但
 但是
 当
 当着
 到
 得
 的
 的话
 等
 等等
 地
 第
 叮咚
 对
 对于
 多
 多少
 而
 而况
 而且
 而是
 而外
 而言
 而已
 尔后
 反过来
 反过来说
 反之
 非但
 非徒
 否则
 嘎
 嘎登
 该
 赶
 个
 各
 各个
 各位
 各种
 各自
 给
 根据
 跟
 故
 故此
 固然
 关于
 管
 归
 果然
 果真
 过
 哈
 哈哈
 呵
 和
 何
 何处
 何况
 何时
 嘿
 哼
 哼唷
 呼哧
 乎
 哗
 还是
 还有
 换句话说
 换言之
 或
 或是
 或者
 极了
 及
 及其
 及至
 即
 即便
 即或
 即令
 即若
 即使
 几
 几时
 己
 既
 既然
 既是
 继而
 加之
 假如
 假若
 假使
 鉴于
 将
 较
 较之
 叫
 接着
 结果
 借
 紧接着
 进而
 尽
 尽管
 经
 经过
 就
 就是
 就是说
 据
 具体地说
 具体说来
 开始
 开外
 靠
 咳
 可
 可见
 可是
 可以
 况且
 啦
 来
 来着
 离
 例如
 哩
 连
 连同
 两者
 了
 临
 另
 另外
 另一方面
 论
 嘛
 吗
 慢说
 漫说
 冒
 么
 每
 每当
 们
 莫若
 某
 某个
 某些
 拿
 哪
 哪边
 哪儿
 哪个
 哪里
 哪年
 哪怕
 哪天
 哪些
 哪样
 那
 那边
 那儿
 那个
 那会儿
 那里
 那么
 那么些
 那么样
 那时
 那些
 那样
 乃
 乃至
 呢
 能
 你
 你们
 您
 宁
 宁可
 宁肯
 宁愿
 哦
 呕
 啪达
 旁人
 呸
 凭
 凭借
 其
 其次
 其二
 其他
 其它
 其一
 其余
 其中
 起
 起见
 起见
 岂但
 恰恰相反
 前后
 前者
 且
 然而
 然后
 然则
 让
 人家
 任
 任何
 任凭
 如
 如此
 如果
 如何
 如其
 如若
 如上所述
 若
 若非
 若是
 啥
 上下
 尚且
 设若
 设使
 甚而
 甚么
 甚至
 省得
 时候
 什么
 什么样
 使得
 是
 是的
 首先
 谁
 谁知
 顺
 顺着
 似的
 虽
 虽然
 虽说
 虽则
 随
 随着
 所
 所以
 他
 他们
 他人
 它
 它们
 她
 她们
 倘
 倘或
 倘然
 倘若
 倘使
 腾
 替
 通过
 同
 同时
 哇
 万一
 往
 望
 为
 为何
 为了
 为什么
 为着
 喂
 嗡嗡
 我
 我们
 呜
 呜呼
 乌乎
 无论
 无宁
 毋宁
 嘻
 吓
 相对而言
 像
 向
 向着
 嘘
 呀
 焉
 沿
 沿着
 要
 要不
 要不然
 要不是
 要么
 要是
 也
 也罢
 也好
 一
 一般
 一旦
 一方面
 一来
 一切
 一样
 一则
 依
 依照
 矣
 以
 以便
 以及
 以免
 以至
 以至于
 以致
 抑或
 因
 因此
 因而
 因为
 哟
 用
 由
 由此可见
 由于
 有
 有的
 有关
 有些
 又
 于
 于是
 于是乎
 与
 与此同时
 与否
 与其
 越是
 云云
 哉
 再说
 再者
 在
 在下
 咱
 咱们
 则
 怎
 怎么
 怎么办
 怎么样
 怎样
 咋
 照
 照着
 者
 这
 这边
 这儿
 这个
 这会儿
 这就是说
 这里
 这么
 这么点儿
 这么些
 这么样
 这时
 这些
 这样
 正如
 吱
 之
 之类
 之所以
 之一
 只是
 只限
 只要
 只有
 至
 至于
 诸位
 着
 着呢
 自
 自从
 自个儿
 自各儿
 自己
 自家
 自身
 综上所述
 总的来看
 总的来说
 总的说来
 总而言之
 总之
 纵
 纵令
 纵然
 纵使
 遵照
 作为
 兮
 呃
 呗
 咚
 咦
 喏
 啐
 喔唷
 嗬
 嗯
 嗳
--- a/docs/代理使用.md
+++ b/docs/代理使用.md
@ -0,0 +1,47 @@
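 ## Proxy IP usage guide
 > Once more for emphasis: do not run large-scale crawls or any other illegal activity against media platforms, or you may end up behind bars 🤣
 ### A simple flowchart
 ![Proxy IP usage flowchart](../static/images/代理IP%20流程图.drawio.png)
 ### Prepare the proxy IP account
 Register on the <a href="https://www.kuaidaili.com/?ref=ldwkjqipvz6c">KuaiDaiLi</a> website and complete real-name verification (using proxy IPs in mainland China requires it)
 ### Get the proxy IP credentials
 Get a free trial from the <a href="https://www.kuaidaili.com/?ref=ldwkjqipvz6c">KuaiDaiLi</a> website, as shown below
 ![img.png](../static/images/img.png)
 Note: choose the private proxy
 ![img_1.png](../static/images/img_1.png)
 Choose to activate the trial
 ![img_2.png](../static/images/img_2.png)
 Create a KuaiDaiLi instance as in the code below; it needs four parameters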
 ```python
 def new_kuai_daili_proxy() -> KuaiDaiLiProxy:
    """
    Build a KuaiDaiLi HTTP proxy instance
    Returns:
    """
    return KuaiDaiLiProxy(
        kdl_secret_id=os.getenv("kdl_secret_id", "your KuaiDaiLi secret_id"),
        kdl_signature=os.getenv("kdl_signature", "your KuaiDaiLi signature"),
        kdl_user_name=os.getenv("kdl_user_name", "your KuaiDaiLi user name"),
        kdl_user_pwd=os.getenv("kdl_user_pwd", "your KuaiDaiLi password"),
    )
 ```
 All four parameters can be found in the trial order, as shown in the images below
 `kdl_user_name`、`kdl_user_pwd`
 ![img_3.png](../static/images/img_3.png)
 `kdl_secret_id`、`kdl_signature`
 ![img_4.png](../static/images/img_4.png)
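
Before starting a crawl it can help to confirm that all four values are actually set. The sketch below is one way to check, assuming the credentials are supplied through the same environment variable names as in the snippet above. `read_kdl_credentials` is a hypothetical helper and is not part of the project.

```python
import os
from typing import Dict, List, Mapping, Tuple

# The four variable names come from the snippet above; the helper itself is hypothetical.
KDL_ENV_KEYS = ("kdl_secret_id", "kdl_signature", "kdl_user_name", "kdl_user_pwd")


def read_kdl_credentials(env: Mapping[str, str] = os.environ) -> Tuple[Dict[str, str], List[str]]:
    """Return (credentials that are set, names of the variables that are missing or empty)."""
    found = {key: env[key] for key in KDL_ENV_KEYS if env.get(key)}
    missing = [key for key in KDL_ENV_KEYS if key not in found]
    return found, missing
```

If the second return value is not empty, the listed variables still need to be exported (or the placeholder defaults replaced) before the proxy can authenticate.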
 ### Set `ENABLE_IP_PROXY` to `True` in the config file
 > `IP_PROXY_POOL_COUNT` is the number of IPs in the pool
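
For reference, the relevant part of `config/base_config.py` would then look roughly like this. Only `ENABLE_IP_PROXY` changes; the other two values are the repository defaults and are shown for context.

```python
# config/base_config.py (fragment)
ENABLE_IP_PROXY = True                 # turn the proxy pool on
IP_PROXY_POOL_COUNT = 2                # number of IPs kept in the pool
IP_PROXY_PROVIDER_NAME = "kuaidaili"   # name of the proxy provider
```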
--- a/docs/关于词云图相关操作.md
+++ b/docs/关于词云图相关操作.md
@ -0,0 +1,58 @@
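 # Word cloud operations
 ### 1. How to enable the word cloud
 ***Note: a word cloud is currently generated only when data is saved as JSON files. Support for the other storage options is planned.***
 Configuration items to change (./config/base_config.py):
 ```python
 # Data storage option; three types are supported: csv, db, json
 # Must be json here, for the reason given above
 SAVE_DATA_OPTION = "json"  # csv or db or json
 ```
 ```python
 # Whether to crawl comments; off by default
 # Must be True here: comments have to be crawled before a comment word cloud can be generated.
 ENABLE_GET_COMMENTS = True
 ```
 ```python
 # Word cloud settings
 # Whether to generate a word cloud image from comments
 # Turn the word cloud feature on
 ENABLE_GET_WORDCLOUD = True
 ```
 ```python
 # Custom words and their groups
 # Rule: xx:yy, where xx is the custom word and yy is the name of the group xx is assigned to.
 CUSTOM_WORDS = {
    '零几': '年份',  # treat "零几" as a single token
    '高频词': '专业术语'  # example custom word
 }
 ```
 ```python
 # Path to the stop-word file
 STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
 ```
 ```python
 # Path to the Chinese font file
 FONT_PATH = "./docs/STZHONGS.TTF"
 ```
 **Explanation**
 - Custom words: in `xx:yy`, `xx` is the custom word and `yy` is the group assigned to `xx`. `yy` can be any value.
 - To add stop words, append them to ./docs/hit_stopwords.txt (keep the format: one word per line)
 - `FONT_PATH` is the Chinese font file used in the word cloud; the default is a Song-style typeface. You can add your own font file and change the path.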
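
To make the stop-word format concrete, here is a minimal standard-library sketch of how a one-word-per-line stop-word list can be applied before counting word frequencies. It is not the project's implementation: `load_stop_words` and `count_words` are hypothetical names, and the comments are assumed to have been split into tokens already by a separate tokenizer.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def load_stop_words(lines: Iterable[str]) -> Set[str]:
    """Parse stop words from lines in the one-word-per-line format, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def count_words(tokens: List[str], stop_words: Set[str]) -> Dict[str, int]:
    """Count tokens, skipping stop words and blank tokens."""
    kept = [token for token in tokens if token.strip() and token not in stop_words]
    return dict(Counter(kept))


# In practice the lines would come from open("./docs/hit_stopwords.txt", encoding="utf-8")
stop_words = load_stop_words(["的\n", "了\n", "\n"])
tokens = ["零几", "的", "高频词", "零几", "了"]
print(count_words(tokens, stop_words))  # {'零几': 2, '高频词': 1}
```

Stripping each line matters because several entries in the stop-word file carry leading or trailing spaces.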
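 ### 2. Where the word cloud is saved
 ![image-20240627204928601](https://rosyrain.oss-cn-hangzhou.aliyuncs.com/img2/202406272049662.png)
 As shown, the output is in the `words` folder under the data directory: the JSON file holds the word-frequency statistics and the PNG file is the word cloud. The original comment content is in the `json` folder.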
--- a/docs/常见问题.md
+++ b/docs/常见问题.md
@ -0,0 +1,31 @@
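 ## Common runtime errors
 Q: Crawling Douyin fails with `execjs._exceptions.ProgramError: SyntaxError: 缺少 ';'` (missing ';') <br>
 A: This error means the Node.js environment is missing. Installing Node.js fixes it; the version to use is `v16.8.0` <br>
 Q: Crawling Douyin with cookies fails with `execjs._exceptions.ProgramError: TypeError: Cannot read property 'JS_MD5_NO_COMMON_JS' of null` <br>
 A: On Windows, download the Windows 64-bit Installer from `https://nodejs.org/en/blog/release/v16.8.0` and click Next through the installer. <br>
 Q: Can I crawl by specific keywords?<br>
 A: The KEYWORDS setting in config/base_config.py controls which keywords are crawled <br>
 Q: Can I crawl specific posts?<br>
 A: The XHS_SPECIFIED_ID_LIST setting in config/base_config.py holds the list of post IDs to crawl <br>
 Q: Crawling works at first, then stops working after a while?<br>
 A: Most likely your account has triggered the platform's risk-control mechanism. ❗️❗️Do not crawl the platform at large scale or disrupt it.<br>
 Q: How do I switch to another login account?<br>
 A: Delete the browser_data/ folder in the project root <br>
 Q: Error `playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.`<br>
 A: If this happens, check whether a VPN or proxy is running<br>
 Q: How do I verify manually after a successful Xiaohongshu QR-code login?<br>
 A: Open config/base_config.py, find the HEADLESS setting and set it to False, then restart the project and pass the captcha manually in the browser<br>
 Q: How do I configure word cloud generation?<br>
 A: Open config/base_config.py, find the `ENABLE_GET_WORDCLOUD` and `ENABLE_GET_COMMENTS` settings, and set both to True.<br>
 Q: How do I add stop words and custom words to the word cloud?<br>
 A: Open `docs/hit_stopwords.txt` and add the stop words (one word per line). Open config/base_config.py, find `CUSTOM_WORDS`, and add the custom words in the documented format.<br>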
--- a/docs/手机号登录说明.md
+++ b/docs/手机号登录说明.md
@ -0,0 +1,20 @@
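 ## About phone number + verification code login
 When the browser simulates a user submitting a phone-number login request, an SMS forwarding app sends the verification code to the crawler, which fills it in and completes the login automatically
 Preparation:
 - One Android phone (iOS has not been investigated; in theory, monitoring SMS there should also work)
 - Install an SMS forwarding app [reference repository](https://github.com/pppscn/SmsForwarder)
 - In the forwarding app, configure the WEBHOOK settings, mainly the message template (see recv_sms_notification.py in this project) and an API address that SMS notifications can be pushed to
 - The push API address usually needs a domain name bound to it (an intranet IP address also works). I use intranet tunneling, which binds a free domain name to the intranet web server; tunneling tool: [ngrok](https://ngrok.com/docs/)
 - Install redis and set a password [redis installation](https://www.cnblogs.com/hunanzp/p/12304622.html)
 - Run `python recv_sms_notification.py` and wait for the SMS forwarder to send HTTP notifications
 - Run the crawler with phone-number login: `python main.py --platform xhs --lt phone`
 Notes:
 - Xiaohongshu allows only 10 SMS messages per phone number per day (go easy). Sending verification codes has not triggered the slider captcha so far, but it probably will after enough requests
 - Will the SMS forwarding app monitor other messages on your phone? (In theory it should not, since the [SMS forwarding repository](https://github.com/pppscn/SmsForwarder) has a fair number of stars)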
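
The step where the crawler side picks the verification code out of a forwarded SMS can be sketched as follows. This is only an illustration under stated assumptions: the code is assumed to be a standalone run of 4 to 6 digits, and `extract_sms_code` is a hypothetical helper, not the logic in recv_sms_notification.py (which is not shown here).

```python
import re
from typing import Optional

# Assumption: the code is a run of 4 to 6 digits that is not part of a longer number.
CODE_PATTERN = re.compile(r"(?<!\d)(\d{4,6})(?!\d)")


def extract_sms_code(sms_text: str) -> Optional[str]:
    """Return the first 4-6 digit code found in the SMS text, or None if there is none."""
    match = CODE_PATTERN.search(sms_text)
    return match.group(1) if match else None
```

The lookarounds keep the pattern from matching part of a longer number, such as a phone number that appears in the same message.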
--- a/docs/项目代码结构.md
+++ b/docs/项目代码结构.md
@ -0,0 +1,38 @@
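 ## Project code structure
 ```
 MediaCrawler
 ├── base
 │   └── base_crawler.py         # abstract base classes of the project
 ├── browser_data                # the user's browser data directory
 ├── config
 │   ├── account_config.py       # account and proxy pool configuration
 │   ├── base_config.py          # basic configuration
 │   └── db_config.py            # database configuration
 ├── data                        # directory where data is saved
 ├── libs
 │   ├── douyin.js               # Douyin sign function
 │   └── stealth.min.js          # JS that removes browser automation fingerprints
 ├── media_platform
 │   ├── douyin                  # Douyin crawler implementation
 │   ├── xhs                     # Xiaohongshu crawler implementation
 │   ├── bilibili                # Bilibili crawler implementation
 │   └── kuaishou                # Kuaishou crawler implementation
 ├── modles
 │   ├── douyin.py               # Douyin data model
 │   ├── xiaohongshu.py          # Xiaohongshu data model
 │   ├── kuaishou.py             # Kuaishou data model
 │   └── bilibili.py             # Bilibili data model
 ├── tools
 │   ├── utils.py                # utility functions exposed to other modules
 │   ├── crawler_util.py         # crawler utility functions
 │   ├── slider_util.py          # slider captcha utility functions
 │   ├── time_util.py            # time utility functions
 │   ├── easing.py               # functions that simulate a sliding trajectory
 │   └── words.py                # functions that generate the word cloud
 ├── db.py                       # DB ORM
 ├── main.py                     # program entry point
 ├── var.py                      # context variable definitions
 └── recv_sms_notification.py    # HTTP server endpoint for the SMS forwarder
 ```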
--- a/libs/douyin.js
+++ b/libs/douyin.js
--- a/libs/stealth.min.js
+++ b/libs/stealth.min.js
--- a/main.py
+++ b/main.py
@ -0,0 +1,51 @@
 import asyncio
 import sys
 import cmd_arg
 import config
 import db
 from base.base_crawler import AbstractCrawler
 from media_platform.bilibili import BilibiliCrawler
 from media_platform.douyin import DouYinCrawler
 from media_platform.kuaishou import KuaishouCrawler
 from media_platform.weibo import WeiboCrawler
 from media_platform.xhs import XiaoHongShuCrawler
 class CrawlerFactory:
    CRAWLERS = {
        "xhs": XiaoHongShuCrawler,
        "dy": DouYinCrawler,
        "ks": KuaishouCrawler,
        "bili": BilibiliCrawler,
        "wb": WeiboCrawler
    }
    @staticmethod
    def create_crawler(platform: str) -> AbstractCrawler:
        crawler_class = CrawlerFactory.CRAWLERS.get(platform)
        if not crawler_class:
            raise ValueError("Invalid media platform. Currently only xhs, dy, ks, bili and wb are supported.")
        return crawler_class()
 async def main():
    # parse cmd
    await cmd_arg.parse_cmd()
    # init db
    if config.SAVE_DATA_OPTION == "db":
        await db.init_db()
    crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
    await crawler.start()
    if config.SAVE_DATA_OPTION == "db":
        await db.close()
 if __name__ == '__main__':
    try:
        # asyncio.run(main())
        asyncio.get_event_loop().run_until_complete(main())
    except KeyboardInterrupt:
        sys.exit()
--- a/media_platform/init.py
+++ b/media_platform/init.py
--- a/media_platform/bilibili/init.py
+++ b/media_platform/bilibili/init.py
@ -0,0 +1,6 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 18:36
 # @Desc    :
 from .core import *
--- a/media_platform/bilibili/client.py
+++ b/media_platform/bilibili/client.py
@ -0,0 +1,287 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 18:44
 # @Desc    : bilibili request client
 import asyncio
 import json
 from typing import Any, Callable, Dict, List, Optional, Tuple, Union
 from urllib.parse import urlencode
 import httpx
 from playwright.async_api import BrowserContext, Page
 from base.base_crawler import AbstractApiClient
 from tools import utils
 from .exception import DataFetchError
 from .field import CommentOrderType, SearchOrderType
 from .help import BilibiliSign
 class BilibiliClient(AbstractApiClient):
    def __init__(
            self,
            timeout=10,
            proxies=None,
            *,
            headers: Dict[str, str],
            playwright_page: Page,
            cookie_dict: Dict[str, str],
    ):
        self.proxies = proxies
        self.timeout = timeout
        self.headers = headers
        self._host = "https://api.bilibili.com"
        self.playwright_page = playwright_page
        self.cookie_dict = cookie_dict
    async def request(self, method, url, **kwargs) -> Any:
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request(
                method, url, timeout=self.timeout,
                **kwargs
            )
        data: Dict = response.json()
        if data.get("code") != 0:
            raise DataFetchError(data.get("message", "unknown error"))
        else:
            return data.get("data", {})
    async def pre_request_data(self, req_data: Dict) -> Dict:
        """
        Sign the request parameters before the request is sent.
        The wbi_img_urls value is read from localStorage and looks like this:
        https://i0.hdslb.com/bfs/wbi/7cd084941338484aae1ad9425b84077c.png-https://i0.hdslb.com/bfs/wbi/4932caff0ff746eab6f01bf08b70ac45.png
        :param req_data:
        :return:
        """
        if not req_data:
            return {}
        img_key, sub_key = await self.get_wbi_keys()
        return BilibiliSign(img_key, sub_key).sign(req_data)
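    async def get_wbi_keys(self) -> Tuple[str, str]:
        """
        Get the latest img_key and sub_key.
        :return:
        """
        local_storage = await self.playwright_page.evaluate("() => window.localStorage")
        wbi_img_urls = local_storage.get("wbi_img_urls", "")
        if not wbi_img_urls:
            # Fall back to the two separate keys, and join them only when both are present;
            # concatenating a missing (None) value with "-" would raise a TypeError
            storage_img_url = local_storage.get("wbi_img_url")
            storage_sub_url = local_storage.get("wbi_sub_url")
            if storage_img_url and storage_sub_url:
                wbi_img_urls = f"{storage_img_url}-{storage_sub_url}"
        if wbi_img_urls and "-" in wbi_img_urls:
            img_url, sub_url = wbi_img_urls.split("-")
        else: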
            resp = await self.request(method="GET", url=self._host + "/x/web-interface/nav")
            img_url: str = resp['wbi_img']['img_url']
            sub_url: str = resp['wbi_img']['sub_url']
        img_key = img_url.rsplit('/', 1)[1].split('.')[0]
        sub_key = sub_url.rsplit('/', 1)[1].split('.')[0]
        return img_key, sub_key
    async def get(self, uri: str, params=None, enable_params_sign: bool = True) -> Dict:
        final_uri = uri
        if enable_params_sign:
            params = await self.pre_request_data(params)
        if isinstance(params, dict):
            final_uri = (f"{uri}?"
                         f"{urlencode(params)}")
        return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=self.headers)
    async def post(self, uri: str, data: dict) -> Dict:
        data = await self.pre_request_data(data)
        json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
        return await self.request(method="POST", url=f"{self._host}{uri}",
                                  data=json_str, headers=self.headers)
    async def pong(self) -> bool:
        """get a note to check if login state is ok"""
        utils.logger.info("[BilibiliClient.pong] Begin pong bilibili...")
        ping_flag = False
        try:
            check_login_uri = "/x/web-interface/nav"
            response = await self.get(check_login_uri)
            if response.get("isLogin"):
                utils.logger.info(
                    "[BilibiliClient.pong] Use cache login state get web interface successful!")
                ping_flag = True
        except Exception as e:
            utils.logger.error(
                f"[BilibiliClient.pong] Pong bilibili failed: {e}, and try to login again...")
            ping_flag = False
        return ping_flag
    async def update_cookies(self, browser_context: BrowserContext):
        cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
        self.headers["Cookie"] = cookie_str
        self.cookie_dict = cookie_dict
    async def search_video_by_keyword(self, keyword: str, page: int = 1, page_size: int = 20,
                                      order: SearchOrderType = SearchOrderType.DEFAULT):
        """
        Bilibili web search api
        :param keyword: search keyword
        :param page: page number to fetch
        :param page_size: number of results per page
        :param order: sort order of the search results; defaults to comprehensive ranking
        :return:
        """
        uri = "/x/web-interface/wbi/search/type"
        post_data = {
            "search_type": "video",
            "keyword": keyword,
            "page": page,
            "page_size": page_size,
            "order": order.value
        }
        return await self.get(uri, post_data)
    async def get_video_info(self, aid: Union[int, None] = None, bvid: Union[str, None] = None) -> Dict:
        """
        Bilibili web video detail api; pass either aid or bvid
        :param aid: video avid
        :param bvid: video bvid
        :return:
        """
        if not aid and not bvid:
            raise ValueError("Please provide at least one of aid or bvid")
        uri = "/x/web-interface/view/detail"
        params = dict()
        if aid:
            params.update({"aid": aid})
        else:
            params.update({"bvid": bvid})
        return await self.get(uri, params, enable_params_sign=False)
    async def get_video_comments(self,
                                 video_id: str,
                                 order_mode: CommentOrderType = CommentOrderType.DEFAULT,
                                 next: int = 0
                                 ) -> Dict:
        """get video comments
        :param video_id: video ID
        :param order_mode: sort order
        :param next: cursor of the comment page to fetch
        :return:
        """
        uri = "/x/v2/reply/wbi/main"
        post_data = {
            "oid": video_id,
            "mode": order_mode.value,
            "type": 1,
            "ps": 20,
            "next": next
        }
        return await self.get(uri, post_data)
    async def get_video_all_comments(self, video_id: str, crawl_interval: float = 1.0, is_fetch_sub_comments=False,
                                     callback: Optional[Callable] = None, ):
        """
        get video all comments include sub comments
        :param video_id:
        :param crawl_interval:
        :param is_fetch_sub_comments:
        :param callback:
        :return:
        """
        result = []
        is_end = False
        next_page = 0
        while not is_end:
            comments_res = await self.get_video_comments(video_id, CommentOrderType.DEFAULT, next_page)
            cursor_info: Dict = comments_res.get("cursor")
            comment_list: List[Dict] = comments_res.get("replies", [])
            is_end = cursor_info.get("is_end")
            next_page = cursor_info.get("next")
            if is_fetch_sub_comments:
                for comment in comment_list:
                    comment_id = comment['rpid']
                    if comment.get("rcount", 0) > 0:
                        await self.get_video_all_level_two_comments(
                            video_id, comment_id, CommentOrderType.DEFAULT, 10, crawl_interval, callback)
            if callback:  # run the callback if one was provided
                await callback(video_id, comment_list)
            await asyncio.sleep(crawl_interval)
            if not is_fetch_sub_comments:
                result.extend(comment_list)
                continue
        return result
    async def get_video_all_level_two_comments(self,
                                               video_id: str,
                                               level_one_comment_id: int,
                                               order_mode: CommentOrderType,
                                               ps: int = 10,
                                               crawl_interval: float = 1.0,
                                               callback: Optional[Callable] = None,
                                               ) -> None:
        """
        get video all level two comments for a level one comment
        :param video_id: video ID
        :param level_one_comment_id: level-one comment ID
        :param order_mode: sort order
        :param ps: number of comments per page
        :param crawl_interval:
        :param callback:
        :return:
        """
        pn = 1
        while True:
            result = await self.get_video_level_two_comments(
                video_id, level_one_comment_id, pn, ps, order_mode)
            comment_list: List[Dict] = result.get("replies", [])
            if callback:  # run the callback if one was provided
                await callback(video_id, comment_list)
            await asyncio.sleep(crawl_interval)
            if (int(result["page"]["count"]) <= pn * ps):
                break
            pn += 1
    async def get_video_level_two_comments(self,
                                           video_id: str,
                                           level_one_comment_id: int,
                                           pn: int,
                                           ps: int,
                                           order_mode: CommentOrderType,
                                           ) -> Dict:
        """get video level two comments
        :param video_id: video ID
        :param level_one_comment_id: level-one comment ID
        :param pn: page number
        :param ps: number of comments per page
        :param order_mode: sort order
        :return:
        """
        uri = "/x/v2/reply/reply"
        post_data = {
            "oid": video_id,
            "mode": order_mode.value,
            "type": 1,
            "ps": ps,
            "pn": pn,
            "root": level_one_comment_id,
        }
        result = await self.get(uri, post_data)
        return result
    async def get_creator_videos(self, creator_id: str, pn: int, ps: int = 30, order_mode: SearchOrderType = SearchOrderType.LAST_PUBLISH) -> Dict:
        """get all videos for a creator
        :param creator_id: creator ID
        :param pn: page number
        :param ps: number of videos per page
        :param order_mode: sort order
        :return:
        """
        uri = "/x/space/wbi/arc/search"
        post_data = {
            "mid": creator_id,
            "pn": pn,
            "ps": ps,
            # Send the enum's value, as the other request builders in this class do
            "order": order_mode.value,
        }
        return await self.get(uri, post_data)
--- a/media_platform/bilibili/core.py
+++ b/media_platform/bilibili/core.py
@ -0,0 +1,302 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 18:44
 # @Desc    : Bilibili crawler
 import asyncio
 import os
 import random
 from asyncio import Task
 from typing import Dict, List, Optional, Tuple
 from playwright.async_api import (BrowserContext, BrowserType, Page,
                                  async_playwright)
 import config
 from base.base_crawler import AbstractCrawler
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import bilibili as bilibili_store
 from tools import utils
 from var import crawler_type_var
 from .client import BilibiliClient
 from .exception import DataFetchError
 from .field import SearchOrderType
 from .login import BilibiliLogin
 class BilibiliCrawler(AbstractCrawler):
    context_page: Page
    bili_client: BilibiliClient
    browser_context: BrowserContext
    def __init__(self):
        self.index_url = "https://www.bilibili.com"
        self.user_agent = utils.get_user_agent()
    async def start(self):
        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
            playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(
                ip_proxy_info)
        async with async_playwright() as playwright:
            # Launch a browser context.
            chromium = playwright.chromium
            self.browser_context = await self.launch_browser(
                chromium,
                None,
                self.user_agent,
                headless=config.HEADLESS
            )
            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")
            self.context_page = await self.browser_context.new_page()
            await self.context_page.goto(self.index_url)
            # Create a client to interact with the bilibili website.
            self.bili_client = await self.create_bilibili_client(httpx_proxy_format)
            if not await self.bili_client.pong():
                login_obj = BilibiliLogin(
                    login_type=config.LOGIN_TYPE,
                    login_phone="",  # your phone number
                    browser_context=self.browser_context,
                    context_page=self.context_page,
                    cookie_str=config.COOKIES
                )
                await login_obj.begin()
                await self.bili_client.update_cookies(browser_context=self.browser_context)
            crawler_type_var.set(config.CRAWLER_TYPE)
            if config.CRAWLER_TYPE == "search":
                # Search for videos and retrieve their comment information.
                await self.search()
            elif config.CRAWLER_TYPE == "detail":
                # Get the information and comments of the specified videos
                await self.get_specified_videos(config.BILI_SPECIFIED_ID_LIST)
            elif config.CRAWLER_TYPE == "creator":
                for creator_id in config.BILI_CREATOR_ID_LIST:
                    await self.get_creator_videos(int(creator_id))
            else:
                pass
            utils.logger.info(
                "[BilibiliCrawler.start] Bilibili Crawler finished ...")
    async def search(self):
        """
        search bilibili video with keywords
        :return:
        """
        utils.logger.info(
            "[BilibiliCrawler.search] Begin search bilibili keywords")
        bili_limit_count = 20  # bilibili's fixed page size
        if config.CRAWLER_MAX_NOTES_COUNT < bili_limit_count:
            config.CRAWLER_MAX_NOTES_COUNT = bili_limit_count
        start_page = config.START_PAGE  # start page number
        for keyword in config.KEYWORDS.split(","):
            utils.logger.info(
                f"[BilibiliCrawler.search] Current search keyword: {keyword}")
            page = 1
            while (page - start_page + 1) * bili_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
                if page < start_page:
                    utils.logger.info(
                        f"[BilibiliCrawler.search] Skip page: {page}")
                    page += 1
                    continue
                utils.logger.info(f"[BilibiliCrawler.search] search bilibili keyword: {keyword}, page: {page}")
                video_id_list: List[str] = []
                videos_res = await self.bili_client.search_video_by_keyword(
                    keyword=keyword,
                    page=page,
                    page_size=bili_limit_count,
                    order=SearchOrderType.DEFAULT,
                )
                video_list: List[Dict] = videos_res.get("result")
                semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
                task_list = [
                    self.get_video_info_task(aid=video_item.get(
                        "aid"), bvid="", semaphore=semaphore)
                    for video_item in video_list
                ]
                video_items = await asyncio.gather(*task_list)
                for video_item in video_items:
                    if video_item:
                        video_id_list.append(video_item.get("View").get("aid"))
                        await bilibili_store.update_bilibili_video(video_item)
                page += 1
                await self.batch_get_video_comments(video_id_list)
    async def batch_get_video_comments(self, video_id_list: List[str]):
        """
        batch get video comments
        :param video_id_list:
        :return:
        """
        if not config.ENABLE_GET_COMMENTS:
            utils.logger.info(
                "[BilibiliCrawler.batch_get_video_comments] Crawling comment mode is not enabled")
            return
        utils.logger.info(
            f"[BilibiliCrawler.batch_get_video_comments] video ids:{video_id_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for video_id in video_id_list:
            task = asyncio.create_task(self.get_comments(
                video_id, semaphore), name=video_id)
            task_list.append(task)
        await asyncio.gather(*task_list)
    async def get_comments(self, video_id: str, semaphore: asyncio.Semaphore):
        """
        get comment for video id
        :param video_id:
        :param semaphore:
        :return:
        """
        async with semaphore:
            try:
                utils.logger.info(
                    f"[BilibiliCrawler.get_comments] begin get video_id: {video_id} comments ...")
                await self.bili_client.get_video_all_comments(
                    video_id=video_id,
                    crawl_interval=random.random(),
                    is_fetch_sub_comments=config.ENABLE_GET_SUB_COMMENTS,
                    callback=bilibili_store.batch_update_bilibili_video_comments
                )
            except DataFetchError as ex:
                utils.logger.error(
                    f"[BilibiliCrawler.get_comments] get video_id: {video_id} comment error: {ex}")
            except Exception as e:
                utils.logger.error(
                    f"[BilibiliCrawler.get_comments] may have been blocked, err: {e}")
    async def get_creator_videos(self, creator_id: int):
        """
        get videos for a creator
        :return:
        """
        ps = 30
        pn = 1
        video_bvids_list = []
        while True:
            result = await self.bili_client.get_creator_videos(creator_id, pn, ps)
            for video in result["list"]["vlist"]:
                video_bvids_list.append(video["bvid"])
            if (int(result["page"]["count"]) <= pn * ps):
                break
            await asyncio.sleep(random.random())
            pn += 1
        await self.get_specified_videos(video_bvids_list)
    async def get_specified_videos(self, bvids_list: List[str]):
        """
        get specified videos info
        :return:
        """
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_video_info_task(aid=0, bvid=video_id, semaphore=semaphore) for video_id in
            bvids_list
        ]
        video_details = await asyncio.gather(*task_list)
        video_aids_list = []
        for video_detail in video_details:
            if video_detail is not None:
                video_item_view: Dict = video_detail.get("View")
                video_aid: str = video_item_view.get("aid")
                if video_aid:
                    video_aids_list.append(video_aid)
                await bilibili_store.update_bilibili_video(video_detail)
        await self.batch_get_video_comments(video_aids_list)
    async def get_video_info_task(self, aid: int, bvid: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
        """
        Get video detail task
        :param aid:
        :param bvid:
        :param semaphore:
        :return:
        """
        async with semaphore:
            try:
                result = await self.bili_client.get_video_info(aid=aid, bvid=bvid)
                return result
            except DataFetchError as ex:
                utils.logger.error(
                    f"[BilibiliCrawler.get_video_info_task] Get video detail error: {ex}")
                return None
            except KeyError as ex:
                utils.logger.error(
                    f"[BilibiliCrawler.get_video_info_task] video detail not found, video_id: {bvid}, err: {ex}")
                return None
    async def create_bilibili_client(self, httpx_proxy: Optional[str]) -> BilibiliClient:
        """Create bilibili client"""
        utils.logger.info(
            "[BilibiliCrawler.create_bilibili_client] Begin create bilibili API client ...")
        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
        bilibili_client_obj = BilibiliClient(
            proxies=httpx_proxy,
            headers={
                "User-Agent": self.user_agent,
                "Cookie": cookie_str,
                "Origin": "https://www.bilibili.com",
                "Referer": "https://www.bilibili.com",
                "Content-Type": "application/json;charset=UTF-8"
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
        )
        return bilibili_client_obj
    @staticmethod
    def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
        """format proxy info for playwright and httpx"""
        playwright_proxy = {
            "server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
            "username": ip_proxy_info.user,
            "password": ip_proxy_info.password,
        }
        httpx_proxy = {
            f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
        }
        return playwright_proxy, httpx_proxy
    async def launch_browser(
            self,
            chromium: BrowserType,
            playwright_proxy: Optional[Dict],
            user_agent: Optional[str],
            headless: bool = True
    ) -> BrowserContext:
        """Launch browser and create browser context"""
        utils.logger.info(
            "[BilibiliCrawler.launch_browser] Begin create browser context ...")
        if config.SAVE_LOGIN_STATE:
            # feat issue #14
            # we will save login state to avoid login every time
            user_data_dir = os.path.join(os.getcwd(), "browser_data",
                                         config.USER_DATA_DIR % config.PLATFORM)  # type: ignore
            browser_context = await chromium.launch_persistent_context(
                user_data_dir=user_data_dir,
                accept_downloads=True,
                headless=headless,
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
        else:
            # type: ignore
            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
--- a/media_platform/bilibili/exception.py
+++ b/media_platform/bilibili/exception.py
@ -0,0 +1,14 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 18:44
 # @Desc    :
 from httpx import RequestError
 class DataFetchError(RequestError):
    """Raised when fetching data fails."""
 class IPBlockError(RequestError):
    """Raised when requests are sent so fast that the server blocks our IP."""
--- a/media_platform/bilibili/field.py
+++ b/media_platform/bilibili/field.py
@ -0,0 +1,34 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/3 16:20
 # @Desc    :
 from enum import Enum
 class SearchOrderType(Enum):
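    # Default (comprehensive) ranking
    DEFAULT = ""
    # Most clicks
    MOST_CLICK = "click"
    # Most recently published
    LAST_PUBLISH = "pubdate"
    # Most danmaku (bullet comments)
    MOST_DANMU = "dm"
    # Most favorites
    MOST_MARK = "stow"
 class CommentOrderType(Enum):
    # By popularity only
    DEFAULT = 0
    # By popularity and time
    MIXED = 1
    # By time
    TIME = 2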
--- a/media_platform/bilibili/help.py
+++ b/media_platform/bilibili/help.py
@ -0,0 +1,70 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 23:26
 # @Desc    : bilibili request parameter signing
 # Reverse-engineering reference: https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95
 import urllib.parse
 from hashlib import md5
 from typing import Dict
 from tools import utils
 class BilibiliSign:
    def __init__(self, img_key: str, sub_key: str):
        self.img_key = img_key
        self.sub_key = sub_key
        self.map_table = [
            46, 47, 18, 2, 53, 8, 23, 32, 15, 50, 10, 31, 58, 3, 45, 35, 27, 43, 5, 49,
            33, 9, 42, 19, 29, 28, 14, 39, 12, 38, 41, 13, 37, 48, 7, 16, 24, 55, 40,
            61, 26, 17, 0, 1, 60, 51, 30, 4, 22, 25, 54, 21, 56, 59, 6, 63, 57, 62, 11,
            36, 20, 34, 44, 52
        ]
    def get_salt(self) -> str:
        """
        Get the salted key
        :return:
        """
        salt = ""
        mixin_key = self.img_key + self.sub_key
        for mt in self.map_table:
            salt += mixin_key[mt]
        return salt[:32]
    def sign(self, req_data: Dict) -> Dict:
        """
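        Add the current timestamp to the request parameters and sort the keys in lexicographic order,
        then URL-encode the parameters, append the salt, and take the MD5 digest to produce the w_rid parameter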
        :param req_data:
        :return:
        """
        current_ts = utils.get_unix_timestamp()
        req_data.update({"wts": current_ts})
        req_data = dict(sorted(req_data.items()))
        req_data = {
            # strip the characters "!'()*" from each value
            k: ''.join(filter(lambda ch: ch not in "!'()*", str(v)))
            for k, v
            in req_data.items()
        }
        query = urllib.parse.urlencode(req_data)
        salt = self.get_salt()
        wbi_sign = md5((query + salt).encode()).hexdigest()  # compute w_rid
        req_data['w_rid'] = wbi_sign
        return req_data
 if __name__ == '__main__':
    _img_key = "7cd084941338484aae1ad9425b84077c"
    _sub_key = "4932caff0ff746eab6f01bf08b70ac45"
    _search_url = "__refresh__=true&_extra=&ad_resource=5654&category_id=&context=&dynamic_offset=0&from_source=&from_spmid=333.337&gaia_vtoken=&highlight=1&keyword=python&order=click&page=1&page_size=20&platform=pc&qv_id=OQ8f2qtgYdBV1UoEnqXUNUl8LEDAdzsD&search_type=video&single_column=0&source_tag=3&web_location=1430654"
    _req_data = dict()
    for params in _search_url.split("&"):
        kvalues = params.split("=")
        key = kvalues[0]
        value = kvalues[1]
        _req_data[key] = value
    print("pre req_data", _req_data)
    _req_data = BilibiliSign(img_key=_img_key, sub_key=_sub_key).sign(req_data={"aid":170001})
    print(_req_data)
--- a/media_platform/bilibili/login.py
+++ b/media_platform/bilibili/login.py
@ -0,0 +1,107 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 18:44
 # @Desc    : bilibili login implementation class
 import asyncio
 import functools
 import sys
 from typing import Optional
 from playwright.async_api import BrowserContext, Page
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                      wait_fixed)
 import config
 from base.base_crawler import AbstractLogin
 from tools import utils
 class BilibiliLogin(AbstractLogin):
    def __init__(self,
                 login_type: str,
                 browser_context: BrowserContext,
                 context_page: Page,
                 login_phone: Optional[str] = "",
                 cookie_str: str = ""
                 ):
        config.LOGIN_TYPE = login_type
        self.browser_context = browser_context
        self.context_page = context_page
        self.login_phone = login_phone
        self.cookie_str = cookie_str
    async def begin(self):
        """Start login bilibili"""
        utils.logger.info("[BilibiliLogin.begin] Begin login Bilibili ...")
        if config.LOGIN_TYPE == "qrcode":
            await self.login_by_qrcode()
        elif config.LOGIN_TYPE == "phone":
            await self.login_by_mobile()
        elif config.LOGIN_TYPE == "cookie":
            await self.login_by_cookies()
        else:
            raise ValueError(
                "[BilibiliLogin.begin] Invalid login type. Currently only qrcode, phone, or cookie is supported ...")
    @retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
    async def check_login_state(self) -> bool:
        """
            Check whether login has succeeded: return True if so, otherwise False.
            The retry decorator retries up to 600 times while the return value is False, waiting 1 second between attempts;
            once the maximum number of attempts is reached, RetryError is raised.
        """
        current_cookie = await self.browser_context.cookies()
        _, cookie_dict = utils.convert_cookies(current_cookie)
        if cookie_dict.get("SESSDATA", "") or cookie_dict.get("DedeUserID"):
            return True
        return False
    async def login_by_qrcode(self):
        """login bilibili website and keep webdriver login state"""
        utils.logger.info("[BilibiliLogin.login_by_qrcode] Begin login bilibili by qrcode ...")
        # click login button
        login_button_ele = self.context_page.locator(
            "xpath=//div[@class='right-entry__outside go-login-btn']//div"
        )
        await login_button_ele.click()
        # find login qrcode
        qrcode_img_selector = "//div[@class='login-scan-box']//img"
        base64_qrcode_img = await utils.find_login_qrcode(
            self.context_page,
            selector=qrcode_img_selector
        )
        if not base64_qrcode_img:
            utils.logger.info("[BilibiliLogin.login_by_qrcode] login failed, QR code not found, please check ...")
            sys.exit()
        # show login qrcode
        partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
        asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
        utils.logger.info("[BilibiliLogin.login_by_qrcode] Waiting for QR code scan login, timeout is 600s")
        try:
            await self.check_login_state()
        except RetryError:
            utils.logger.info("[BilibiliLogin.login_by_qrcode] Login bilibili failed by qrcode login method ...")
            sys.exit()
        wait_redirect_seconds = 5
        utils.logger.info(
            f"[BilibiliLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
        await asyncio.sleep(wait_redirect_seconds)
    async def login_by_mobile(self):
        pass
    async def login_by_cookies(self):
        utils.logger.info("[BilibiliLogin.login_by_cookies] Begin login bilibili by cookie ...")
        for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
            await self.browser_context.add_cookies([{
                'name': key,
                'value': value,
                'domain': ".bilibili.com",
                'path': "/"
            }])
--- a/media_platform/douyin/__init__.py
+++ b/media_platform/douyin/__init__.py
@ -0,0 +1 @@
 from .core import DouYinCrawler
--- a/media_platform/douyin/client.py
+++ b/media_platform/douyin/client.py
@ -0,0 +1,280 @@
 import asyncio
 import copy
 import json
 import urllib.parse
 from typing import Any, Callable, Dict, List, Optional
 import execjs
 import httpx
 from playwright.async_api import BrowserContext, Page
 from base.base_crawler import AbstractApiClient
 from tools import utils
 from var import request_keyword_var
 from .exception import *
 from .field import *
 class DOUYINClient(AbstractApiClient):
    def __init__(
            self,
            timeout=30,
            proxies=None,
            *,
            headers: Dict,
            playwright_page: Optional[Page],
            cookie_dict: Dict
    ):
        self.proxies = proxies
        self.timeout = timeout
        self.headers = headers
        self._host = "https://www.douyin.com"
        self.playwright_page = playwright_page
        self.cookie_dict = cookie_dict
    async def __process_req_params(self, params: Optional[Dict] = None, headers: Optional[Dict] = None):
        if not params:
            return
        headers = headers or self.headers
        local_storage: Dict = await self.playwright_page.evaluate("() => window.localStorage")  # type: ignore
        douyin_js_obj = execjs.compile(open('libs/douyin.js').read())
        common_params = {
            "device_platform": "webapp",
            "aid": "6383",
            "channel": "channel_pc_web",
            "cookie_enabled": "true",
            "browser_language": "zh-CN",
            "browser_platform": "Win32",
            "browser_name": "Firefox",
            "browser_version": "110.0",
            "browser_online": "true",
            "engine_name": "Gecko",
            "os_name": "Windows",
            "os_version": "10",
            "engine_version": "109.0",
            "platform": "PC",
            "screen_width": "1920",
            "screen_height": "1200",
            # " webid": douyin_js_obj.call("get_web_id"),
            # "msToken": local_storage.get("xmst"),
            # "msToken": "abL8SeUTPa9-EToD8qfC7toScSADxpg6yLh2dbNcpWHzE0bT04txM_4UwquIcRvkRb9IU8sifwgM1Kwf1Lsld81o9Irt2_yNyUbbQPSUO8EfVlZJ_78FckDFnwVBVUVK",
        }
        params.update(common_params)
        query = '&'.join([f'{k}={v}' for k, v in params.items()])
        x_bogus = douyin_js_obj.call('sign', query, headers["User-Agent"])
        params["X-Bogus"] = x_bogus
        # print(x_bogus, query)
    async def request(self, method, url, **kwargs):
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request(
                method, url, timeout=self.timeout,
                **kwargs
            )
            try:
                return response.json()
            except Exception as e:
                raise DataFetchError(f"{e}, {response.text}")
    async def get(self, uri: str, params: Optional[Dict] = None, headers: Optional[Dict] = None):
        await self.__process_req_params(params, headers)
        headers = headers or self.headers
        return await self.request(method="GET", url=f"{self._host}{uri}", params=params, headers=headers)
    async def post(self, uri: str, data: dict, headers: Optional[Dict] = None):
        await self.__process_req_params(data, headers)
        headers = headers or self.headers
        return await self.request(method="POST", url=f"{self._host}{uri}", data=data, headers=headers)
    async def pong(self, browser_context: BrowserContext) -> bool:
        local_storage = await self.playwright_page.evaluate("() => window.localStorage")
        if local_storage.get("HasUserLogin", "") == "1":
            return True
        _, cookie_dict = utils.convert_cookies(await browser_context.cookies())
        return cookie_dict.get("LOGIN_STATUS") == "1"
    async def update_cookies(self, browser_context: BrowserContext):
        cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
        self.headers["Cookie"] = cookie_str
        self.cookie_dict = cookie_dict
    async def search_info_by_keyword(
            self,
            keyword: str,
            offset: int = 0,
            search_channel: SearchChannelType = SearchChannelType.GENERAL,
            sort_type: SearchSortType = SearchSortType.GENERAL,
            publish_time: PublishTimeType = PublishTimeType.UNLIMITED
    ):
        """
        DouYin Web Search API
        :param keyword:
        :param offset:
        :param search_channel:
        :param sort_type:
        :param publish_time:
        :return:
        """
        params = {
            "keyword": urllib.parse.quote(keyword),
            "search_channel": search_channel.value,
            "search_source": "normal_search",
            "query_correct_type": 1,
            "is_filter_search": 0,
            "offset": offset,
            "count": 10  # must be set to 10
        }
        if sort_type != SearchSortType.GENERAL or publish_time != PublishTimeType.UNLIMITED:
            params["filter_selected"] = urllib.parse.quote(json.dumps({
                "sort_type": str(sort_type.value),
                "publish_time": str(publish_time.value)
            }))
            params["is_filter_search"] = 1
            params["search_source"] = "tab_search"
        referer_url = "https://www.douyin.com/search/" + keyword
        referer_url += f"?publish_time={publish_time.value}&sort_type={sort_type.value}&type=general"
        headers = copy.copy(self.headers)
        headers["Referer"] = urllib.parse.quote(referer_url, safe=':/')
        return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
    async def get_video_by_id(self, aweme_id: str) -> Any:
        """
        DouYin Video Detail API
        :param aweme_id:
        :return:
        """
        params = {
            "aweme_id": aweme_id
        }
        headers = copy.copy(self.headers)
        # headers["Cookie"] = "s_v_web_id=verify_lol4a8dv_wpQ1QMyP_xemd_4wON_8Yzr_FJa8DN1vdY2m;"
        del headers["Origin"]
        res = await self.get("/aweme/v1/web/aweme/detail/", params, headers)
        return res.get("aweme_detail", {})
    async def get_aweme_comments(self, aweme_id: str, cursor: int = 0):
        """get note comments
        """
        uri = "/aweme/v1/web/comment/list/"
        params = {
            "aweme_id": aweme_id,
            "cursor": cursor,
            "count": 20,
            "item_type": 0
        }
        keywords = request_keyword_var.get()
        referer_url = "https://www.douyin.com/search/" + keywords + '?aid=3a3cec5a-9e27-4040-b6aa-ef548c2c1138&publish_time=0&sort_type=0&source=search_history&type=general'
        headers = copy.copy(self.headers)
        headers["Referer"] = urllib.parse.quote(referer_url, safe=':/')
        return await self.get(uri, params, headers=headers)
    async def get_sub_comments(self, comment_id: str, cursor: int = 0):
        """
            Get sub-comments
        """
        uri = "/aweme/v1/web/comment/list/reply/"
        params = {
            'comment_id': comment_id,
            "cursor": cursor,
            "count": 20,
            "item_type": 0,
        }
        keywords = request_keyword_var.get()
        referer_url = "https://www.douyin.com/search/" + keywords + '?aid=3a3cec5a-9e27-4040-b6aa-ef548c2c1138&publish_time=0&sort_type=0&source=search_history&type=general'
        headers = copy.copy(self.headers)
        headers["Referer"] = urllib.parse.quote(referer_url, safe=':/')
        return await self.get(uri, params, headers=headers)
    async def get_aweme_all_comments(
            self,
            aweme_id: str,
            crawl_interval: float = 1.0,
            is_fetch_sub_comments=False,
            callback: Optional[Callable] = None,
    ):
        """
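        Fetch all comments on a post, including sub-comments
        :param aweme_id: post ID
        :param crawl_interval: interval between requests
        :param is_fetch_sub_comments: whether to fetch sub-comments
        :param callback: callback used to process the fetched comments
        :return: list of comments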
        """
        result = []
        comments_has_more = 1
        comments_cursor = 0
        while comments_has_more:
            comments_res = await self.get_aweme_comments(aweme_id, comments_cursor)
            comments_has_more = comments_res.get("has_more", 0)
            comments_cursor = comments_res.get("cursor", 0)
            comments = comments_res.get("comments", [])
            if not comments:
                continue
            result.extend(comments)
            if callback:  # invoke the callback if one was provided
                await callback(aweme_id, comments)
            await asyncio.sleep(crawl_interval)
            if not is_fetch_sub_comments:
                continue
            # Fetch second-level comments (replies)
            for comment in comments:
                reply_comment_total = comment.get("reply_comment_total") or 0
                if reply_comment_total > 0:
                    comment_id = comment.get("cid")
                    sub_comments_has_more = 1
                    sub_comments_cursor = 0
                    while sub_comments_has_more:
                        sub_comments_res = await self.get_sub_comments(comment_id, sub_comments_cursor)
                        sub_comments_has_more = sub_comments_res.get("has_more", 0)
                        sub_comments_cursor = sub_comments_res.get("cursor", 0)
                        sub_comments = sub_comments_res.get("comments", [])
                        if not sub_comments:
                            continue
                        result.extend(sub_comments)
                        if callback:  # invoke the callback if one was provided
                            await callback(aweme_id, sub_comments)
                        await asyncio.sleep(crawl_interval)
        return result
    async def get_user_info(self, sec_user_id: str):
        uri = "/aweme/v1/web/user/profile/other/"
        params = {
            "sec_user_id": sec_user_id,
            "publish_video_strategy_type": 2,
            "personal_center_strategy": 1,
        }
        return await self.get(uri, params)
    async def get_user_aweme_posts(self, sec_user_id: str, max_cursor: str = "") -> Dict:
        uri = "/aweme/v1/web/aweme/post/"
        params = {
            "sec_user_id": sec_user_id,
            "count": 18,
            "max_cursor": max_cursor,
            "locate_query": "false",
            "publish_video_strategy_type": 2
        }
        return await self.get(uri, params)
    async def get_all_user_aweme_posts(self, sec_user_id: str, callback: Optional[Callable] = None):
        posts_has_more = 1
        max_cursor = ""
        result = []
        while posts_has_more == 1:
            aweme_post_res = await self.get_user_aweme_posts(sec_user_id, max_cursor)
            posts_has_more = aweme_post_res.get("has_more", 0)
            max_cursor = aweme_post_res.get("max_cursor")
            aweme_list = aweme_post_res.get("aweme_list") if aweme_post_res.get("aweme_list") else []
            utils.logger.info(
                f"[DOUYINClient.get_all_user_aweme_posts] got sec_user_id:{sec_user_id} video len : {len(aweme_list)}")
            if callback:
                await callback(aweme_list)
            result.extend(aweme_list)
        return result
--- a/media_platform/douyin/core.py
+++ b/media_platform/douyin/core.py
@ -0,0 +1,271 @@
 import asyncio
 import os
 import random
 from asyncio import Task
 from typing import Any, Dict, List, Optional, Tuple
 from playwright.async_api import (BrowserContext, BrowserType, Page,
                                  async_playwright)
 import config
 from base.base_crawler import AbstractCrawler
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import douyin as douyin_store
 from tools import utils
 from var import crawler_type_var
 from .client import DOUYINClient
 from .exception import DataFetchError
 from .field import PublishTimeType
 from .login import DouYinLogin
 class DouYinCrawler(AbstractCrawler):
    context_page: Page
    dy_client: DOUYINClient
    browser_context: BrowserContext
    def __init__(self) -> None:
        self.user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"  # fixed
        self.index_url = "https://www.douyin.com"
    async def start(self) -> None:
        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
            playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
        async with async_playwright() as playwright:
            # Launch a browser context.
            chromium = playwright.chromium
            self.browser_context = await self.launch_browser(
                chromium,
                None,
                self.user_agent,
                headless=config.HEADLESS
            )
            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")
            self.context_page = await self.browser_context.new_page()
            await self.context_page.goto(self.index_url)
            self.dy_client = await self.create_douyin_client(httpx_proxy_format)
            if not await self.dy_client.pong(browser_context=self.browser_context):
                login_obj = DouYinLogin(
                    login_type=config.LOGIN_TYPE,
                    login_phone="",  # your phone number
                    browser_context=self.browser_context,
                    context_page=self.context_page,
                    cookie_str=config.COOKIES
                )
                await login_obj.begin()
                await self.dy_client.update_cookies(browser_context=self.browser_context)
            crawler_type_var.set(config.CRAWLER_TYPE)
            if config.CRAWLER_TYPE == "search":
                # Search for notes and retrieve their comment information.
                await self.search()
            elif config.CRAWLER_TYPE == "detail":
                # Get the information and comments of the specified post
                await self.get_specified_awemes()
            elif config.CRAWLER_TYPE == "creator":
                # Get the information and comments of the specified creator
                await self.get_creators_and_videos()
            utils.logger.info("[DouYinCrawler.start] Douyin Crawler finished ...")
    async def search(self) -> None:
        utils.logger.info("[DouYinCrawler.search] Begin search douyin keywords")
        dy_limit_count = 10  # Douyin's fixed page size for search results
        if config.CRAWLER_MAX_NOTES_COUNT < dy_limit_count:
            config.CRAWLER_MAX_NOTES_COUNT = dy_limit_count
        start_page = config.START_PAGE  # start page number
        for keyword in config.KEYWORDS.split(","):
            utils.logger.info(f"[DouYinCrawler.search] Current keyword: {keyword}")
            aweme_list: List[str] = []
            page = 0
            while (page - start_page + 1) * dy_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
                if page < start_page:
                    utils.logger.info(f"[DouYinCrawler.search] Skip {page}")
                    page += 1
                    continue
                try:
                    utils.logger.info(f"[DouYinCrawler.search] search douyin keyword: {keyword}, page: {page}")
                    posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
                                                                            offset=page * dy_limit_count - dy_limit_count,
                                                                            publish_time=PublishTimeType(config.PUBLISH_TIME_TYPE)
                                                                            )
                except DataFetchError:
                    utils.logger.error(f"[DouYinCrawler.search] search douyin keyword: {keyword} failed")
                    break
                page += 1
                if "data" not in posts_res:
                    utils.logger.error(
                        f"[DouYinCrawler.search] search douyin keyword: {keyword} failed; the account may have been flagged by risk control.")
                    break
                for post_item in posts_res.get("data"):
                    try:
                        aweme_info: Dict = post_item.get("aweme_info") or \
                                           post_item.get("aweme_mix_info", {}).get("mix_items")[0]
                    except (TypeError, IndexError):  # mix_items may be None or an empty list
                        continue
                    aweme_list.append(aweme_info.get("aweme_id", ""))
                    await douyin_store.update_douyin_aweme(aweme_item=aweme_info)
            utils.logger.info(f"[DouYinCrawler.search] keyword:{keyword}, aweme_list:{aweme_list}")
            await self.batch_get_note_comments(aweme_list)
    async def get_specified_awemes(self):
        """Get the information and comments of the specified post"""
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_aweme_detail(aweme_id=aweme_id, semaphore=semaphore) for aweme_id in config.DY_SPECIFIED_ID_LIST
        ]
        aweme_details = await asyncio.gather(*task_list)
        for aweme_detail in aweme_details:
            if aweme_detail is not None:
                await douyin_store.update_douyin_aweme(aweme_detail)
        await self.batch_get_note_comments(config.DY_SPECIFIED_ID_LIST)
    async def get_aweme_detail(self, aweme_id: str, semaphore: asyncio.Semaphore) -> Any:
        """Get note detail"""
        async with semaphore:
            try:
                return await self.dy_client.get_video_by_id(aweme_id)
            except DataFetchError as ex:
                utils.logger.error(f"[DouYinCrawler.get_aweme_detail] Get aweme detail error: {ex}")
                return None
            except KeyError as ex:
                utils.logger.error(
                    f"[DouYinCrawler.get_aweme_detail] note detail not found, aweme_id: {aweme_id}, err: {ex}")
                return None
    async def batch_get_note_comments(self, aweme_list: List[str]) -> None:
        """
        Batch get note comments
        """
        if not config.ENABLE_GET_COMMENTS:
            utils.logger.info(f"[DouYinCrawler.batch_get_note_comments] Crawling comment mode is not enabled")
            return
        task_list: List[Task] = []
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        for aweme_id in aweme_list:
            task = asyncio.create_task(
                self.get_comments(aweme_id, semaphore), name=aweme_id)
            task_list.append(task)
        if len(task_list) > 0:
            await asyncio.wait(task_list)
    async def get_comments(self, aweme_id: str, semaphore: asyncio.Semaphore) -> None:
        async with semaphore:
            try:
                # Fetch all comments for this aweme and persist each batch through the store callback
                await self.dy_client.get_aweme_all_comments(
                    aweme_id=aweme_id,
                    crawl_interval=random.random(),
                    is_fetch_sub_comments=config.ENABLE_GET_SUB_COMMENTS,
                    callback=douyin_store.batch_update_dy_aweme_comments
                )
                utils.logger.info(
                    f"[DouYinCrawler.get_comments] aweme_id: {aweme_id} comments have all been obtained and filtered ...")
            except DataFetchError as e:
                utils.logger.error(f"[DouYinCrawler.get_comments] aweme_id: {aweme_id} get comments failed, error: {e}")
    async def get_creators_and_videos(self) -> None:
        """
        Get the information and videos of the specified creator
        """
        utils.logger.info("[DouYinCrawler.get_creators_and_videos] Begin get douyin creators")
        for user_id in config.DY_CREATOR_ID_LIST:
            creator_info: Dict = await self.dy_client.get_user_info(user_id)
            if creator_info:
                await douyin_store.save_creator(user_id, creator=creator_info)
            # Get all video information of the creator
            all_video_list = await self.dy_client.get_all_user_aweme_posts(
                sec_user_id=user_id,
                callback=self.fetch_creator_video_detail
            )
            video_ids = [video_item.get("aweme_id") for video_item in all_video_list]
            await self.batch_get_note_comments(video_ids)
    async def fetch_creator_video_detail(self, video_list: List[Dict]):
        """
        Concurrently obtain the specified post list and save the data
        """
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_aweme_detail(post_item.get("aweme_id"), semaphore) for post_item in video_list
        ]
        note_details = await asyncio.gather(*task_list)
        for aweme_item in note_details:
            if aweme_item is not None:
                await douyin_store.update_douyin_aweme(aweme_item)
    @staticmethod
    def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
        """format proxy info for playwright and httpx"""
        playwright_proxy = {
            "server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
            "username": ip_proxy_info.user,
            "password": ip_proxy_info.password,
        }
        httpx_proxy = {
            f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
        }
        return playwright_proxy, httpx_proxy
    async def create_douyin_client(self, httpx_proxy: Optional[Dict]) -> DOUYINClient:
        """Create douyin client"""
        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())  # type: ignore
        douyin_client = DOUYINClient(
            proxies=httpx_proxy,
            headers={
                "User-Agent": self.user_agent,
                "Cookie": cookie_str,
                "Host": "www.douyin.com",
                "Origin": "https://www.douyin.com/",
                "Referer": "https://www.douyin.com/",
                "Content-Type": "application/json;charset=UTF-8"
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
        )
        return douyin_client
    async def launch_browser(
            self,
            chromium: BrowserType,
            playwright_proxy: Optional[Dict],
            user_agent: Optional[str],
            headless: bool = True
    ) -> BrowserContext:
        """Launch browser and create browser context"""
        if config.SAVE_LOGIN_STATE:
            user_data_dir = os.path.join(os.getcwd(), "browser_data",
                                         config.USER_DATA_DIR % config.PLATFORM)  # type: ignore
            browser_context = await chromium.launch_persistent_context(
                user_data_dir=user_data_dir,
                accept_downloads=True,
                headless=headless,
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )  # type: ignore
            return browser_context
        else:
            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
    async def close(self) -> None:
        """Close browser context"""
        await self.browser_context.close()
        utils.logger.info("[DouYinCrawler.close] Browser context closed ...")
--- a/media_platform/douyin/exception.py
+++ b/media_platform/douyin/exception.py
@ -0,0 +1,9 @@
 from httpx import RequestError
 class DataFetchError(RequestError):
    """Raised when fetching data fails"""
 class IPBlockError(RequestError):
    """Raised when requests are sent so fast that the server blocks our IP"""
--- a/media_platform/douyin/field.py
+++ b/media_platform/douyin/field.py
@ -0,0 +1,23 @@
 from enum import Enum
 class SearchChannelType(Enum):
    """search channel type"""
    GENERAL = "aweme_general"  # general
    VIDEO = "aweme_video_web"  # video
    USER = "aweme_user_web"  # user
    LIVE = "aweme_live"  # live stream
 class SearchSortType(Enum):
    """search sort type"""
    GENERAL = 0  # general ranking
    MOST_LIKE = 1  # most likes
    LATEST = 2  # most recently published
 class PublishTimeType(Enum):
    """publish time type"""
    UNLIMITED = 0  # no limit
    ONE_DAY = 1  # within one day
    ONE_WEEK = 7  # within one week
    SIX_MONTH = 180  # within six months
--- a/media_platform/douyin/login.py
+++ b/media_platform/douyin/login.py
@ -0,0 +1,254 @@
 import asyncio
 import functools
 import sys
 from typing import Optional
 from playwright.async_api import BrowserContext, Page
 from playwright.async_api import TimeoutError as PlaywrightTimeoutError
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                      wait_fixed)
 import config
 from base.base_crawler import AbstractLogin
 from cache.cache_factory import CacheFactory
 from tools import utils
 class DouYinLogin(AbstractLogin):
    def __init__(self,
                 login_type: str,
                 browser_context: BrowserContext, # type: ignore
                 context_page: Page, # type: ignore
                 login_phone: Optional[str] = "",
                 cookie_str: Optional[str] = ""
                 ):
        config.LOGIN_TYPE = login_type
        self.browser_context = browser_context
        self.context_page = context_page
        self.login_phone = login_phone
        self.scan_qrcode_time = 60
        self.cookie_str = cookie_str
    async def begin(self):
        """
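            Start logging in to the Douyin website.
            The slider verification on the intermediate captcha page is not very accurate. Unless you have special requirements, it is recommended to skip Douyin login or to log in with cookies.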
        """
        # popup login dialog
        await self.popup_login_dialog()
        # select login type
        if config.LOGIN_TYPE == "qrcode":
            await self.login_by_qrcode()
        elif config.LOGIN_TYPE == "phone":
            await self.login_by_mobile()
        elif config.LOGIN_TYPE == "cookie":
            await self.login_by_cookies()
        else:
            raise ValueError("[DouYinLogin.begin] Invalid login type; only qrcode, phone, or cookie are currently supported ...")
        # If the page redirects to the slider captcha page, the slider must be solved again
        await asyncio.sleep(6)
        current_page_title = await self.context_page.title()
        if "验证码中间页" in current_page_title:
            await self.check_page_display_slider(move_step=3, slider_level="hard")
        # check login state
        utils.logger.info(f"[DouYinLogin.begin] login finished then check login state ...")
        try:
            await self.check_login_state()
        except RetryError:
            utils.logger.info("[DouYinLogin.begin] login failed please confirm ...")
            sys.exit()
        # wait for redirect
        wait_redirect_seconds = 5
        utils.logger.info(f"[DouYinLogin.begin] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
        await asyncio.sleep(wait_redirect_seconds)
    @retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
    async def check_login_state(self):
        """Return True if the current login state is successful, otherwise return False"""
        current_cookie = await self.browser_context.cookies()
        _, cookie_dict = utils.convert_cookies(current_cookie)
        for page in self.browser_context.pages:
            try:
                local_storage = await page.evaluate("() => window.localStorage")
                if local_storage.get("HasUserLogin", "") == "1":
                    return True
            except Exception as e:
                # utils.logger.warn(f"[DouYinLogin] check_login_state warning: {e}")
                await asyncio.sleep(0.1)
        if cookie_dict.get("LOGIN_STATUS") == "1":
            return True
        return False
    async def popup_login_dialog(self):
        """If the login dialog box does not pop up automatically, we will manually click the login button"""
        dialog_selector = "xpath=//div[@id='login-pannel']"
        try:
            # check dialog box is auto popup and wait for 10 seconds
            await self.context_page.wait_for_selector(dialog_selector, timeout=1000 * 10)
        except Exception as e:
            utils.logger.error(f"[DouYinLogin.popup_login_dialog] login dialog box does not pop up automatically, error: {e}")
            utils.logger.info("[DouYinLogin.popup_login_dialog] login dialog box does not pop up automatically, we will manually click the login button")
            login_button_ele = self.context_page.locator("xpath=//p[text() = '登录']")
            await login_button_ele.click()
            await asyncio.sleep(0.5)
    async def login_by_qrcode(self):
        utils.logger.info("[DouYinLogin.login_by_qrcode] Begin login douyin by qrcode...")
        qrcode_img_selector = "xpath=//article[@class='web-login']//img"
        base64_qrcode_img = await utils.find_login_qrcode(
            self.context_page,
            selector=qrcode_img_selector
        )
        if not base64_qrcode_img:
            utils.logger.info("[DouYinLogin.login_by_qrcode] login qrcode not found please confirm ...")
            sys.exit()
        partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
        asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
        await asyncio.sleep(2)
    async def login_by_mobile(self):
        utils.logger.info("[DouYinLogin.login_by_mobile] Begin login douyin by mobile ...")
        mobile_tap_ele = self.context_page.locator("xpath=//li[text() = '验证码登录']")
        await mobile_tap_ele.click()
        await self.context_page.wait_for_selector("xpath=//article[@class='web-login-mobile-code']")
        mobile_input_ele = self.context_page.locator("xpath=//input[@placeholder='手机号']")
        await mobile_input_ele.fill(self.login_phone)
        await asyncio.sleep(0.5)
        send_sms_code_btn = self.context_page.locator("xpath=//span[text() = '获取验证码']")
        await send_sms_code_btn.click()
        # Check whether a slider captcha appears
        await self.check_page_display_slider(move_step=10, slider_level="easy")
        cache_client = CacheFactory.create_cache(config.CACHE_TYPE_MEMORY)
        max_get_sms_code_time = 60 * 2  # wait at most 2 minutes for the SMS code
        while max_get_sms_code_time > 0:
            utils.logger.info(f"[DouYinLogin.login_by_mobile] get douyin sms code from cache, remaining time {max_get_sms_code_time}s ...")
            await asyncio.sleep(1)
            sms_code_key = f"dy_{self.login_phone}"
            sms_code_value = cache_client.get(sms_code_key)
            if not sms_code_value:
                max_get_sms_code_time -= 1
                continue
            sms_code_input_ele = self.context_page.locator("xpath=//input[@placeholder='请输入验证码']")
            await sms_code_input_ele.fill(value=sms_code_value.decode())
            await asyncio.sleep(0.5)
            submit_btn_ele = self.context_page.locator("xpath=//button[@class='web-login-button']")
            await submit_btn_ele.click()  # click the login button
            # TODO: the SMS code should also be validated, since the entered code may be incorrect
            break
    async def check_page_display_slider(self, move_step: int = 10, slider_level: str = "easy"):
        """
        Check whether a slider captcha appears on the page
        :return:
        """
        # Wait for the slider captcha to appear
        back_selector = "#captcha-verify-image"
        try:
            await self.context_page.wait_for_selector(selector=back_selector, state="visible", timeout=30 * 1000)
        except PlaywrightTimeoutError:  # no slider captcha appeared, so return directly
            return
        gap_selector = 'xpath=//*[@id="captcha_container"]/div/div[2]/img[2]'
        max_slider_try_times = 20
        slider_verify_success = False
        while not slider_verify_success:
            if max_slider_try_times <= 0:
                utils.logger.error("[DouYinLogin.check_page_display_slider] slider verify failed ...")
                sys.exit()
            try:
                await self.move_slider(back_selector, gap_selector, move_step, slider_level)
                await asyncio.sleep(1)
                # If the slide was too slow or verification failed, the page shows a "too slow" prompt; click the refresh button here
                page_content = await self.context_page.content()
                if "操作过慢" in page_content or "提示重新操作" in page_content:
                    utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify failed, retry ...")
                    await self.context_page.click(selector="//a[contains(@class, 'secsdk_captcha_refresh')]")
                    max_slider_try_times -= 1  # count this retry so the loop cannot run forever
                    continue
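                # After a successful slide, wait for the slider to disappear
                await self.context_page.wait_for_selector(selector=back_selector, state="hidden", timeout=1000)
                # If the slider disappeared, verification succeeded and the loop ends; otherwise the wait_for_selector call above raises, the exception is caught, and the loop retries the slide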
                utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify success ...")
                slider_verify_success = True
            except Exception as e:
                utils.logger.error(f"[DouYinLogin.check_page_display_slider] slider verify failed, error: {e}")
                await asyncio.sleep(1)
                max_slider_try_times -= 1
                utils.logger.info(f"[DouYinLogin.check_page_display_slider] remaining slider try times: {max_slider_try_times}")
                continue
    async def move_slider(self, back_selector: str, gap_selector: str, move_step: int = 10, slider_level="easy"):
        """
        Move the slider to the right to complete the verification
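        :param back_selector: selector of the slider captcha background image
        :param gap_selector:  selector of the slider piece image
        :param move_step: passed to mouse.move as steps to control the speed of each move; Playwright's default is 1, and larger values make the move slower
        :param slider_level: slider difficulty, easy or hard, corresponding to the slider shown during SMS-code login and the slider on the intermediate captcha page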
        :return:
        """
        # get slider background image
        slider_back_elements = await self.context_page.wait_for_selector(
            selector=back_selector,
            timeout=1000 * 10,  # wait 10 seconds
        )
        slide_back = str(await slider_back_elements.get_property("src")) # type: ignore
        # get slider gap image
        gap_elements = await self.context_page.wait_for_selector(
            selector=gap_selector,
            timeout=1000 * 10,  # wait 10 seconds
        )
        gap_src = str(await gap_elements.get_property("src")) # type: ignore
        # Detect the gap position
        slide_app = utils.Slide(gap=gap_src, bg=slide_back)
        distance = slide_app.discern()
        # Generate the movement track
        tracks = utils.get_tracks(distance, slider_level)
        new_1 = tracks[-1] - (sum(tracks) - distance)
        tracks.pop()
        tracks.append(new_1)
        # Drag the slider to the target position along the track
        element = await self.context_page.query_selector(gap_selector)
        bounding_box = await element.bounding_box() # type: ignore
        await self.context_page.mouse.move(bounding_box["x"] + bounding_box["width"] / 2, # type: ignore
                                           bounding_box["y"] + bounding_box["height"] / 2) # type: ignore
        # x coordinate of the slider's center point
        x = bounding_box["x"] + bounding_box["width"] / 2 # type: ignore
        # Simulate the drag
        await element.hover() # type: ignore
        await self.context_page.mouse.down()
        for track in tracks:
            # Move the mouse along the track, one segment per iteration
            # steps controls the speed of each move; Playwright's default is 1, and larger values make the move slower
            await self.context_page.mouse.move(x + track, 0, steps=move_step)
            x += track
        await self.context_page.mouse.up()
    async def login_by_cookies(self):
        utils.logger.info("[DouYinLogin.login_by_cookies] Begin login douyin by cookie ...")
        for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
            await self.browser_context.add_cookies([{
                'name': key,
                'value': value,
                'domain': ".douyin.com",
                'path': "/"
            }])
--- a/media_platform/kuaishou/__init__.py
+++ b/media_platform/kuaishou/__init__.py
@ -0,0 +1,2 @@
 # -*- coding: utf-8 -*-
 from .core import KuaishouCrawler
--- a/media_platform/kuaishou/client.py
+++ b/media_platform/kuaishou/client.py
@ -0,0 +1,307 @@
 # -*- coding: utf-8 -*-
 import asyncio
 import json
 from typing import Any, Callable, Dict, List, Optional
 from urllib.parse import urlencode
 import httpx
 from playwright.async_api import BrowserContext, Page
 import config
 from base.base_crawler import AbstractApiClient
 from tools import utils
 from .exception import DataFetchError
 from .graphql import KuaiShouGraphQL
 class KuaiShouClient(AbstractApiClient):
    def __init__(
            self,
            timeout=10,
            proxies=None,
            *,
            headers: Dict[str, str],
            playwright_page: Page,
            cookie_dict: Dict[str, str],
    ):
        self.proxies = proxies
        self.timeout = timeout
        self.headers = headers
        self._host = "https://www.kuaishou.com/graphql"
        self.playwright_page = playwright_page
        self.cookie_dict = cookie_dict
        self.graphql = KuaiShouGraphQL()
    async def request(self, method, url, **kwargs) -> Any:
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request(
                method, url, timeout=self.timeout,
                **kwargs
            )
        data: Dict = response.json()
        if data.get("errors"):
            raise DataFetchError(data.get("errors", "unknown error"))
        else:
            return data.get("data", {})
    async def get(self, uri: str, params=None) -> Dict:
        final_uri = uri
        if isinstance(params, dict):
            final_uri = (f"{uri}?"
                         f"{urlencode(params)}")
        return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=self.headers)
    async def post(self, uri: str, data: dict) -> Dict:
        json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
        return await self.request(method="POST", url=f"{self._host}{uri}",
                                  data=json_str, headers=self.headers)
    async def pong(self) -> bool:
        """get a note to check if login state is ok"""
        utils.logger.info("[KuaiShouClient.pong] Begin pong kuaishou...")
        ping_flag = False
        try:
            post_data = {
                "operationName": "visionProfileUserList",
                "variables": {
                    "ftype": 1,
                },
                "query": self.graphql.get("vision_profile_user_list")
            }
            res = await self.post("", post_data)
            if res.get("visionProfileUserList", {}).get("result") == 1:
                ping_flag = True
        except Exception as e:
            utils.logger.error(f"[KuaiShouClient.pong] Pong kuaishou failed: {e}, and try to login again...")
            ping_flag = False
        return ping_flag
    async def update_cookies(self, browser_context: BrowserContext):
        cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
        self.headers["Cookie"] = cookie_str
        self.cookie_dict = cookie_dict
    async def search_info_by_keyword(self, keyword: str, pcursor: str):
        """
        KuaiShou web search api
        :param keyword: search keyword
        :param pcursor: pagination cursor
        :return:
        """
        post_data = {
            "operationName": "visionSearchPhoto",
            "variables": {
                "keyword": keyword,
                "pcursor": pcursor,
                "page": "search"
            },
            "query": self.graphql.get("search_query")
        }
        return await self.post("", post_data)
    async def get_video_info(self, photo_id: str) -> Dict:
        """
        Kuaishou web video detail api
        :param photo_id:
        :return:
        """
        post_data = {
            "operationName": "visionVideoDetail",
            "variables": {
                "photoId": photo_id,
                "page": "search"
            },
            "query": self.graphql.get("video_detail")
        }
        return await self.post("", post_data)
    async def get_video_comments(self, photo_id: str, pcursor: str = "") -> Dict:
        """get video comments
        :param photo_id: photo id you want to fetch
        :param pcursor: pcursor returned by the previous request, defaults to ""
        :return:
        """
        post_data = {
            "operationName": "commentListQuery",
            "variables": {
                "photoId": photo_id,
                "pcursor": pcursor
            },
            "query": self.graphql.get("comment_list")
        }
        return await self.post("", post_data)
    async def get_video_sub_comments(
        self, photo_id: str, rootCommentId: str, pcursor: str = ""
    ) -> Dict:
        """get video sub comments
        :param photo_id: photo id you want to fetch
        :param rootCommentId: id of the root comment whose replies are fetched
        :param pcursor: pcursor returned by the previous request, defaults to ""
        :return:
        """
        post_data = {
            "operationName": "visionSubCommentList",
            "variables": {
                "photoId": photo_id,
                "pcursor": pcursor,
                "rootCommentId": rootCommentId,
            },
            "query": self.graphql.get("vision_sub_comment_list"),
        }
        return await self.post("", post_data)
    async def get_creator_profile(self, userId: str) -> Dict:
        post_data = {
            "operationName": "visionProfile",
            "variables": {
                "userId": userId
            },
            "query": self.graphql.get("vision_profile"),
        }
        return await self.post("", post_data)
    async def get_video_by_creater(self, userId: str, pcursor: str = "") -> Dict:
        post_data = {
            "operationName": "visionProfilePhotoList",
            "variables": {
                "page": "profile", 
                "pcursor": pcursor, 
                "userId": userId
            },
            "query": self.graphql.get("vision_profile_photo_list"),
        }
        return await self.post("", post_data)
    async def get_video_all_comments(
        self,
        photo_id: str,
        crawl_interval: float = 1.0,
        callback: Optional[Callable] = None,
    ):
        """
        get video all comments include sub comments
        :param photo_id:
        :param crawl_interval:
        :param callback:
        :return:
        """
        result = []
        pcursor = ""
        while pcursor != "no_more":
            comments_res = await self.get_video_comments(photo_id, pcursor)
            vision_comment_list = comments_res.get("visionCommentList", {})
            pcursor = vision_comment_list.get("pcursor", "no_more")  # stop when no cursor is returned
            comments = vision_comment_list.get("rootComments", [])
            if callback:  # invoke the callback if one was provided
                await callback(photo_id, comments)
            result.extend(comments)
            await asyncio.sleep(crawl_interval)
            sub_comments = await self.get_comments_all_sub_comments(
                comments, photo_id, crawl_interval, callback
            )
            result.extend(sub_comments)
        return result
    async def get_comments_all_sub_comments(
        self,
        comments: List[Dict],
        photo_id,
        crawl_interval: float = 1.0,
        callback: Optional[Callable] = None,
    ) -> List[Dict]:
        """
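        Fetch all second-level comments under the given first-level comments; this method keeps paging until every sub-comment has been retrieved
        Args:
            comments: list of first-level comments
            photo_id: video id
            crawl_interval: delay between comment requests, in seconds
            callback: callback invoked after each batch of comments is fetched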
        Returns:
        """
        if not config.ENABLE_GET_SUB_COMMENTS:
            utils.logger.info(
                f"[KuaiShouClient.get_comments_all_sub_comments] Crawling sub_comment mode is not enabled"
            )
            return []
        result = []
        for comment in comments:
            sub_comments = comment.get("subComments")
            if sub_comments and callback:
                await callback(photo_id, sub_comments)
            sub_comment_pcursor = comment.get("subCommentsPcursor")
            if sub_comment_pcursor == "no_more":
                continue
            root_comment_id = comment.get("commentId")
            sub_comment_pcursor = ""
            while sub_comment_pcursor != "no_more":
                comments_res = await self.get_video_sub_comments(
                    photo_id, root_comment_id, sub_comment_pcursor
                )
                vision_sub_comment_list = comments_res.get("visionSubCommentList", {})
                sub_comment_pcursor = vision_sub_comment_list.get("pcursor", "no_more")
                sub_comment_page = vision_sub_comment_list.get("subComments", [])  # avoid shadowing `comments`
                if callback:
                    await callback(photo_id, sub_comment_page)
                await asyncio.sleep(crawl_interval)
                result.extend(sub_comment_page)
        return result
    async def get_creator_info(self, user_id: str) -> Dict:
        """
        eg: https://www.kuaishou.com/profile/3x4jtnbfter525a
        Kuaishou user profile page
        """
        visionProfile = await self.get_creator_profile(user_id)
        return visionProfile.get("userProfile")
    async def get_all_videos_by_creator(
        self,
        user_id: str,
        crawl_interval: float = 1.0,
        callback: Optional[Callable] = None,
    ) -> List[Dict]:
        """
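        Fetch all posts published by the given user; this method keeps paging until every post of that user has been retrieved
        Args:
            user_id: user ID
            crawl_interval: delay between requests, in seconds
            callback: update callback invoked after each page is fetched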
        Returns:
        """
        result = []
        pcursor = ""
        while pcursor != "no_more":
            videos_res = await self.get_video_by_creater(user_id, pcursor)
            if not videos_res:
                utils.logger.error(
                    f"[KuaiShouClient.get_all_videos_by_creator] The current creator may have been banned by ks, so they cannot access the data."
                )
                break
            vision_profile_photo_list = videos_res.get("visionProfilePhotoList", {})
            pcursor = vision_profile_photo_list.get("pcursor", "no_more")  # a missing cursor ends pagination instead of looping forever
            videos = vision_profile_photo_list.get("feeds", [])
            utils.logger.info(
                f"[KuaiShouClient.get_all_videos_by_creator] got user_id:{user_id} videos len : {len(videos)}"
            )
            if callback:
                await callback(videos)
            await asyncio.sleep(crawl_interval)
            result.extend(videos)
        return result
--- a/media_platform/kuaishou/core.py
+++ b/media_platform/kuaishou/core.py
@ -0,0 +1,288 @@
 import asyncio
 import os
 import random
 import time
 from asyncio import Task
 from typing import Dict, List, Optional, Tuple
 from playwright.async_api import (BrowserContext, BrowserType, Page,
                                  async_playwright)
 import config
 from base.base_crawler import AbstractCrawler
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import kuaishou as kuaishou_store
 from tools import utils
 from var import comment_tasks_var, crawler_type_var
 from .client import KuaiShouClient
 from .exception import DataFetchError
 from .login import KuaishouLogin
 class KuaishouCrawler(AbstractCrawler):
    context_page: Page
    ks_client: KuaiShouClient
    browser_context: BrowserContext
    def __init__(self):
        self.index_url = "https://www.kuaishou.com"
        self.user_agent = utils.get_user_agent()
    async def start(self):
        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
            playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
        async with async_playwright() as playwright:
            # Launch a browser context.
            chromium = playwright.chromium
            self.browser_context = await self.launch_browser(
                chromium,
                None,
                self.user_agent,
                headless=config.HEADLESS
            )
            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")
            self.context_page = await self.browser_context.new_page()
            await self.context_page.goto(f"{self.index_url}?isHome=1")
            # Create a client to interact with the kuaishou website.
            self.ks_client = await self.create_ks_client(httpx_proxy_format)
            if not await self.ks_client.pong():
                login_obj = KuaishouLogin(
                    login_type=config.LOGIN_TYPE,
                    login_phone="",  # phone login is not implemented; the proxy config must not be passed here
                    browser_context=self.browser_context,
                    context_page=self.context_page,
                    cookie_str=config.COOKIES
                )
                await login_obj.begin()
                await self.ks_client.update_cookies(browser_context=self.browser_context)
            crawler_type_var.set(config.CRAWLER_TYPE)
            if config.CRAWLER_TYPE == "search":
                # Search for videos and retrieve their comment information.
                await self.search()
            elif config.CRAWLER_TYPE == "detail":
                # Get the information and comments of the specified post
                await self.get_specified_videos()
            elif config.CRAWLER_TYPE == "creator":
                # Get creator's information and their videos and comments
                await self.get_creators_and_videos()
            else:
                pass
            utils.logger.info("[KuaishouCrawler.start] Kuaishou Crawler finished ...")
    async def search(self):
        utils.logger.info("[KuaishouCrawler.search] Begin search kuaishou keywords")
        ks_limit_count = 20  # kuaishou limit page fixed value
        if config.CRAWLER_MAX_NOTES_COUNT < ks_limit_count:
            config.CRAWLER_MAX_NOTES_COUNT = ks_limit_count
        start_page = config.START_PAGE
        for keyword in config.KEYWORDS.split(","):
            utils.logger.info(f"[KuaishouCrawler.search] Current search keyword: {keyword}")
            page = 1
            while (page - start_page + 1) * ks_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
                if page < start_page:
                    utils.logger.info(f"[KuaishouCrawler.search] Skip page: {page}")
                    page += 1
                    continue
                utils.logger.info(f"[KuaishouCrawler.search] search kuaishou keyword: {keyword}, page: {page}")
                video_id_list: List[str] = []
                videos_res = await self.ks_client.search_info_by_keyword(
                    keyword=keyword,
                    pcursor=str(page),
                )
                page += 1  # advance now so the `continue` branches below cannot retry the same page forever
                if not videos_res:
                    utils.logger.error(f"[KuaishouCrawler.search] search info by keyword:{keyword} not found data")
                    continue
                vision_search_photo: Dict = videos_res.get("visionSearchPhoto")
                if vision_search_photo.get("result") != 1:
                    utils.logger.error(f"[KuaishouCrawler.search] search info by keyword:{keyword} not found data ")
                    continue
                for video_detail in vision_search_photo.get("feeds"):
                    video_id_list.append(video_detail.get("photo", {}).get("id"))
                    await kuaishou_store.update_kuaishou_video(video_item=video_detail)
                # batch fetch video comments
                await self.batch_get_video_comments(video_id_list)
    async def get_specified_videos(self):
        """Get the information and comments of the specified post"""
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_video_info_task(video_id=video_id, semaphore=semaphore) for video_id in config.KS_SPECIFIED_ID_LIST
        ]
        video_details = await asyncio.gather(*task_list)
        for video_detail in video_details:
            if video_detail is not None:
                await kuaishou_store.update_kuaishou_video(video_detail)
        await self.batch_get_video_comments(config.KS_SPECIFIED_ID_LIST)
    async def get_video_info_task(self, video_id: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
        """Get video detail task"""
        async with semaphore:
            try:
                result = await self.ks_client.get_video_info(video_id)
                utils.logger.info(f"[KuaishouCrawler.get_video_info_task] Get video_id:{video_id} info result: {result} ...")
                return result.get("visionVideoDetail")
            except DataFetchError as ex:
                utils.logger.error(f"[KuaishouCrawler.get_video_info_task] Get video detail error: {ex}")
                return None
            except KeyError as ex:
                utils.logger.error(f"[KuaishouCrawler.get_video_info_task] video detail not found, video_id:{video_id}, err: {ex}")
                return None
    async def batch_get_video_comments(self, video_id_list: List[str]):
        """
        batch get video comments
        :param video_id_list:
        :return:
        """
        if not config.ENABLE_GET_COMMENTS:
            utils.logger.info(f"[KuaishouCrawler.batch_get_video_comments] Crawling comment mode is not enabled")
            return
        utils.logger.info(f"[KuaishouCrawler.batch_get_video_comments] video ids:{video_id_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for video_id in video_id_list:
            task = asyncio.create_task(self.get_comments(video_id, semaphore), name=video_id)
            task_list.append(task)
        comment_tasks_var.set(task_list)
        await asyncio.gather(*task_list)
    async def get_comments(self, video_id: str, semaphore: asyncio.Semaphore):
        """
        get comment for video id
        :param video_id:
        :param semaphore:
        :return:
        """
        async with semaphore:
            try:
                utils.logger.info(f"[KuaishouCrawler.get_comments] begin get video_id: {video_id} comments ...")
                await self.ks_client.get_video_all_comments(
                    photo_id=video_id,
                    crawl_interval=random.random(),
                    callback=kuaishou_store.batch_update_ks_video_comments
                )
            except DataFetchError as ex:
                utils.logger.error(f"[KuaishouCrawler.get_comments] get video_id: {video_id} comment error: {ex}")
            except Exception as e:
                utils.logger.error(f"[KuaishouCrawler.get_comments] may have been blocked, err:{e}")
                # use time.sleep to block the main coroutine (instead of asyncio.sleep) and cancel the running comment tasks
                # kuaishou may have blocked our requests, so take a nap and then refresh the cookies
                current_running_tasks = comment_tasks_var.get()
                for task in current_running_tasks:
                    task.cancel()
                time.sleep(20)
                await self.context_page.goto(f"{self.index_url}?isHome=1")
                await self.ks_client.update_cookies(browser_context=self.browser_context)
    @staticmethod
    def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
        """format proxy info for playwright and httpx"""
        playwright_proxy = {
            "server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
            "username": ip_proxy_info.user,
            "password": ip_proxy_info.password,
        }
        httpx_proxy = {
            f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
        }
        return playwright_proxy, httpx_proxy
    async def create_ks_client(self, httpx_proxy: Optional[Dict]) -> KuaiShouClient:
        """Create ks client"""
        utils.logger.info("[KuaishouCrawler.create_ks_client] Begin create kuaishou API client ...")
        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
        ks_client_obj = KuaiShouClient(
            proxies=httpx_proxy,
            headers={
                "User-Agent": self.user_agent,
                "Cookie": cookie_str,
                "Origin": self.index_url,
                "Referer": self.index_url,
                "Content-Type": "application/json;charset=UTF-8"
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
        )
        return ks_client_obj
    async def launch_browser(
            self,
            chromium: BrowserType,
            playwright_proxy: Optional[Dict],
            user_agent: Optional[str],
            headless: bool = True
    ) -> BrowserContext:
        """Launch browser and create browser context"""
        utils.logger.info("[KuaishouCrawler.launch_browser] Begin create browser context ...")
        if config.SAVE_LOGIN_STATE:
            user_data_dir = os.path.join(os.getcwd(), "browser_data",
                                         config.USER_DATA_DIR % config.PLATFORM)  # type: ignore
            browser_context = await chromium.launch_persistent_context(
                user_data_dir=user_data_dir,
                accept_downloads=True,
                headless=headless,
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
        else:
            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
    async def get_creators_and_videos(self) -> None:
        """Get creator's videos and retrieve their comment information."""
        utils.logger.info("[KuaishouCrawler.get_creators_and_videos] Begin get kuaishou creators")
        for user_id in config.KS_CREATOR_ID_LIST:
            # get creator detail info from web html content
            creator_info: Dict = await self.ks_client.get_creator_info(user_id=user_id)
            if creator_info:
                await kuaishou_store.save_creator(user_id, creator=creator_info)
            # Get all video information of the creator
            all_video_list = await self.ks_client.get_all_videos_by_creator(
                user_id=user_id,
                crawl_interval=random.random(),
                callback=self.fetch_creator_video_detail
            )
            video_ids = [video_item.get("photo", {}).get("id") for video_item in all_video_list]
            await self.batch_get_video_comments(video_ids)
    async def fetch_creator_video_detail(self, video_list: List[Dict]):
        """
        Concurrently obtain the specified post list and save the data
        """
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_video_info_task(post_item.get("photo", {}).get("id"), semaphore) for post_item in video_list
        ]
        video_details = await asyncio.gather(*task_list)
        for video_detail in video_details:
            if video_detail is not None:
                await kuaishou_store.update_kuaishou_video(video_detail)
    async def close(self):
        """Close browser context"""
        await self.browser_context.close()
        utils.logger.info("[KuaishouCrawler.close] Browser context closed ...")
--- a/media_platform/kuaishou/exception.py
+++ b/media_platform/kuaishou/exception.py
@ -0,0 +1,9 @@
 from httpx import RequestError
 class DataFetchError(RequestError):
    """Raised when fetching data fails"""
 class IPBlockError(RequestError):
    """Raised when requests are sent so fast that the server blocks our IP"""
--- a/media_platform/kuaishou/field.py
+++ b/media_platform/kuaishou/field.py
@ -0,0 +1 @@
 # -*- coding: utf-8 -*-
--- a/media_platform/kuaishou/graphql.py
+++ b/media_platform/kuaishou/graphql.py
@ -0,0 +1,22 @@
 # Kuaishou's data transport is built on GraphQL
 # This class is responsible for loading the GraphQL query documents (schemas)
 from typing import Dict
 class KuaiShouGraphQL:
    graphql_queries: Dict[str, str] = {}
    def __init__(self):
        self.graphql_dir = "media_platform/kuaishou/graphql/"
        self.load_graphql_queries()
    def load_graphql_queries(self):
        graphql_files = ["search_query.graphql", "video_detail.graphql", "comment_list.graphql", "vision_profile.graphql", "vision_profile_photo_list.graphql", "vision_profile_user_list.graphql", "vision_sub_comment_list.graphql"]
        for file in graphql_files:
            with open(self.graphql_dir + file, mode="r", encoding="utf-8") as f:
                query_name = file.split(".")[0]
                self.graphql_queries[query_name] = f.read()
    def get(self, query_name: str) -> str:
        return self.graphql_queries.get(query_name, "Query not found")
--- a/media_platform/kuaishou/graphql/comment_list.graphql
+++ b/media_platform/kuaishou/graphql/comment_list.graphql
@ -0,0 +1,39 @@
 query commentListQuery($photoId: String, $pcursor: String) {
  visionCommentList(photoId: $photoId, pcursor: $pcursor) {
    commentCount
    pcursor
    rootComments {
      commentId
      authorId
      authorName
      content
      headurl
      timestamp
      likedCount
      realLikedCount
      liked
      status
      authorLiked
      subCommentCount
      subCommentsPcursor
      subComments {
        commentId
        authorId
        authorName
        content
        headurl
        timestamp
        likedCount
        realLikedCount
        liked
        status
        authorLiked
        replyToUserName
        replyTo
        __typename
      }
      __typename
    }
    __typename
  }
 }
--- a/media_platform/kuaishou/graphql/search_query.graphql
+++ b/media_platform/kuaishou/graphql/search_query.graphql
@ -0,0 +1,111 @@
 fragment photoContent on PhotoEntity {
  __typename
  id
  duration
  caption
  originCaption
  likeCount
  viewCount
  commentCount
  realLikeCount
  coverUrl
  photoUrl
  photoH265Url
  manifest
  manifestH265
  videoResource
  coverUrls {
    url
    __typename
  }
  timestamp
  expTag
  animatedCoverUrl
  distance
  videoRatio
  liked
  stereoType
  profileUserTopPhoto
  musicBlocked
 }
 fragment recoPhotoFragment on recoPhotoEntity {
  __typename
  id
  duration
  caption
  originCaption
  likeCount
  viewCount
  commentCount
  realLikeCount
  coverUrl
  photoUrl
  photoH265Url
  manifest
  manifestH265
  videoResource
  coverUrls {
    url
    __typename
  }
  timestamp
  expTag
  animatedCoverUrl
  distance
  videoRatio
  liked
  stereoType
  profileUserTopPhoto
  musicBlocked
 }
 fragment feedContent on Feed {
  type
  author {
    id
    name
    headerUrl
    following
    headerUrls {
      url
      __typename
    }
    __typename
  }
  photo {
    ...photoContent
    ...recoPhotoFragment
    __typename
  }
  canAddComment
  llsid
  status
  currentPcursor
  tags {
    type
    name
    __typename
  }
  __typename
 }
 query visionSearchPhoto($keyword: String, $pcursor: String, $searchSessionId: String, $page: String, $webPageArea: String) {
  visionSearchPhoto(keyword: $keyword, pcursor: $pcursor, searchSessionId: $searchSessionId, page: $page, webPageArea: $webPageArea) {
    result
    llsid
    webPageArea
    feeds {
      ...feedContent
      __typename
    }
    searchSessionId
    pcursor
    aladdinBanner {
      imgUrl
      link
      __typename
    }
    __typename
  }
 }
--- a/media_platform/kuaishou/graphql/video_detail.graphql
+++ b/media_platform/kuaishou/graphql/video_detail.graphql
@ -0,0 +1,80 @@
 query visionVideoDetail($photoId: String, $type: String, $page: String, $webPageArea: String) {
  visionVideoDetail(photoId: $photoId, type: $type, page: $page, webPageArea: $webPageArea) {
    status
    type
    author {
      id
      name
      following
      headerUrl
      __typename
    }
    photo {
      id
      duration
      caption
      likeCount
      realLikeCount
      coverUrl
      photoUrl
      liked
      timestamp
      expTag
      llsid
      viewCount
      videoRatio
      stereoType
      musicBlocked
      manifest {
        mediaType
        businessType
        version
        adaptationSet {
          id
          duration
          representation {
            id
            defaultSelect
            backupUrl
            codecs
            url
            height
            width
            avgBitrate
            maxBitrate
            m3u8Slice
            qualityType
            qualityLabel
            frameRate
            featureP2sp
            hidden
            disableAdaptive
            __typename
          }
          __typename
        }
        __typename
      }
      manifestH265
      photoH265Url
      coronaCropManifest
      coronaCropManifestH265
      croppedPhotoH265Url
      croppedPhotoUrl
      videoResource
      __typename
    }
    tags {
      type
      name
      __typename
    }
    commentLimit {
      canAddComment
      __typename
    }
    llsid
    danmakuSwitch
    __typename
  }
 }
--- a/media_platform/kuaishou/graphql/vision_profile.graphql
+++ b/media_platform/kuaishou/graphql/vision_profile.graphql
@ -0,0 +1,27 @@
 query visionProfile($userId: String) {
  visionProfile(userId: $userId) {
    result
    hostName
    userProfile {
      ownerCount {
        fan
        photo
        follow
        photo_public
        __typename
      }
      profile {
        gender
        user_name
        user_id
        headurl
        user_text
        user_profile_bg_url
        __typename
      }
      isFollowing
      __typename
    }
    __typename
  }
 }
--- a/media_platform/kuaishou/graphql/vision_profile_photo_list.graphql
+++ b/media_platform/kuaishou/graphql/vision_profile_photo_list.graphql
@ -0,0 +1,110 @@
 fragment photoContent on PhotoEntity {
  __typename
  id
  duration
  caption
  originCaption
  likeCount
  viewCount
  commentCount
  realLikeCount
  coverUrl
  photoUrl
  photoH265Url
  manifest
  manifestH265
  videoResource
  coverUrls {
    url
    __typename
  }
  timestamp
  expTag
  animatedCoverUrl
  distance
  videoRatio
  liked
  stereoType
  profileUserTopPhoto
  musicBlocked
  riskTagContent
  riskTagUrl
 }
 fragment recoPhotoFragment on recoPhotoEntity {
  __typename
  id
  duration
  caption
  originCaption
  likeCount
  viewCount
  commentCount
  realLikeCount
  coverUrl
  photoUrl
  photoH265Url
  manifest
  manifestH265
  videoResource
  coverUrls {
    url
    __typename
  }
  timestamp
  expTag
  animatedCoverUrl
  distance
  videoRatio
  liked
  stereoType
  profileUserTopPhoto
  musicBlocked
  riskTagContent
  riskTagUrl
 }
 fragment feedContent on Feed {
  type
  author {
    id
    name
    headerUrl
    following
    headerUrls {
      url
      __typename
    }
    __typename
  }
  photo {
    ...photoContent
    ...recoPhotoFragment
    __typename
  }
  canAddComment
  llsid
  status
  currentPcursor
  tags {
    type
    name
    __typename
  }
  __typename
 }
 query visionProfilePhotoList($pcursor: String, $userId: String, $page: String, $webPageArea: String) {
  visionProfilePhotoList(pcursor: $pcursor, userId: $userId, page: $page, webPageArea: $webPageArea) {
    result
    llsid
    webPageArea
    feeds {
      ...feedContent
      __typename
    }
    hostName
    pcursor
    __typename
  }
 }
--- a/media_platform/kuaishou/graphql/vision_profile_user_list.graphql
+++ b/media_platform/kuaishou/graphql/vision_profile_user_list.graphql
@ -0,0 +1,16 @@
 query visionProfileUserList($pcursor: String, $ftype: Int) {
  visionProfileUserList(pcursor: $pcursor, ftype: $ftype) {
    result
    fols {
      user_name
      headurl
      user_text
      isFollowing
      user_id
      __typename
    }
    hostName
    pcursor
    __typename
  }
 }
--- a/media_platform/kuaishou/graphql/vision_sub_comment_list.graphql
+++ b/media_platform/kuaishou/graphql/vision_sub_comment_list.graphql
@ -0,0 +1,22 @@
 mutation visionSubCommentList($photoId: String, $rootCommentId: String, $pcursor: String) {
  visionSubCommentList(photoId: $photoId, rootCommentId: $rootCommentId, pcursor: $pcursor) {
    pcursor
    subComments {
      commentId
      authorId
      authorName
      content
      headurl
      timestamp
      likedCount
      realLikedCount
      liked
      status
      authorLiked
      replyToUserName
      replyTo
      __typename
    }
    __typename
  }
 }
--- a/media_platform/kuaishou/login.py
+++ b/media_platform/kuaishou/login.py
@ -0,0 +1,102 @@
 import asyncio
 import functools
 import sys
 from typing import Optional
 from playwright.async_api import BrowserContext, Page
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                      wait_fixed)
 import config
 from base.base_crawler import AbstractLogin
 from tools import utils
 class KuaishouLogin(AbstractLogin):
    def __init__(self,
                 login_type: str,
                 browser_context: BrowserContext,
                 context_page: Page,
                 login_phone: Optional[str] = "",
                 cookie_str: str = ""
                 ):
        config.LOGIN_TYPE = login_type
        self.browser_context = browser_context
        self.context_page = context_page
        self.login_phone = login_phone
        self.cookie_str = cookie_str
    async def begin(self):
        """Start login kuaishou"""
        utils.logger.info("[KuaishouLogin.begin] Begin login kuaishou ...")
        if config.LOGIN_TYPE == "qrcode":
            await self.login_by_qrcode()
        elif config.LOGIN_TYPE == "phone":
            await self.login_by_mobile()
        elif config.LOGIN_TYPE == "cookie":
            await self.login_by_cookies()
        else:
            raise ValueError("[KuaishouLogin.begin] Invalid login type. Currently only qrcode, phone, or cookie are supported ...")
    @retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
    async def check_login_state(self) -> bool:
        """
            Check whether login has succeeded: return True if so, otherwise return False
            the retry decorator retries up to 600 times while the return value is False, with a 1-second interval between attempts
            once the maximum number of attempts is reached, RetryError is raised
        """
        current_cookie = await self.browser_context.cookies()
        _, cookie_dict = utils.convert_cookies(current_cookie)
        kuaishou_pass_token = cookie_dict.get("passToken")
        if kuaishou_pass_token:
            return True
        return False
    async def login_by_qrcode(self):
        """login kuaishou website and keep webdriver login state"""
        utils.logger.info("[KuaishouLogin.login_by_qrcode] Begin login kuaishou by qrcode ...")
        # click login button
        login_button_ele = self.context_page.locator(
            "xpath=//p[text()='登录']"
        )
        await login_button_ele.click()
        # find login qrcode
        qrcode_img_selector = "//div[@class='qrcode-img']//img"
        base64_qrcode_img = await utils.find_login_qrcode(
            self.context_page,
            selector=qrcode_img_selector
        )
        if not base64_qrcode_img:
            utils.logger.info("[KuaishouLogin.login_by_qrcode] login failed, qrcode not found, please check ...")
            sys.exit()
        # show login qrcode
        partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
        asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
        utils.logger.info(f"[KuaishouLogin.login_by_qrcode] waiting for scan code login, remaining time is 600s")
        try:
            await self.check_login_state()
        except RetryError:
            utils.logger.info("[KuaishouLogin.login_by_qrcode] Login kuaishou failed by qrcode login method ...")
            sys.exit()
        wait_redirect_seconds = 5
        utils.logger.info(f"[KuaishouLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
        await asyncio.sleep(wait_redirect_seconds)
    async def login_by_mobile(self):
        pass
    async def login_by_cookies(self):
        utils.logger.info("[KuaishouLogin.login_by_cookies] Begin login kuaishou by cookie ...")
        for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
            await self.browser_context.add_cookies([{
                'name': key,
                'value': value,
                'domain': ".kuaishou.com",
                'path': "/"
            }])
--- a/media_platform/weibo/__init__.py
+++ b/media_platform/weibo/__init__.py
@ -0,0 +1,7 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/23 15:40
 # @Desc    :
 from .client import WeiboClient
 from .core import WeiboCrawler
 from .login import WeiboLogin
--- a/media_platform/weibo/client.py
+++ b/media_platform/weibo/client.py
@ -0,0 +1,206 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/23 15:40
 # @Desc    : Weibo crawler API request client
 import asyncio
 import copy
 import json
 import re
 from typing import Any, Callable, Dict, List, Optional
 from urllib.parse import urlencode
 import httpx
 from playwright.async_api import BrowserContext, Page
 from tools import utils
 from .exception import DataFetchError
 from .field import SearchType
 class WeiboClient:
    def __init__(
            self,
            timeout=10,
            proxies=None,
            *,
            headers: Dict[str, str],
            playwright_page: Page,
            cookie_dict: Dict[str, str],
    ):
        self.proxies = proxies
        self.timeout = timeout
        self.headers = headers
        self._host = "https://m.weibo.cn"
        self.playwright_page = playwright_page
        self.cookie_dict = cookie_dict
        self._image_agent_host = "https://i1.wp.com/"
    async def request(self, method, url, **kwargs) -> Any:
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request(
                method, url, timeout=self.timeout,
                **kwargs
            )
        data: Dict = response.json()
        if data.get("ok") != 1:
            utils.logger.error(f"[WeiboClient.request] request {method}:{url} err, res:{data}")
            raise DataFetchError(data.get("msg", "unknown error"))
        else:
            return data.get("data", {})
    async def get(self, uri: str, params=None, headers=None) -> Dict:
        final_uri = uri
        if isinstance(params, dict):
            final_uri = (f"{uri}?"
                         f"{urlencode(params)}")
        if headers is None:
            headers = self.headers
        return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=headers)
    async def post(self, uri: str, data: dict) -> Dict:
        json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
        return await self.request(method="POST", url=f"{self._host}{uri}",
                                  data=json_str, headers=self.headers)
    async def pong(self) -> bool:
        """get a note to check if login state is ok"""
        utils.logger.info("[WeiboClient.pong] Begin pong weibo...")
        ping_flag = False
        try:
            uri = "/api/config"
            resp_data: Dict = await self.request(method="GET", url=f"{self._host}{uri}", headers=self.headers)
            if resp_data.get("login"):
                ping_flag = True
            else:
                utils.logger.error(f"[WeiboClient.pong] cookie may be invalid, need to log in again...")
        except Exception as e:
            utils.logger.error(f"[WeiboClient.pong] Pong weibo failed: {e}, and try to login again...")
            ping_flag = False
        return ping_flag
    async def update_cookies(self, browser_context: BrowserContext):
        cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
        self.headers["Cookie"] = cookie_str
        self.cookie_dict = cookie_dict
    async def get_note_by_keyword(
            self,
            keyword: str,
            page: int = 1,
            search_type: SearchType = SearchType.DEFAULT
    ) -> Dict:
        """
        search note by keyword
        :param keyword: keyword to search Weibo for
        :param page: pagination parameter, the current page number
        :param search_type: search type; see the SearchType enum in weibo/field.py
        :return:
        """
        uri = "/api/container/getIndex"
        containerid = f"100103type={search_type.value}&q={keyword}"
        params = {
            "containerid": containerid,
            "page_type": "searchall",
            "page": page,
        }
        return await self.get(uri, params)
    async def get_note_comments(self, mid_id: str, max_id: int) -> Dict:
        """get notes comments
        :param mid_id: Weibo post ID
        :param max_id: pagination cursor ID
        :return:
        """
        uri = "/comments/hotflow"
        params = {
            "id": mid_id,
            "mid": mid_id,
            "max_id_type": 0,
        }
        if max_id > 0:
            params.update({"max_id": max_id})
        referer_url = f"https://m.weibo.cn/detail/{mid_id}"
        headers = copy.copy(self.headers)
        headers["Referer"] = referer_url
        return await self.get(uri, params, headers=headers)
    async def get_note_all_comments(self, note_id: str, crawl_interval: float = 1.0, is_fetch_sub_comments=False,
                                    callback: Optional[Callable] = None, ):
        """
        get note all comments include sub comments
        :param note_id:
        :param crawl_interval:
        :param is_fetch_sub_comments:
        :param callback:
        :return:
        """
        result = []
        is_end = False
        max_id = -1
        while not is_end:
            comments_res = await self.get_note_comments(note_id, max_id)
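            # Treat a missing max_id as the end of pagination so the next request never receives None
            max_id: int = comments_res.get("max_id") or 0
            comment_list: List[Dict] = comments_res.get("data", [])
            is_end = max_id == 0
            if callback:  # run the callback if one was provided
                await callback(note_id, comment_list)
            await asyncio.sleep(crawl_interval)
            # Always collect top-level comments, whether or not sub-comments are requested
            result.extend(comment_list)
            if not is_fetch_sub_comments:
                continue
            # todo handle get sub comments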
        return result
    async def get_note_info_by_id(self, note_id: str) -> Dict:
        """
        Get note detail by note ID
        :param note_id:
        :return:
        """
        url = f"{self._host}/detail/{note_id}"
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request(
                "GET", url, timeout=self.timeout, headers=self.headers
            )
            if response.status_code != 200:
                raise DataFetchError(f"get weibo detail err: {response.text}")
            match = re.search(r'var \$render_data = (\[.*?\])\[0\]', response.text, re.DOTALL)
            if match:
                render_data_json = match.group(1)
                render_data_dict = json.loads(render_data_json)
                note_detail = render_data_dict[0].get("status")
                note_item = {
                    "mblog": note_detail
                }
                return note_item
            else:
                utils.logger.info("[WeiboClient.get_note_info_by_id] $render_data value not found")
                return dict()
    async def get_note_image(self, image_url: str) -> Optional[bytes]:
        image_url = image_url[8:]  # strip the leading "https://"
        sub_url = image_url.split("/")
        image_url = ""
        for i in range(len(sub_url)):
            if i == 1:
                image_url += "large/"  # always fetch the large, high-resolution image
            elif i == len(sub_url) - 1:
                image_url += sub_url[i]
            else:
                image_url += sub_url[i] + "/"
        # Weibo's image host has hotlink protection, so the image is fetched through a proxy host
        # Weibo images are accessed via i1.wp.com, so the proxy host is prepended to the image path
        final_uri = f"{self._image_agent_host}{image_url}"
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request("GET", final_uri, timeout=self.timeout)
            if response.status_code != 200:
                utils.logger.error(f"[WeiboClient.get_note_image] request {final_uri} err, res:{response.text}")
                return None
            else:
                return response.content
--- a/media_platform/weibo/core.py
+++ b/media_platform/weibo/core.py
@ -0,0 +1,283 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/23 15:41
 # @Desc    : Weibo crawler main flow
 import asyncio
 import os
 import random
 from asyncio import Task
 from typing import Dict, List, Optional, Tuple
 from playwright.async_api import (BrowserContext, BrowserType, Page,
                                  async_playwright)
 import config
 from base.base_crawler import AbstractCrawler
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import weibo as weibo_store
 from tools import utils
 from var import crawler_type_var
 from .client import WeiboClient
 from .exception import DataFetchError
 from .field import SearchType
 from .help import filter_search_result_card
 from .login import WeiboLogin
 class WeiboCrawler(AbstractCrawler):
    context_page: Page
    wb_client: WeiboClient
    browser_context: BrowserContext
    def __init__(self):
        self.index_url = "https://www.weibo.com"
        self.mobile_index_url = "https://m.weibo.cn"
        self.user_agent = utils.get_user_agent()
        self.mobile_user_agent = utils.get_mobile_user_agent()
    async def start(self):
        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
            playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
        async with async_playwright() as playwright:
            # Launch a browser context.
            chromium = playwright.chromium
            self.browser_context = await self.launch_browser(
                chromium,
                None,
                self.mobile_user_agent,
                headless=config.HEADLESS
            )
            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")
            self.context_page = await self.browser_context.new_page()
            await self.context_page.goto(self.mobile_index_url)
            # Create a client to interact with the weibo website.
            self.wb_client = await self.create_weibo_client(httpx_proxy_format)
            if not await self.wb_client.pong():
                login_obj = WeiboLogin(
                    login_type=config.LOGIN_TYPE,
                    login_phone="",  # your phone number
                    browser_context=self.browser_context,
                    context_page=self.context_page,
                    cookie_str=config.COOKIES
                )
                await self.context_page.goto(self.index_url)
                await asyncio.sleep(1)
                await login_obj.begin()
                # After login succeeds, redirect to the mobile site, then update the cookies from the mobile login
                utils.logger.info("[WeiboCrawler.start] redirect weibo mobile homepage and update cookies on mobile platform")
                await self.context_page.goto(self.mobile_index_url)
                await asyncio.sleep(2)
                await self.wb_client.update_cookies(browser_context=self.browser_context)
            crawler_type_var.set(config.CRAWLER_TYPE)
            if config.CRAWLER_TYPE == "search":
                # Search for notes and retrieve their comment information.
                await self.search()
            elif config.CRAWLER_TYPE == "detail":
                # Get the information and comments of the specified post
                await self.get_specified_notes()
            else:
                pass
            utils.logger.info("[WeiboCrawler.start] Weibo Crawler finished ...")
    async def search(self):
        """
        search weibo note with keywords
        :return:
        """
        utils.logger.info("[WeiboCrawler.search] Begin search weibo keywords")
        weibo_limit_count = 10  # weibo limit page fixed value
        if config.CRAWLER_MAX_NOTES_COUNT < weibo_limit_count:
            config.CRAWLER_MAX_NOTES_COUNT = weibo_limit_count
        start_page = config.START_PAGE
        for keyword in config.KEYWORDS.split(","):
            utils.logger.info(f"[WeiboCrawler.search] Current search keyword: {keyword}")
            page = 1
            while (page - start_page + 1) * weibo_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
                if page < start_page:
                    utils.logger.info(f"[WeiboCrawler.search] Skip page: {page}")
                    page += 1
                    continue
                utils.logger.info(f"[WeiboCrawler.search] search weibo keyword: {keyword}, page: {page}")
                search_res = await self.wb_client.get_note_by_keyword(
                    keyword=keyword,
                    page=page,
                    search_type=SearchType.DEFAULT
                )
                note_id_list: List[str] = []
                note_list = filter_search_result_card(search_res.get("cards"))
                for note_item in note_list:
                    if note_item:
                        mblog: Dict = note_item.get("mblog")
                        if mblog:
                            note_id_list.append(mblog.get("id"))
                            await weibo_store.update_weibo_note(note_item)
                            await self.get_note_images(mblog)
                page += 1
                await self.batch_get_notes_comments(note_id_list)
    async def get_specified_notes(self):
        """
        get specified notes info
        :return:
        """
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_note_info_task(note_id=note_id, semaphore=semaphore) for note_id in
            config.WEIBO_SPECIFIED_ID_LIST
        ]
        note_details = await asyncio.gather(*task_list)
        for note_item in note_details:
            if note_item:
                await weibo_store.update_weibo_note(note_item)
        await self.batch_get_notes_comments(config.WEIBO_SPECIFIED_ID_LIST)
    async def get_note_info_task(self, note_id: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
        """
        Get note detail task
        :param note_id:
        :param semaphore:
        :return:
        """
        async with semaphore:
            try:
                result = await self.wb_client.get_note_info_by_id(note_id)
                return result
            except DataFetchError as ex:
                utils.logger.error(f"[WeiboCrawler.get_note_info_task] Get note detail error: {ex}")
                return None
            except KeyError as ex:
                utils.logger.error(
                    f"[WeiboCrawler.get_note_info_task] have not found note detail note_id:{note_id}, err: {ex}")
                return None
    async def batch_get_notes_comments(self, note_id_list: List[str]):
        """
        batch get notes comments
        :param note_id_list:
        :return:
        """
        if not config.ENABLE_GET_COMMENTS:
            utils.logger.info("[WeiboCrawler.batch_get_notes_comments] Crawling comment mode is not enabled")
            return
        utils.logger.info(f"[WeiboCrawler.batch_get_notes_comments] note ids:{note_id_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_id_list:
            task = asyncio.create_task(self.get_note_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)
    async def get_note_comments(self, note_id: str, semaphore: asyncio.Semaphore):
        """
        get comment for note id
        :param note_id:
        :param semaphore:
        :return:
        """
        async with semaphore:
            try:
                utils.logger.info(f"[WeiboCrawler.get_note_comments] begin get note_id: {note_id} comments ...")
                await self.wb_client.get_note_all_comments(
                    note_id=note_id,
                    crawl_interval=random.randint(1, 10),  # Weibo rate-limits its API heavily, so use a longer delay
                    callback=weibo_store.batch_update_weibo_note_comments
                )
            except DataFetchError as ex:
                utils.logger.error(f"[WeiboCrawler.get_note_comments] get note_id: {note_id} comment error: {ex}")
            except Exception as e:
                utils.logger.error(f"[WeiboCrawler.get_note_comments] may have been blocked, err:{e}")
    async def get_note_images(self, mblog: Dict):
        """
        get note images
        :param mblog:
        :return:
        """
        if not config.ENABLE_GET_IMAGES:
            utils.logger.info("[WeiboCrawler.get_note_images] Crawling image mode is not enabled")
            return
        pics: List[Dict] = mblog.get("pics")
        if not pics:
            return
        for pic in pics:
            url = pic.get("url")
            if not url:
                continue
            content = await self.wb_client.get_note_image(url)
            if content is not None:
                extension_file_name = url.split(".")[-1]
                await weibo_store.update_weibo_note_image(pic["pid"], content, extension_file_name)
    async def create_weibo_client(self, httpx_proxy: Optional[str]) -> WeiboClient:
        """Create weibo client"""
        utils.logger.info("[WeiboCrawler.create_weibo_client] Begin create weibo API client ...")
        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
        weibo_client_obj = WeiboClient(
            proxies=httpx_proxy,
            headers={
                "User-Agent": utils.get_mobile_user_agent(),
                "Cookie": cookie_str,
                "Origin": "https://m.weibo.cn",
                "Referer": "https://m.weibo.cn",
                "Content-Type": "application/json;charset=UTF-8"
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
        )
        return weibo_client_obj
    @staticmethod
    def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
        """format proxy info for playwright and httpx"""
        playwright_proxy = {
            "server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
            "username": ip_proxy_info.user,
            "password": ip_proxy_info.password,
        }
        httpx_proxy = {
            f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
        }
        return playwright_proxy, httpx_proxy
    async def launch_browser(
            self,
            chromium: BrowserType,
            playwright_proxy: Optional[Dict],
            user_agent: Optional[str],
            headless: bool = True
    ) -> BrowserContext:
        """Launch browser and create browser context"""
        utils.logger.info("[WeiboCrawler.launch_browser] Begin create browser context ...")
        if config.SAVE_LOGIN_STATE:
            user_data_dir = os.path.join(os.getcwd(), "browser_data",
                                         config.USER_DATA_DIR % config.PLATFORM)  # type: ignore
            browser_context = await chromium.launch_persistent_context(
                user_data_dir=user_data_dir,
                accept_downloads=True,
                headless=headless,
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
        else:
            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
--- a/media_platform/weibo/exception.py
+++ b/media_platform/weibo/exception.py
@ -0,0 +1,14 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 18:44
 # @Desc    :
 from httpx import RequestError
 class DataFetchError(RequestError):
    """Raised when something goes wrong while fetching data"""
 class IPBlockError(RequestError):
    """Raised when requests are sent so fast that the server blocks our IP"""
--- a/media_platform/weibo/field.py
+++ b/media_platform/weibo/field.py
@ -0,0 +1,19 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/23 15:41
 # @Desc    :
 from enum import Enum
 class SearchType(Enum):
    # Comprehensive
    DEFAULT = "1"
    # Real-time
    REAL_TIME = "61"
    # Popular
    POPULAR = "60"
    # Video
    VIDEO = "64"
--- a/media_platform/weibo/help.py
+++ b/media_platform/weibo/help.py
@ -0,0 +1,25 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/24 17:37
 # @Desc    :
 from typing import Dict, List
 def filter_search_result_card(card_list: List[Dict]) -> List[Dict]:
    """
    Filter Weibo search results, keeping only cards whose card_type is 9
    :param card_list:
    :return:
    """
    note_list: List[Dict] = []
    for card_item in card_list:
        if card_item.get("card_type") == 9:
            note_list.append(card_item)
        if len(card_item.get("card_group", [])) > 0:
            card_group = card_item.get("card_group")
            for card_group_item in card_group:
                if card_group_item.get("card_type") == 9:
                    note_list.append(card_group_item)
    return note_list
--- a/media_platform/weibo/login.py
+++ b/media_platform/weibo/login.py
@ -0,0 +1,137 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/23 15:42
 # @Desc    : Weibo login implementation
 import asyncio
 import functools
 import sys
 from typing import Optional
 from playwright.async_api import BrowserContext, Page
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                      wait_fixed)
 import config
 from base.base_crawler import AbstractLogin
 from tools import utils
 class WeiboLogin(AbstractLogin):
    def __init__(self,
                 login_type: str,
                 browser_context: BrowserContext,
                 context_page: Page,
                 login_phone: Optional[str] = "",
                 cookie_str: str = ""
                 ):
        config.LOGIN_TYPE = login_type
        self.browser_context = browser_context
        self.context_page = context_page
        self.login_phone = login_phone
        self.cookie_str = cookie_str
    async def begin(self):
        """Start login weibo"""
        utils.logger.info("[WeiboLogin.begin] Begin login weibo ...")
        if config.LOGIN_TYPE == "qrcode":
            await self.login_by_qrcode()
        elif config.LOGIN_TYPE == "phone":
            await self.login_by_mobile()
        elif config.LOGIN_TYPE == "cookie":
            await self.login_by_cookies()
        else:
            raise ValueError(
                "[WeiboLogin.begin] Invalid login type. Currently only qrcode, phone, or cookie are supported ...")
    @retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
    async def check_login_state(self, no_logged_in_session: str) -> bool:
        """
            Check whether login has succeeded; return True if so, otherwise False.
            The retry decorator retries up to 600 times while the return value is False, with a 1-second interval.
            If the maximum number of attempts is reached, RetryError is raised.
        """
        current_cookie = await self.browser_context.cookies()
        _, cookie_dict = utils.convert_cookies(current_cookie)
        current_web_session = cookie_dict.get("WBPSESS")
        if current_web_session != no_logged_in_session:
            return True
        return False
    async def popup_login_dialog(self):
        """If the login dialog box does not pop up automatically, we will manually click the login button"""
        dialog_selector = "xpath=//div[@class='woo-modal-main']"
        try:
            # check dialog box is auto popup and wait for 4 seconds
            await self.context_page.wait_for_selector(dialog_selector, timeout=1000 * 4)
        except Exception as e:
            utils.logger.error(
                f"[WeiboLogin.popup_login_dialog] login dialog box does not pop up automatically, error: {e}")
            utils.logger.info(
                "[WeiboLogin.popup_login_dialog] login dialog box does not pop up automatically, we will manually click the login button")
            # Scroll down 500 pixels
            await self.context_page.mouse.wheel(0,500)
            await asyncio.sleep(0.5)
            try:
                # click login button
                login_button_ele = self.context_page.locator(
                    "xpath=//a[text()='登录']",
                )
                await login_button_ele.click()
                await asyncio.sleep(0.5)
            except Exception as e:
                utils.logger.info(f"[WeiboLogin.popup_login_dialog] manually clicking the login button failed, maybe the login dialog already appeared: {e}")
    async def login_by_qrcode(self):
        """login weibo website and keep webdriver login state"""
        utils.logger.info("[WeiboLogin.login_by_qrcode] Begin login weibo by qrcode ...")
        await self.popup_login_dialog()
        # find login qrcode
        qrcode_img_selector = "//div[@class='woo-modal-main']//img"
        base64_qrcode_img = await utils.find_login_qrcode(
            self.context_page,
            selector=qrcode_img_selector
        )
        if not base64_qrcode_img:
            utils.logger.info("[WeiboLogin.login_by_qrcode] login failed, QR code not found, please check ....")
            sys.exit()
        # show login qrcode
        partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
        asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
        utils.logger.info("[WeiboLogin.login_by_qrcode] Waiting for scan code login, remaining time is 600s")
        # get not logged session
        current_cookie = await self.browser_context.cookies()
        _, cookie_dict = utils.convert_cookies(current_cookie)
        no_logged_in_session = cookie_dict.get("WBPSESS")
        try:
            await self.check_login_state(no_logged_in_session)
        except RetryError:
            utils.logger.info("[WeiboLogin.login_by_qrcode] Login weibo failed by qrcode login method ...")
            sys.exit()
        wait_redirect_seconds = 5
        utils.logger.info(
            f"[WeiboLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
        await asyncio.sleep(wait_redirect_seconds)
    async def login_by_mobile(self):
        pass
    async def login_by_cookies(self):
        utils.logger.info("[WeiboLogin.login_by_cookies] Begin login weibo by cookie ...")
        for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
            await self.browser_context.add_cookies([{
                'name': key,
                'value': value,
                'domain': ".weibo.cn",
                'path': "/"
            }])
--- a/media_platform/xhs/__init__.py
+++ b/media_platform/xhs/__init__.py
@ -0,0 +1,2 @@
 from .core import XiaoHongShuCrawler
 from .field import *
--- a/media_platform/xhs/client.py
+++ b/media_platform/xhs/client.py
@ -0,0 +1,419 @@
 import asyncio
 import json
 import re
 from typing import Any, Callable, Dict, List, Optional, Union
 from urllib.parse import urlencode
 from bs4 import BeautifulSoup
 import httpx
 from playwright.async_api import BrowserContext, Page
 import config
 from base.base_crawler import AbstractApiClient
 from tools import utils
 from .exception import DataFetchError, IPBlockError
 from .field import SearchNoteType, SearchSortType
 from .help import get_search_id, sign
 class XiaoHongShuClient(AbstractApiClient):
    def __init__(
            self,
            timeout=10,
            proxies=None,
            *,
            headers: Dict[str, str],
            playwright_page: Page,
            cookie_dict: Dict[str, str],
    ):
        self.proxies = proxies
        self.timeout = timeout
        self.headers = headers
        self._host = "https://edith.xiaohongshu.com"
        self._domain = "https://www.xiaohongshu.com"
        self.IP_ERROR_STR = "网络连接异常，请检查网络设置或重启试试"
        self.IP_ERROR_CODE = 300012
        self.NOTE_ABNORMAL_STR = "笔记状态异常，请稍后查看"
        self.NOTE_ABNORMAL_CODE = -510001
        self.playwright_page = playwright_page
        self.cookie_dict = cookie_dict
    async def _pre_headers(self, url: str, data=None) -> Dict:
        """
        Sign the request header parameters
        Args:
            url:
            data:
        Returns:
        """
        encrypt_params = await self.playwright_page.evaluate("([url, data]) => window._webmsxyw(url,data)", [url, data])
        local_storage = await self.playwright_page.evaluate("() => window.localStorage")
        signs = sign(
            a1=self.cookie_dict.get("a1", ""),
            b1=local_storage.get("b1", ""),
            x_s=encrypt_params.get("X-s", ""),
            x_t=str(encrypt_params.get("X-t", ""))
        )
        headers = {
            "X-S": signs["x-s"],
            "X-T": signs["x-t"],
            "x-S-Common": signs["x-s-common"],
            "X-B3-Traceid": signs["x-b3-traceid"]
        }
        self.headers.update(headers)
        return self.headers
    async def request(self, method, url, **kwargs) -> Union[str, Any]:
        """
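        Shared httpx request wrapper that post-processes the response
        Args:
            method: HTTP method
            url: request URL
            **kwargs: other request parameters, e.g. headers and body
        Returns:
        """
        # When return_response is set, return the raw response text instead of the parsed JSON data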
        return_response = kwargs.pop('return_response', False)
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request(
                method, url, timeout=self.timeout,
                **kwargs
            )
        if return_response:
            return response.text
        data: Dict = response.json()
        if data["success"]:
            return data.get("data", data.get("success", {}))
        elif data["code"] == self.IP_ERROR_CODE:
            raise IPBlockError(self.IP_ERROR_STR)
        else:
            raise DataFetchError(data.get("msg", None))
    async def get(self, uri: str, params=None) -> Dict:
        """
        GET request with signed headers
        Args:
            uri: request route
            params: request parameters
        Returns:
        """
        final_uri = uri
        if isinstance(params, dict):
            final_uri = (f"{uri}?"
                         f"{urlencode(params)}")
        headers = await self._pre_headers(final_uri)
        return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=headers)
    async def post(self, uri: str, data: dict) -> Dict:
        """
        POST request with signed headers
        Args:
            uri: request route
            data: request body parameters
        Returns:
        """
        headers = await self._pre_headers(uri, data)
        json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
        return await self.request(method="POST", url=f"{self._host}{uri}",
                                  data=json_str, headers=headers)
    async def pong(self) -> bool:
        """
        Check whether the login state has expired by searching for a note
        Returns:
        """
        utils.logger.info("[XiaoHongShuClient.pong] Begin to pong xhs...")
        ping_flag = False
        try:
            note_card: Dict = await self.get_note_by_keyword(keyword="小红书")
            if note_card.get("items"):
                ping_flag = True
        except Exception as e:
            utils.logger.error(f"[XiaoHongShuClient.pong] Ping xhs failed: {e}, and try to login again...")
            ping_flag = False
        return ping_flag
    async def update_cookies(self, browser_context: BrowserContext):
        """
        Update-cookies method provided by the API client; normally called after a successful login
        Args:
            browser_context: browser context object
        Returns:
        """
        cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
        self.headers["Cookie"] = cookie_str
        self.cookie_dict = cookie_dict
    async def get_note_by_keyword(
            self, keyword: str,
            page: int = 1, page_size: int = 20,
            sort: SearchSortType = SearchSortType.GENERAL,
            note_type: SearchNoteType = SearchNoteType.ALL
    ) -> Dict:
        """
        Search notes by keyword
        Args:
            keyword: search keyword
            page: page number
            page_size: number of items per page
            sort: sort order of the search results
            note_type: type of note to search for
        Returns:
        """
        uri = "/api/sns/web/v1/search/notes"
        data = {
            "keyword": keyword,
            "page": page,
            "page_size": page_size,
            "search_id": get_search_id(),
            "sort": sort.value,
            "note_type": note_type.value
        }
        return await self.post(uri, data)
    async def get_note_by_id(self, note_id: str) -> Dict:
        """
        Get note detail API
        Args:
            note_id: note ID
        Returns:
        """
        data = {"source_note_id": note_id}
        uri = "/api/sns/web/v1/feed"
        res = await self.post(uri, data)
        if res and res.get("items"):
            res_dict: Dict = res["items"][0]["note_card"]
            return res_dict
        utils.logger.error(f"[XiaoHongShuClient.get_note_by_id] get note empty and res:{res}")
        return dict()
    async def get_note_comments(self, note_id: str, cursor: str = "") -> Dict:
        """
        API for fetching top-level comments
        Args:
            note_id: note ID
            cursor: pagination cursor
        Returns:
        """
        uri = "/api/sns/web/v2/comment/page"
        params = {
            "note_id": note_id,
            "cursor": cursor,
            "top_comment_id": "",
            "image_formats": "jpg,webp,avif"
        }
        return await self.get(uri, params)
    async def get_note_sub_comments(self, note_id: str, root_comment_id: str, num: int = 10, cursor: str = ""):
        """
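        API for fetching the sub-comments under a given parent comment
        Args:
            note_id: ID of the note the sub-comments belong to
            root_comment_id: root comment ID
            num: page size
            cursor: pagination cursor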
        Returns:
        """
        uri = "/api/sns/web/v2/comment/sub/page"
        params = {
            "note_id": note_id,
            "root_comment_id": root_comment_id,
            "num": num,
            "cursor": cursor,
        }
        return await self.get(uri, params)
    async def get_note_all_comments(self, note_id: str, crawl_interval: float = 1.0,
                                    callback: Optional[Callable] = None) -> List[Dict]:
        """
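        Fetch all top-level comments of a note; this method keeps paging until every comment under the note is retrieved
        Args:
            note_id: note ID
            crawl_interval: delay between note crawl requests, in seconds
            callback: called after each crawl of a note's comment page completes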
        Returns:
        """
        result = []
        comments_has_more = True
        comments_cursor = ""
        while comments_has_more:
            comments_res = await self.get_note_comments(note_id, comments_cursor)
            comments_has_more = comments_res.get("has_more", False)
            comments_cursor = comments_res.get("cursor", "")
            if "comments" not in comments_res:
                utils.logger.info(
                    f"[XiaoHongShuClient.get_note_all_comments] No 'comments' key found in response: {comments_res}")
                break
            comments = comments_res["comments"]
            if callback:
                await callback(note_id, comments)
            await asyncio.sleep(crawl_interval)
            result.extend(comments)
            sub_comments = await self.get_comments_all_sub_comments(comments, crawl_interval, callback)
            result.extend(sub_comments)
        return result
    async def get_comments_all_sub_comments(self, comments: List[Dict], crawl_interval: float = 1.0,
                                    callback: Optional[Callable] = None) -> List[Dict]:
        """
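        Fetch all second-level comments under the given top-level comments; this method keeps paging until every sub-comment is retrieved
        Args:
            comments: list of comments
            crawl_interval: delay between comment crawl requests, in seconds
            callback: called after each crawl of a comment page completes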
        Returns:
        """
        if not config.ENABLE_GET_SUB_COMMENTS:
            utils.logger.info("[XiaoHongShuClient.get_comments_all_sub_comments] Crawling sub_comment mode is not enabled")
            return []
        result = []
        for comment in comments:
            note_id = comment.get("note_id")
            sub_comments = comment.get("sub_comments")
            if sub_comments and callback:
                await callback(note_id, sub_comments)
            sub_comment_has_more = comment.get("sub_comment_has_more")
            if not sub_comment_has_more:
                continue
            root_comment_id = comment.get("id")
            sub_comment_cursor = comment.get("sub_comment_cursor")
            while sub_comment_has_more:
                comments_res = await self.get_note_sub_comments(note_id, root_comment_id, 10, sub_comment_cursor)
                sub_comment_has_more = comments_res.get("has_more", False)
                sub_comment_cursor = comments_res.get("cursor", "")
                if "comments" not in comments_res:
                    utils.logger.info(
                        f"[XiaoHongShuClient.get_comments_all_sub_comments] No 'comments' key found in response: {comments_res}")
                    break
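                # Use a separate name so the list being iterated by the outer loop is not rebound.
                sub_comments_page = comments_res["comments"]
                if callback:
                    await callback(note_id, sub_comments_page)
                await asyncio.sleep(crawl_interval)
                result.extend(sub_comments_page)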
        return result
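    async def get_explore_id(self) -> list:
        """Parse the /explore page HTML and return the note IDs found in the feed."""
        uri = "/explore"
        html_content = await self.request("GET", self._domain + uri, return_response=True, headers=self.headers)
        soup = BeautifulSoup(html_content, 'html.parser')
        div = soup.find('div', class_='feeds-container')
        if div is None:
            # The feed container is missing, so there is nothing to parse.
            return []
        explore_id = []
        for s in div.find_all('section'):
            a = s.find('a')
            if a is None or not a.get('href'):
                continue
            # The href is expected to split into exactly three parts, with the note ID last.
            tmp_list = a['href'].split('/')
            if len(tmp_list) != 3:
                continue
            explore_id.append(tmp_list[2])
        return explore_id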
    async def get_creator_info(self, user_id: str) -> Dict:
        """
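        Get a user's brief profile information by parsing the HTML of the web profile page.
        On the PC web profile page the data is stored in the window.__INITIAL_STATE__ variable, so parsing it is enough.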
        eg: https://www.xiaohongshu.com/user/profile/59d8cb33de5fb4696bf17217
        """
        uri = f"/user/profile/{user_id}"
        html_content = await self.request("GET", self._domain + uri, return_response=True, headers=self.headers)
        match = re.search(r'<script>window.__INITIAL_STATE__=(.+)<\/script>', html_content, re.M)
        if match is None:
            return {}
        info = json.loads(match.group(1).replace(':undefined', ':null'), strict=False)
        if info is None:
            return {}
        # Guard against a missing 'user' key so the method always returns a dict.
        return (info.get('user') or {}).get('userPageData') or {}
    async def get_notes_by_creator(
            self, creator: str,
            cursor: str,
            page_size: int = 30
    ) -> Dict:
        """
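        Get one page of a creator's notes
        Args:
            creator: creator ID
            cursor: ID of the last note on the previous page
            page_size: number of notes per page
        Returns:
            The API response for the requested page.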
        """
        uri = "/api/sns/web/v1/user_posted"
        data = {
            "user_id": creator,
            "cursor": cursor,
            "num": page_size,
            "image_formats": "jpg,webp,avif"
        }
        return await self.get(uri, data)
    async def get_all_notes_by_creator(self, user_id: str, crawl_interval: float = 1.0,
                                       callback: Optional[Callable] = None) -> List[Dict]:
        """
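        Fetch all notes posted by the given user, paging until no more remain.
        Args:
            user_id: user ID
            crawl_interval: delay after each page request, in seconds
            callback: async update callback invoked with the notes of each page after it is fetched
        Returns:
            A list of all notes collected for the user.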
        """
        result = []
        notes_has_more = True
        notes_cursor = ""
        while notes_has_more:
            notes_res = await self.get_notes_by_creator(user_id, notes_cursor)
            if not notes_res:
                utils.logger.error("[XiaoHongShuClient.get_all_notes_by_creator] The current creator may have been banned by xhs, so their data cannot be accessed.")
                break
            notes_has_more = notes_res.get("has_more", False)
            notes_cursor = notes_res.get("cursor", "")
            if "notes" not in notes_res:
                utils.logger.info(f"[XiaoHongShuClient.get_all_notes_by_creator] No 'notes' key found in response: {notes_res}")
                break
            notes = notes_res["notes"]
            utils.logger.info(f"[XiaoHongShuClient.get_all_notes_by_creator] got user_id:{user_id} notes len : {len(notes)}")
            if callback:
                await callback(notes)
            await asyncio.sleep(crawl_interval)
            result.extend(notes)
        return result
--- a/media_platform/xhs/core.py
+++ b/media_platform/xhs/core.py
@ -0,0 +1,300 @@
 import asyncio
 import os
 import random
 from asyncio import Task
 import time
 from typing import Dict, List, Optional, Tuple
 from playwright.async_api import (BrowserContext, BrowserType, Page,
                                  async_playwright)
 import config
 from base.base_crawler import AbstractCrawler
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import xhs as xhs_store
 from tools import utils
 from var import crawler_type_var
 from .client import XiaoHongShuClient
 from .exception import DataFetchError
 from .field import SearchSortType
 from .login import XiaoHongShuLogin
 class XiaoHongShuCrawler(AbstractCrawler):
    context_page: Page
    xhs_client: XiaoHongShuClient
    browser_context: BrowserContext
    def __init__(self) -> None:
        self.index_url = "https://www.xiaohongshu.com"
        self.user_agent = utils.get_user_agent()
    async def start(self) -> None:
        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
            playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
        async with async_playwright() as playwright:
            # Launch a browser context.
            chromium = playwright.chromium
            self.browser_context = await self.launch_browser(
                chromium,
                None,
                self.user_agent,
                headless=config.HEADLESS
            )
            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")
            # add a cookie attribute webId to avoid the appearance of a sliding captcha on the webpage
            await self.browser_context.add_cookies([{
                'name': "webId",
                'value': "xxx123",  # any value
                'domain': ".xiaohongshu.com",
                'path': "/"
            }])
            self.context_page = await self.browser_context.new_page()
            await self.context_page.goto(self.index_url)
            # Create a client to interact with the xiaohongshu website.
            self.xhs_client = await self.create_xhs_client(httpx_proxy_format)
            if not await self.xhs_client.pong():
                login_obj = XiaoHongShuLogin(
                    login_type=config.LOGIN_TYPE,
                    login_phone="",  # input your phone number
                    browser_context=self.browser_context,
                    context_page=self.context_page,
                    cookie_str=config.COOKIES
                )
                await login_obj.begin()
                await self.xhs_client.update_cookies(browser_context=self.browser_context)
            crawler_type_var.set(config.CRAWLER_TYPE)
            if config.CRAWLER_TYPE == "search":
                # Search for notes and retrieve their comment information.
                await self.search()
            elif config.CRAWLER_TYPE == "detail":
                # Get the information and comments of the specified post
                await self.get_specified_notes()
            elif config.CRAWLER_TYPE == "creator":
                # Get creator's information and their notes and comments
                await self.get_creators_and_notes()
            elif config.CRAWLER_TYPE == "explore": 
                await self.get_explore()
            else:
                pass
            utils.logger.info("[XiaoHongShuCrawler.start] Xhs Crawler finished ...")
    async def search(self) -> None:
        """Search for notes and retrieve their comment information."""
        utils.logger.info("[XiaoHongShuCrawler.search] Begin search xiaohongshu keywords")
        xhs_limit_count = 20  # xhs limit page fixed value
        if config.CRAWLER_MAX_NOTES_COUNT < xhs_limit_count:
            config.CRAWLER_MAX_NOTES_COUNT = xhs_limit_count
        start_page = config.START_PAGE
        for keyword in config.KEYWORDS.split(","):
            utils.logger.info(f"[XiaoHongShuCrawler.search] Current search keyword: {keyword}")
            page = 1
            while (page - start_page + 1) * xhs_limit_count <= config.CRAWLER_MAX_NOTES_COUNT:
                if page < start_page:
                    utils.logger.info(f"[XiaoHongShuCrawler.search] Skip page {page}")
                    page += 1
                    continue
                try:
                    utils.logger.info(f"[XiaoHongShuCrawler.search] search xhs keyword: {keyword}, page: {page}")
                    note_id_list: List[str] = []
                    notes_res = await self.xhs_client.get_note_by_keyword(
                        keyword=keyword,
                        page=page,
                        sort=SearchSortType(config.SORT_TYPE) if config.SORT_TYPE != '' else SearchSortType.GENERAL,
                    )
                    utils.logger.info(f"[XiaoHongShuCrawler.search] Search notes res:{notes_res}")
                    semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
                    task_list = [
                        self.get_note_detail(post_item.get("id"), semaphore)
                        for post_item in notes_res.get("items", {})
                        if post_item.get('model_type') not in ('rec_query', 'hot_query')
                    ]
                    note_details = await asyncio.gather(*task_list)
                    for note_detail in note_details:
                        if note_detail is not None:
                            await xhs_store.update_xhs_note(note_detail)
                            note_id_list.append(note_detail.get("note_id"))
                    page += 1
                    utils.logger.info(f"[XiaoHongShuCrawler.search] Note details: {note_details}")
                    await self.batch_get_note_comments(note_id_list)
                except DataFetchError:
                    utils.logger.error("[XiaoHongShuCrawler.search] Get note detail error")
                    break
    async def get_explore(self) -> None:
        explore_id = await self.xhs_client.get_explore_id()
        print("[+]GET explore content:")
        for note_id in explore_id:
            note_info = await self.xhs_client.get_note_by_id(note_id)
            ip_location = note_info['ip_location']
            last_update_time = str(note_info['last_update_time'])
            user_name = note_info['user']['nickname']
            content = note_info['desc']
            # last_update_time is treated as a millisecond timestamp; drop the last three digits to get seconds
            update_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(last_update_time[:-3])))
            # Show at most the first 40 characters, with newlines escaped
            show = content[:40].replace("\n", "\\n")
            if len(content) > 40:
                show += "..."
            print(f"[*]IP:{ip_location},Update Time:{update_time},User Name:{user_name},Content:{show}")
    async def get_creators_and_notes(self) -> None:
        """Get creator's notes and retrieve their comment information."""
        utils.logger.info("[XiaoHongShuCrawler.get_creators_and_notes] Begin get xiaohongshu creators")
        for user_id in config.XHS_CREATOR_ID_LIST:
            # get creator detail info from web html content
            creator_info: Dict = await self.xhs_client.get_creator_info(user_id=user_id)
            if creator_info:
                await xhs_store.save_creator(user_id, creator=creator_info)
            # Get all note information of the creator
            all_notes_list = await self.xhs_client.get_all_notes_by_creator(
                user_id=user_id,
                crawl_interval=random.random(),
                callback=self.fetch_creator_notes_detail
            )
            note_ids = [note_item.get("note_id") for note_item in all_notes_list]
            # print("note_ids:",note_ids)
            await self.batch_get_note_comments(note_ids)
    async def fetch_creator_notes_detail(self, note_list: List[Dict]):
        """
        Concurrently obtain the specified post list and save the data
        """
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_note_detail(post_item.get("note_id"), semaphore) for post_item in note_list
        ]
        note_details = await asyncio.gather(*task_list)
        for note_detail in note_details:
            if note_detail is not None:
                await xhs_store.update_xhs_note(note_detail)
    async def get_specified_notes(self):
        """Get the information and comments of the specified post"""
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_note_detail(note_id=note_id, semaphore=semaphore) for note_id in config.XHS_SPECIFIED_ID_LIST
        ]
        note_details = await asyncio.gather(*task_list)
        for note_detail in note_details:
            if note_detail is not None:
                await xhs_store.update_xhs_note(note_detail)
        await self.batch_get_note_comments(config.XHS_SPECIFIED_ID_LIST)
    async def get_note_detail(self, note_id: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
        """Get note detail"""
        async with semaphore:
            try:
                return await self.xhs_client.get_note_by_id(note_id)
            except DataFetchError as ex:
                utils.logger.error(f"[XiaoHongShuCrawler.get_note_detail] Get note detail error: {ex}")
                return None
            except KeyError as ex:
                utils.logger.error(
                    f"[XiaoHongShuCrawler.get_note_detail] Note detail not found, note_id: {note_id}, err: {ex}")
                return None
    async def batch_get_note_comments(self, note_list: List[str]):
        """Batch get note comments"""
        if not config.ENABLE_GET_COMMENTS:
            utils.logger.info(f"[XiaoHongShuCrawler.batch_get_note_comments] Crawling comment mode is not enabled")
            return
        utils.logger.info(
            f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_list:
            task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)
    async def get_comments(self, note_id: str, semaphore: asyncio.Semaphore):
        """Get note comments with keyword filtering and quantity limitation"""
        async with semaphore:
            utils.logger.info(f"[XiaoHongShuCrawler.get_comments] Begin get note id comments {note_id}")
            await self.xhs_client.get_note_all_comments(
                note_id=note_id,
                crawl_interval=random.random(),
                callback=xhs_store.batch_update_xhs_note_comments
            )
    @staticmethod
    def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
        """format proxy info for playwright and httpx"""
        playwright_proxy = {
            "server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
            "username": ip_proxy_info.user,
            "password": ip_proxy_info.password,
        }
        httpx_proxy = {
            f"{ip_proxy_info.protocol}": f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
        }
        return playwright_proxy, httpx_proxy
    async def create_xhs_client(self, httpx_proxy: Optional[Dict]) -> XiaoHongShuClient:
        """Create xhs client"""
        utils.logger.info("[XiaoHongShuCrawler.create_xhs_client] Begin create xiaohongshu API client ...")
        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
        xhs_client_obj = XiaoHongShuClient(
            proxies=httpx_proxy,
            headers={
                "User-Agent": self.user_agent,
                "Cookie": cookie_str,
                "Origin": "https://www.xiaohongshu.com",
                "Referer": "https://www.xiaohongshu.com",
                "Content-Type": "application/json;charset=UTF-8"
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
        )
        return xhs_client_obj
    async def launch_browser(
            self,
            chromium: BrowserType,
            playwright_proxy: Optional[Dict],
            user_agent: Optional[str],
            headless: bool = True
    ) -> BrowserContext:
        """Launch browser and create browser context"""
        utils.logger.info("[XiaoHongShuCrawler.launch_browser] Begin create browser context ...")
        if config.SAVE_LOGIN_STATE:
            # feat issue #14
            # we will save login state to avoid login every time
            user_data_dir = os.path.join(os.getcwd(), "browser_data",
                                         config.USER_DATA_DIR % config.PLATFORM)  # type: ignore
            browser_context = await chromium.launch_persistent_context(
                user_data_dir=user_data_dir,
                accept_downloads=True,
                headless=headless,
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
        else:
            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent
            )
            return browser_context
    async def close(self):
        """Close browser context"""
        await self.browser_context.close()
        utils.logger.info("[XiaoHongShuCrawler.close] Browser context closed ...")
--- a/media_platform/xhs/exception.py
+++ b/media_platform/xhs/exception.py
@ -0,0 +1,9 @@
 from httpx import RequestError
 class DataFetchError(RequestError):
    """Raised when fetching data fails"""
 class IPBlockError(RequestError):
    """Raised when requests are sent so fast that the server blocks our IP"""
--- a/media_platform/xhs/field.py
+++ b/media_platform/xhs/field.py
@ -0,0 +1,72 @@
 from enum import Enum
 from typing import NamedTuple
 class FeedType(Enum):
    # Recommended
    RECOMMEND = "homefeed_recommend"
    # Fashion
    FASION = "homefeed.fashion_v3"
    # Food
    FOOD = "homefeed.food_v3"
    # Cosmetics
    COSMETICS = "homefeed.cosmetics_v3"
    # Movies and TV
    MOVIE = "homefeed.movie_and_tv_v3"
    # Career
    CAREER = "homefeed.career_v3"
    # Relationships
    EMOTION = "homefeed.love_v3"
    # Home
    HOURSE = "homefeed.household_product_v3"
    # Gaming
    GAME = "homefeed.gaming_v3"
    # Travel
    TRAVEL = "homefeed.travel_v3"
    # Fitness
    FITNESS = "homefeed.fitness_v3"
 class NoteType(Enum):
    NORMAL = "normal"
    VIDEO = "video"
 class SearchSortType(Enum):
    """search sort type"""
    # default
    GENERAL = "general"
    # most popular
    MOST_POPULAR = "popularity_descending"
    # Latest
    LATEST = "time_descending"
 class SearchNoteType(Enum):
    """search note type
    """
    # default
    ALL = 0
    # only video
    VIDEO = 1
    # only image
    IMAGE = 2
 class Note(NamedTuple):
    """note tuple"""
    note_id: str
    title: str
    desc: str
    type: str
    user: dict
    img_urls: list
    video_url: str
    tag_list: list
    at_user_list: list
    collected_count: str
    comment_count: str
    liked_count: str
    share_count: str
    time: int
    last_update_time: int
--- a/media_platform/xhs/help.py
+++ b/media_platform/xhs/help.py
@ -0,0 +1,287 @@
 import ctypes
 import json
 import random
 import time
 import urllib.parse
 def sign(a1="", b1="", x_s="", x_t=""):
    """
    Build the signed request headers from the a1 cookie value, the b1 localStorage value, and the precomputed x-s and x-t values.
    Returns a dictionary with four keys: "x-s", "x-t", "x-s-common" and "x-b3-traceid".
    """
    common = {
        "s0": 5,  # getPlatformCode
        "s1": "",
        "x0": "1",  # localStorage.getItem("b1b1")
        "x1": "3.3.0",  # version
        "x2": "Windows",
        "x3": "xhs-pc-web",
        "x4": "1.4.4",
        "x5": a1,  # cookie of a1
        "x6": x_t,
        "x7": x_s,
        "x8": b1,  # localStorage.getItem("b1")
        "x9": mrc(x_t + x_s + b1),
        "x10": 1,  # getSigCount
    }
    encode_str = encodeUtf8(json.dumps(common, separators=(',', ':')))
    x_s_common = b64Encode(encode_str)
    x_b3_traceid = get_b3_trace_id()
    return {
        "x-s": x_s,
        "x-t": x_t,
        "x-s-common": x_s_common,
        "x-b3-traceid": x_b3_traceid
    }
 def get_b3_trace_id():
    # 16 random lowercase hexadecimal characters
    return "".join(random.choice("abcdef0123456789") for _ in range(16))
 def mrc(e):
    ie = [
        0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685,
        2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995,
        2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648,
        2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990,
        1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755,
        2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145,
        1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206,
        2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980,
        1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705,
        3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527,
        1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772,
        4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290,
        251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719,
        3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925,
        453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202,
        4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960,
        984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733,
        3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467,
        855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048,
        3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054,
        702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443,
        3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945,
        2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430,
        2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580,
        2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225,
        1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143,
        2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732,
        1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850,
        2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135,
        1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109,
        3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954,
        1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920,
        3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877,
        83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603,
        3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992,
        534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934,
        4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795,
        376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105,
        3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270,
        936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108,
        3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449,
        601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471,
        3272380065, 1510334235, 755167117,
    ]
    o = -1
    def right_without_sign(num: int, bit: int=0) -> int:
        val = ctypes.c_uint32(num).value >> bit
        MAX32INT = 4294967295
        return (val + (MAX32INT + 1)) % (2 * (MAX32INT + 1)) - MAX32INT - 1
    for n in range(57):
        o = ie[(o & 255) ^ ord(e[n])] ^ right_without_sign(o, 8)
    return o ^ -1 ^ 3988292384
 lookup = [
    "Z",
    "m",
    "s",
    "e",
    "r",
    "b",
    "B",
    "o",
    "H",
    "Q",
    "t",
    "N",
    "P",
    "+",
    "w",
    "O",
    "c",
    "z",
    "a",
    "/",
    "L",
    "p",
    "n",
    "g",
    "G",
    "8",
    "y",
    "J",
    "q",
    "4",
    "2",
    "K",
    "W",
    "Y",
    "j",
    "0",
    "D",
    "S",
    "f",
    "d",
    "i",
    "k",
    "x",
    "3",
    "V",
    "T",
    "1",
    "6",
    "I",
    "l",
    "U",
    "A",
    "F",
    "M",
    "9",
    "7",
    "h",
    "E",
    "C",
    "v",
    "u",
    "R",
    "X",
    "5",
 ]
 def tripletToBase64(e):
    return (
            lookup[63 & (e >> 18)] +
            lookup[63 & (e >> 12)] +
            lookup[(e >> 6) & 63] +
            lookup[e & 63]
    )
 def encodeChunk(e, t, r):
    m = []
    for b in range(t, r, 3):
        n = (16711680 & (e[b] << 16)) + \
            ((e[b + 1] << 8) & 65280) + (e[b + 2] & 255)
        m.append(tripletToBase64(n))
    return ''.join(m)
 def b64Encode(e):
    P = len(e)
    W = P % 3
    U = []
    z = 16383
    H = 0
    Z = P - W
    while H < Z:
        U.append(encodeChunk(e, H, Z if H + z > Z else H + z))
        H += z
    if 1 == W:
        F = e[P - 1]
        U.append(lookup[F >> 2] + lookup[(F << 4) & 63] + "==")
    elif 2 == W:
        F = (e[P - 2] << 8) + e[P - 1]
        U.append(lookup[F >> 10] + lookup[63 & (F >> 4)] +
                 lookup[(F << 2) & 63] + "=")
    return "".join(U)
 def encodeUtf8(e):
    b = []
    m = urllib.parse.quote(e, safe='~()*!.\'')
    w = 0
    while w < len(m):
        T = m[w]
        if T == "%":
            E = m[w + 1] + m[w + 2]
            S = int(E, 16)
            b.append(S)
            w += 2
        else:
            b.append(ord(T[0]))
        w += 1
    return b
 def base36encode(number, alphabet='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    """Converts an integer to a base36 string."""
    if not isinstance(number, int):
        raise TypeError('number must be an integer')
    base36 = ''
    sign = ''
    if number < 0:
        sign = '-'
        number = -number
    if 0 <= number < len(alphabet):
        return sign + alphabet[number]
    while number != 0:
        number, i = divmod(number, len(alphabet))
        base36 = alphabet[i] + base36
    return sign + base36
 def base36decode(number):
    return int(number, 36)
 def get_search_id():
    e = int(time.time() * 1000) << 64
    t = int(random.uniform(0, 2147483646))
    return base36encode((e + t))
 img_cdns = [
    "https://sns-img-qc.xhscdn.com",
    "https://sns-img-hw.xhscdn.com",
    "https://sns-img-bd.xhscdn.com",
    "https://sns-img-qn.xhscdn.com",
 ]
 def get_img_url_by_trace_id(trace_id: str, format_type: str = "png"):
    return f"{random.choice(img_cdns)}/{trace_id}?imageView2/format/{format_type}"
 def get_img_urls_by_trace_id(trace_id: str, format_type: str = "png"):
    return [f"{cdn}/{trace_id}?imageView2/format/{format_type}" for cdn in img_cdns]
 def get_trace_id(img_url: str):
    # Images uploaded from the browser have an extra /spectrum/ path segment
    return f"spectrum/{img_url.split('/')[-1]}" if img_url.find("spectrum") != -1 else img_url.split("/")[-1]
 if __name__ == '__main__':
    _img_url = "https://sns-img-bd.xhscdn.com/7a3abfaf-90c1-a828-5de7-022c80b92aa3"
    # Get the URLs of one image across multiple CDNs
    # final_img_urls = get_img_urls_by_trace_id(get_trace_id(_img_url))
    final_img_url = get_img_url_by_trace_id(get_trace_id(_img_url))
    print(final_img_url)
--- a/media_platform/xhs/login.py
+++ b/media_platform/xhs/login.py
@ -0,0 +1,186 @@
 import asyncio
 import functools
 import sys
 from typing import Optional
 from playwright.async_api import BrowserContext, Page
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                      wait_fixed)
 import config
 from base.base_crawler import AbstractLogin
 from cache.cache_factory import CacheFactory
 from tools import utils
 class XiaoHongShuLogin(AbstractLogin):
    def __init__(self,
                 login_type: str,
                 browser_context: BrowserContext,
                 context_page: Page,
                 login_phone: Optional[str] = "",
                 cookie_str: str = ""
                 ):
        config.LOGIN_TYPE = login_type
        self.browser_context = browser_context
        self.context_page = context_page
        self.login_phone = login_phone
        self.cookie_str = cookie_str
    @retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
    async def check_login_state(self, no_logged_in_session: str) -> bool:
        """
            Check whether login has succeeded: return True if so, otherwise return False.
            The retry decorator makes up to 600 attempts, 1 second apart, while the return value is False;
            once the maximum number of attempts is reached, RetryError is raised.
        """
        if "请通过验证" in await self.context_page.content():
            utils.logger.info("[XiaoHongShuLogin.check_login_state] A captcha appeared during login, please verify manually")
        current_cookie = await self.browser_context.cookies()
        _, cookie_dict = utils.convert_cookies(current_cookie)
        current_web_session = cookie_dict.get("web_session")
        if current_web_session != no_logged_in_session:
            return True
        return False
    async def begin(self):
        """Start login xiaohongshu"""
        utils.logger.info("[XiaoHongShuLogin.begin] Begin login xiaohongshu ...")
        if config.LOGIN_TYPE == "qrcode":
            await self.login_by_qrcode()
        elif config.LOGIN_TYPE == "phone":
            await self.login_by_mobile()
        elif config.LOGIN_TYPE == "cookie":
            await self.login_by_cookies()
        else:
            raise ValueError("[XiaoHongShuLogin.begin] Invalid login type. Currently only qrcode, phone, or cookie is supported ...")
    async def login_by_mobile(self):
        """Login xiaohongshu by mobile"""
        utils.logger.info("[XiaoHongShuLogin.login_by_mobile] Begin login xiaohongshu by mobile ...")
        await asyncio.sleep(1)
        try:
            # After entering the Xiaohongshu home page, the login dialog may not pop up automatically, so the login button has to be clicked
            login_button_ele = await self.context_page.wait_for_selector(
                selector="xpath=//*[@id='app']/div[1]/div[2]/div[1]/ul/div[1]/button",
                timeout=5000
            )
            await login_button_ele.click()
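            # The login dialog also comes in two forms: one shows the phone number and verification code fields directly,
            # the other requires clicking to switch to phone login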
            element = await self.context_page.wait_for_selector(
                selector='xpath=//div[@class="login-container"]//div[@class="other-method"]/div[1]',
                timeout=5000
            )
            await element.click()
        except Exception:
            utils.logger.info("[XiaoHongShuLogin.login_by_mobile] mobile login button not found, continuing ...")
        await asyncio.sleep(1)
        login_container_ele = await self.context_page.wait_for_selector("div.login-container")
        input_ele = await login_container_ele.query_selector("label.phone > input")
        await input_ele.fill(self.login_phone)
        await asyncio.sleep(0.5)
        send_btn_ele = await login_container_ele.query_selector("label.auth-code > span")
        await send_btn_ele.click()  # click to send the SMS verification code
        sms_code_input_ele = await login_container_ele.query_selector("label.auth-code > input")
        submit_btn_ele = await login_container_ele.query_selector("div.input-container > button")
        cache_client = CacheFactory.create_cache(config.CACHE_TYPE_MEMORY)
        max_get_sms_code_time = 60 * 2  # wait at most 2 minutes for the SMS verification code
        no_logged_in_session = ""
        while max_get_sms_code_time > 0:
            utils.logger.info(f"[XiaoHongShuLogin.login_by_mobile] get sms code from cache, remaining time {max_get_sms_code_time}s ...")
            await asyncio.sleep(1)
            sms_code_key = f"xhs_{self.login_phone}"
            sms_code_value = cache_client.get(sms_code_key)
            if not sms_code_value:
                max_get_sms_code_time -= 1
                continue
            current_cookie = await self.browser_context.cookies()
            _, cookie_dict = utils.convert_cookies(current_cookie)
            no_logged_in_session = cookie_dict.get("web_session")
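            # The cached value may be bytes (e.g. from Redis) or str (e.g. from the in-memory cache)
            sms_code = sms_code_value.decode() if isinstance(sms_code_value, bytes) else str(sms_code_value)
            await sms_code_input_ele.fill(value=sms_code)  # enter the SMS verification code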
            await asyncio.sleep(0.5)
            agree_privacy_ele = self.context_page.locator("xpath=//div[@class='agreements']//*[local-name()='svg']")
            await agree_privacy_ele.click()  # click to accept the privacy agreement
            await asyncio.sleep(0.5)
            await submit_btn_ele.click()  # click the login button
            # TODO: the code should also be validated, since the entered verification code may be incorrect
            break
        try:
            await self.check_login_state(no_logged_in_session)
        except RetryError:
            utils.logger.info("[XiaoHongShuLogin.login_by_mobile] Login xiaohongshu failed by mobile login method ...")
            sys.exit()
        wait_redirect_seconds = 5
        utils.logger.info(f"[XiaoHongShuLogin.login_by_mobile] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
        await asyncio.sleep(wait_redirect_seconds)
    async def login_by_qrcode(self):
        """login xiaohongshu website and keep webdriver login state"""
        utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] Begin login xiaohongshu by qrcode ...")
        # login_selector = "div.login-container > div.left > div.qrcode > img"
        qrcode_img_selector = "xpath=//img[@class='qrcode-img']"
        # find login qrcode
        base64_qrcode_img = await utils.find_login_qrcode(
            self.context_page,
            selector=qrcode_img_selector
        )
        if not base64_qrcode_img:
            utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] qrcode not found, clicking the login button and retrying ...")
            # if the website does not pop up the login dialog automatically, click the login button manually
            await asyncio.sleep(0.5)
            login_button_ele = self.context_page.locator("xpath=//*[@id='app']/div[1]/div[2]/div[1]/ul/div[1]/button")
            await login_button_ele.click()
            base64_qrcode_img = await utils.find_login_qrcode(
                self.context_page,
                selector=qrcode_img_selector
            )
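            if not base64_qrcode_img:
                utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] login failed, qrcode still not found, please check ...")
                sys.exit()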
        # record the web_session value before login
        current_cookie = await self.browser_context.cookies()
        _, cookie_dict = utils.convert_cookies(current_cookie)
        no_logged_in_session = cookie_dict.get("web_session")
        # show login qrcode
        # fix issue #12
        # wrap show_qrcode in a partial and run it in an executor
        # so that the current asyncio event loop is not blocked
        partial_show_qrcode = functools.partial(utils.show_qrcode, base64_qrcode_img)
        asyncio.get_running_loop().run_in_executor(executor=None, func=partial_show_qrcode)
        utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] waiting for scan code login, remaining time is 120s")
        try:
            await self.check_login_state(no_logged_in_session)
        except RetryError:
            utils.logger.info("[XiaoHongShuLogin.login_by_qrcode] Login xiaohongshu failed by qrcode login method ...")
            sys.exit()
        wait_redirect_seconds = 5
        utils.logger.info(f"[XiaoHongShuLogin.login_by_qrcode] Login successful then wait for {wait_redirect_seconds} seconds redirect ...")
        await asyncio.sleep(wait_redirect_seconds)
    async def login_by_cookies(self):
        """login xiaohongshu website by cookies"""
        utils.logger.info("[XiaoHongShuLogin.login_by_cookies] Begin login xiaohongshu by cookie ...")
        for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
            if key != "web_session":  # only set web_session cookie attr
                continue
            await self.browser_context.add_cookies([{
                'name': key,
                'value': value,
                'domain': ".xiaohongshu.com",
                'path': "/"
            }])
--- a/mypy.ini
+++ b/mypy.ini
@ -0,0 +1,9 @@
 [mypy]
 warn_return_any = True
 warn_unused_configs = True
 [mypy-cv2]
 ignore_missing_imports = True
 [mypy-execjs]
 ignore_missing_imports = True
--- a/note_info.txt
+++ b/note_info.txt
--- a/proxy/__init__.py
+++ b/proxy/__init__.py
@ -0,0 +1,5 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 14:37
 # @Desc    : IP proxy pool entry point
 from .base_proxy import *
--- a/proxy/base_proxy.py
+++ b/proxy/base_proxy.py
@ -0,0 +1,63 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 11:18
 # @Desc    : Crawler proxy IP retrieval implementation
 # @Url     : Kuaidaili HTTP implementation, official docs: https://www.kuaidaili.com/?ref=ldwkjqipvz6c
 import json
 from abc import ABC, abstractmethod
 from typing import List
 import config
 from cache.abs_cache import AbstractCache
 from cache.cache_factory import CacheFactory
 from tools import utils
 from .types import IpInfoModel
 class IpGetError(Exception):
    """Raised when fetching proxy IPs from a provider fails."""
 class ProxyProvider(ABC):
    @abstractmethod
    async def get_proxies(self, num: int) -> List[IpInfoModel]:
        """
        Abstract method for fetching IPs; each HTTP proxy provider must implement it.
        :param num: number of IPs to fetch
        :return: list of IpInfoModel
        """
        pass
 class IpCache:
    def __init__(self):
        self.cache_client: AbstractCache = CacheFactory.create_cache(cache_type=config.CACHE_TYPE_MEMORY)
    def set_ip(self, ip_key: str, ip_value_info: str, ex: int):
        """
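        Store an IP with an expiration time; once it expires, the cache backend (e.g. redis) removes it.
        :param ip_key: cache key of the IP
        :param ip_value_info: JSON-serialized IP info
        :param ex: expiration time in seconds
        :return: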
        """
        self.cache_client.set(key=ip_key, value=ip_value_info, expire_time=ex)
    def load_all_ip(self, proxy_brand_name: str) -> List[IpInfoModel]:
        """
        Load all unexpired IP info from the cache (redis)
        :param proxy_brand_name: proxy provider name
        :return: list of IpInfoModel
        """
        all_ip_list: List[IpInfoModel] = []
        all_ip_keys: List[str] = self.cache_client.keys(pattern=f"{proxy_brand_name}_*")
        try:
            for ip_key in all_ip_keys:
                ip_value = self.cache_client.get(ip_key)
                if not ip_value:
                    continue
                all_ip_list.append(IpInfoModel(**json.loads(ip_value)))
        except Exception as e:
            utils.logger.error(f"[IpCache.load_all_ip] failed to load ip from cache: {e}")
        return all_ip_list
--- a/proxy/providers/__init__.py
+++ b/proxy/providers/__init__.py
@ -0,0 +1,6 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/4/5 10:13
 # @Desc    :
 from .jishu_http_proxy import new_jisu_http_proxy
 from .kuaidl_proxy import new_kuai_daili_proxy
--- a/proxy/providers/jishu_http_proxy.py
+++ b/proxy/providers/jishu_http_proxy.py
@ -0,0 +1,87 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/4/5 09:32
 # @Desc    : DEPRECATED (the provider has shut down). JiSu HTTP proxy IP implementation. Use the Kuaidaili implementation instead (proxy/providers/kuaidl_proxy.py)
 import os
 from typing import Dict, List
 from urllib.parse import urlencode
 import httpx
 from proxy import IpCache, IpGetError, ProxyProvider
 from proxy.types import IpInfoModel
 from tools import utils
 class JiSuHttpProxy(ProxyProvider):
    def __init__(self, key: str, crypto: str, time_validity_period: int):
        """
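        JiSu HTTP proxy IP implementation
        :param key: extraction key (obtained after registering on the official site)
        :param crypto: encryption signature (obtained after registering on the official site)
        :param time_validity_period: IP validity period in minutes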
        """
        self.proxy_brand_name = "JISUHTTP"
        self.api_path = "https://api.jisuhttp.com"
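        self.params = {
            "key": key,
            "crypto": crypto,
            "time": time_validity_period,  # IP validity period; supports 3, 5, 10, 15 or 30 minutes
            "type": "json",  # return results as JSON
            "port": "2",  # IP protocol: 1: HTTP, 2: HTTPS, 3: SOCKS5
            "pw": "1",  # use username/password auth: 1 = yes, 0 = no (whitelist auth); default 0
            "se": "1",  # include the IP expiration time in the JSON response: 1 = show, 0 = hide; default 0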
        }
        self.ip_cache = IpCache()
    async def get_proxies(self, num: int) -> List[IpInfoModel]:
        """
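        :param num: number of proxy IPs requested
        :return: list of IpInfoModel (cached IPs first, then newly fetched ones)
        """
        # Prefer IPs from the cache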
        ip_cache_list = self.ip_cache.load_all_ip(proxy_brand_name=self.proxy_brand_name)
        if len(ip_cache_list) >= num:
            return ip_cache_list[:num]
        # If the cache does not hold enough IPs, fetch the remainder from the proxy provider and store them in the cache
        need_get_count = num - len(ip_cache_list)
        self.params.update({"num": need_get_count})
        ip_infos = []
        async with httpx.AsyncClient() as client:
            url = self.api_path + "/fetchips" + '?' + urlencode(self.params)
            utils.logger.info(f"[JiSuHttpProxy.get_proxies] get ip proxy url:{url}")
            response = await client.get(url, headers={
                "User-Agent": "MediaCrawler https://github.com/NanmiCoder/MediaCrawler"})
            res_dict: Dict = response.json()
            if res_dict.get("code") == 0:
                data: List[Dict] = res_dict.get("data")
                current_ts = utils.get_unix_timestamp()
                for ip_item in data:
                    ip_info_model = IpInfoModel(
                        ip=ip_item.get("ip"),
                        port=ip_item.get("port"),
                        user=ip_item.get("user"),
                        password=ip_item.get("pass"),
                        expired_time_ts=utils.get_unix_time_from_time_str(ip_item.get("expire"))
                    )
                    ip_key = f"JISUHTTP_{ip_info_model.ip}_{ip_info_model.port}_{ip_info_model.user}_{ip_info_model.password}"
                    ip_value = ip_info_model.model_dump_json()  # pydantic v2 API; .json() is deprecated
                    ip_infos.append(ip_info_model)
                    self.ip_cache.set_ip(ip_key, ip_value, ex=ip_info_model.expired_time_ts - current_ts)
            else:
                raise IpGetError(res_dict.get("msg", "unknown err"))
        return ip_cache_list + ip_infos
 def new_jisu_http_proxy() -> JiSuHttpProxy:
    """
    Build a JiSu HTTP proxy instance
    Returns:
    """
    return JiSuHttpProxy(
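        key=os.getenv("jisu_key", ""),  # JiSu HTTP extraction key, read from an environment variable
        crypto=os.getenv("jisu_crypto", ""),  # JiSu HTTP encryption signature, read from an environment variable
        time_validity_period=30  # 30 minutes (the longest validity period)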
    )
--- a/proxy/providers/kuaidl_proxy.py
+++ b/proxy/providers/kuaidl_proxy.py
@ -0,0 +1,134 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/4/5 09:43
 # @Desc    : Kuaidaili HTTP implementation, official docs: https://www.kuaidaili.com/?ref=ldwkjqipvz6c
 import os
 import re
 from typing import Dict, List
 import httpx
 from pydantic import BaseModel, Field
 from proxy import IpCache, IpInfoModel, ProxyProvider
 from proxy.types import ProviderNameEnum
 from tools import utils
 class KuaidailiProxyModel(BaseModel):
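    # Field's first positional argument is the default value, so the descriptions must be passed as title=
    ip: str = Field(title="ip")
    port: int = Field(title="port")
    expire_ts: int = Field(title="expiration time")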
 def parse_kuaidaili_proxy(proxy_info: str) -> KuaidailiProxyModel:
    """
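    Parse Kuaidaili proxy IP info
    Args:
        proxy_info: proxy string in the form "ip:port,expire_ts"
    Returns:
        KuaidailiProxyModel
    """
    proxies: List[str] = proxy_info.split(":")
    if len(proxies) != 2:
        raise Exception("invalid kuaidaili proxy info")
    pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d{1,5}),(\d+)'
    match = re.search(pattern, proxy_info)
    # re.search returns None when nothing matches, so check for None before calling groups()
    if match is None:
        raise Exception("kuaidaili proxy info does not match the expected format")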
    return KuaidailiProxyModel(
        ip=match.groups()[0],
        port=int(match.groups()[1]),
        expire_ts=int(match.groups()[2])
    )
 class KuaiDaiLiProxy(ProxyProvider):
    def __init__(self, kdl_user_name: str, kdl_user_pwd: str, kdl_secret_id: str, kdl_signature: str):
        """
        Args:
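            kdl_user_name: Kuaidaili proxy auth user name
            kdl_user_pwd: Kuaidaili proxy auth password
            kdl_secret_id: Kuaidaili API secret_id
            kdl_signature: Kuaidaili API signature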
        """
        self.kdl_user_name = kdl_user_name
        self.kdl_user_pwd = kdl_user_pwd
        self.api_base = "https://dps.kdlapi.com/"
        self.secret_id = kdl_secret_id
        self.signature = kdl_signature
        self.ip_cache = IpCache()
        self.proxy_brand_name = ProviderNameEnum.KUAI_DAILI_PROVIDER.value
        self.params = {
            "secret_id": self.secret_id,
            "signature": self.signature,
            "pt": 1,
            "format": "json",
            "sep": 1,
            "f_et": 1,
        }
    async def get_proxies(self, num: int) -> List[IpInfoModel]:
        """
        Kuaidaili implementation
        Args:
            num:
        Returns:
        """
        uri = "/api/getdps/"
        # Prefer IPs from the cache
        ip_cache_list = self.ip_cache.load_all_ip(proxy_brand_name=self.proxy_brand_name)
        if len(ip_cache_list) >= num:
            return ip_cache_list[:num]
        # If the cache does not hold enough IPs, fetch the remainder from the proxy provider and store them in the cache
        need_get_count = num - len(ip_cache_list)
        self.params.update({"num": need_get_count})
        ip_infos: List[IpInfoModel] = []
        async with httpx.AsyncClient() as client:
            response = await client.get(self.api_base + uri, params=self.params)
            if response.status_code != 200:
                utils.logger.error(f"[KuaiDaiLiProxy.get_proxies] status code not 200, response.text: {response.text}")
                raise Exception("get ip error from proxy provider and status code not 200 ...")
            ip_response: Dict = response.json()
            if ip_response.get("code") != 0:
                utils.logger.error(f"[KuaiDaiLiProxy.get_proxies] code not 0 and msg: {ip_response.get('msg')}")
                raise Exception("get ip error from proxy provider and code not 0 ...")
            proxy_list: List[str] = ip_response.get("data", {}).get("proxy_list")
            for proxy in proxy_list:
                proxy_model = parse_kuaidaili_proxy(proxy)
                ip_info_model = IpInfoModel(
                    ip=proxy_model.ip,
                    port=proxy_model.port,
                    user=self.kdl_user_name,
                    password=self.kdl_user_pwd,
                    expired_time_ts=proxy_model.expire_ts,
                )
                ip_key = f"{self.proxy_brand_name}_{ip_info_model.ip}_{ip_info_model.port}"
                self.ip_cache.set_ip(ip_key, ip_info_model.model_dump_json(), ex=ip_info_model.expired_time_ts)
                ip_infos.append(ip_info_model)
        return ip_cache_list + ip_infos
 def new_kuai_daili_proxy() -> KuaiDaiLiProxy:
    """
    Build a Kuaidaili HTTP proxy instance
    Returns:
    """
    return KuaiDaiLiProxy(
        kdl_secret_id=os.getenv("kdl_secret_id", "your_kuaidaili_secret_id"),
        kdl_signature=os.getenv("kdl_signature", "your_kuaidaili_signature"),
        kdl_user_name=os.getenv("kdl_user_name", "your_kuaidaili_user_name"),
        kdl_user_pwd=os.getenv("kdl_user_pwd", "your_kuaidaili_password"),
    )
--- a/proxy/proxy_ip_pool.py
+++ b/proxy/proxy_ip_pool.py
@ -0,0 +1,110 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2023/12/2 13:45
 # @Desc    : IP proxy pool implementation
 import random
 from typing import Dict, List
 import httpx
 from tenacity import retry, stop_after_attempt, wait_fixed
 import config
 from proxy.providers import new_jisu_http_proxy, new_kuai_daili_proxy
 from tools import utils
 from .base_proxy import ProxyProvider
 from .types import IpInfoModel, ProviderNameEnum
 class ProxyIpPool:
    def __init__(self, ip_pool_count: int, enable_validate_ip: bool, ip_provider: ProxyProvider) -> None:
        """
        Args:
            ip_pool_count:
            enable_validate_ip:
            ip_provider:
        """
        self.valid_ip_url = "https://httpbin.org/ip"  # URL used to check whether a proxy IP is valid
        self.ip_pool_count = ip_pool_count
        self.enable_validate_ip = enable_validate_ip
        self.proxy_list: List[IpInfoModel] = []
        self.ip_provider: ProxyProvider = ip_provider
    async def load_proxies(self) -> None:
        """
        Load proxy IPs from the provider
        Returns:
        """
        self.proxy_list = await self.ip_provider.get_proxies(self.ip_pool_count)
    async def _is_valid_proxy(self, proxy: IpInfoModel) -> bool:
        """
        Check whether a proxy IP is valid
        :param proxy:
        :return:
        """
        utils.logger.info(f"[ProxyIpPool._is_valid_proxy] testing whether {proxy.ip} is valid")
        try:
            httpx_proxy = {
                f"{proxy.protocol}": f"http://{proxy.user}:{proxy.password}@{proxy.ip}:{proxy.port}"
            }
            async with httpx.AsyncClient(proxies=httpx_proxy) as client:
                response = await client.get(self.valid_ip_url)
            return response.status_code == 200
        except Exception as e:
            utils.logger.info(f"[ProxyIpPool._is_valid_proxy] testing {proxy.ip} err: {e}")
            raise e
    @retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
    async def get_proxy(self) -> IpInfoModel:
        """
        Pick a random proxy IP from the pool
        :return:
        """
        if len(self.proxy_list) == 0:
            await self._reload_proxies()
        proxy = random.choice(self.proxy_list)
        self.proxy_list.remove(proxy)  # once an IP is taken, it is removed from the pool
        if self.enable_validate_ip:
            if not await self._is_valid_proxy(proxy):
                raise Exception("[ProxyIpPool.get_proxy] current ip is invalid, fetching another one")
        return proxy
    async def _reload_proxies(self):
        """
        Reload the proxy pool
        :return:
        """
        self.proxy_list = []
        await self.load_proxies()
 IpProxyProvider: Dict[str, ProxyProvider] = {
    ProviderNameEnum.JISHU_HTTP_PROVIDER.value: new_jisu_http_proxy(),
    ProviderNameEnum.KUAI_DAILI_PROVIDER.value: new_kuai_daili_proxy()
 }
 async def create_ip_pool(ip_pool_count: int, enable_validate_ip: bool) -> ProxyIpPool:
    """
    Create the IP proxy pool
    :param ip_pool_count: number of IPs in the pool
    :param enable_validate_ip: whether to validate proxy IPs
    :return:
    """
    pool = ProxyIpPool(ip_pool_count=ip_pool_count,
                       enable_validate_ip=enable_validate_ip,
                       ip_provider=IpProxyProvider.get(config.IP_PROXY_PROVIDER_NAME)
                       )
    await pool.load_proxies()
    return pool
 if __name__ == '__main__':
    pass
--- a/proxy/types.py
+++ b/proxy/types.py
@ -0,0 +1,23 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/4/5 10:18
 # @Desc    : Base types
 from enum import Enum
 from typing import Optional
 from pydantic import BaseModel, Field
 class ProviderNameEnum(Enum):
    JISHU_HTTP_PROVIDER: str = "jishuhttp"
    KUAI_DAILI_PROVIDER: str = "kuaidaili"
 class IpInfoModel(BaseModel):
    """Unified IP model"""
    ip: str = Field(title="ip")
    port: int = Field(title="port")
    user: str = Field(title="username for proxy IP authentication")
    protocol: str = Field(default="https://", title="proxy IP protocol")
    password: str = Field(title="password for proxy IP authentication")
    expired_time_ts: Optional[int] = Field(title="IP expiration time")
--- a/recv_sms.py
+++ b/recv_sms.py
@ -0,0 +1,68 @@
 import re
 from typing import List
 import uvicorn
 from fastapi import FastAPI, HTTPException, status
 from pydantic import BaseModel
 import config
 from cache.abs_cache import AbstractCache
 from cache.cache_factory import CacheFactory
 from tools import utils
 app = FastAPI()
 cache_client: AbstractCache = CacheFactory.create_cache(cache_type=config.CACHE_TYPE_MEMORY)
 class SmsNotification(BaseModel):
    platform: str
    current_number: str
    from_number: str
    sms_content: str
    timestamp: str
 def extract_verification_code(message: str) -> str:
    """
    Extract the first 6-digit verification code from the SMS; return an empty string if none is found.
    """
    pattern = re.compile(r'\b[0-9]{6}\b')
    codes: List[str] = pattern.findall(message)
    return codes[0] if codes else ""
@app.post("/")
 def receive_sms_notification(sms: SmsNotification):
    """
    Receive an SMS notification and store its verification code in the cache.
    Args:
        sms:
            {
                "platform": "xhs",
                "from_number": "1069421xxx134",
                "sms_content": "【小红书】您的验证码是: 171959， 3分钟内有效。请勿向他人泄漏。如非本人操作，可忽略本消息。",
                "timestamp": "1686720601614",
                "current_number": "13152442222"
            }
    Returns:
    """
    utils.logger.info(f"Received SMS notification: {sms.platform}, {sms.current_number}")
    sms_code = extract_verification_code(sms.sms_content)
    if sms_code:
        # Save the verification code in the cache and set the expiration time to 3 minutes.
        key = f"{sms.platform}_{sms.current_number}"
        cache_client.set(key, sms_code, expire_time=60 * 3)
    return {"status": "ok"}
@app.get("/", status_code=status.HTTP_404_NOT_FOUND)
 async def not_found():
    raise HTTPException(status_code=404, detail="Not Found")
 if __name__ == '__main__':
    uvicorn.run(app, port=8000, host='0.0.0.0')
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1,17 @@
 httpx==0.24.0
 Pillow==9.5.0
 playwright==1.42.0
 tenacity==8.2.2
 PyExecJS==1.5.1
 opencv-python
 aiomysql==0.2.0
 redis~=4.6.0
 pydantic==2.5.2
 aiofiles~=23.2.1
 fastapi==0.110.2
 uvicorn==0.29.0
 python-dotenv==1.0.1
 jieba==0.42.1
 wordcloud==1.9.3
 matplotlib==3.9.0
 beautifulsoup4==4.12.3
--- a/schema/tables.sql
+++ b/schema/tables.sql
@ -0,0 +1,317 @@
 -- ----------------------------
 -- Table structure for bilibili_video
 -- ----------------------------
 DROP TABLE IF EXISTS `bilibili_video`;
 CREATE TABLE `bilibili_video` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `video_id` varchar(64) NOT NULL COMMENT '视频ID',
  `video_type` varchar(16) NOT NULL COMMENT '视频类型',
  `title` varchar(500) DEFAULT NULL COMMENT '视频标题',
  `desc` longtext COMMENT '视频描述',
  `create_time` bigint NOT NULL COMMENT '视频发布时间戳',
  `liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
  `video_play_count` varchar(16) DEFAULT NULL COMMENT '视频播放数量',
  `video_danmaku` varchar(16) DEFAULT NULL COMMENT '视频弹幕数量',
  `video_comment` varchar(16) DEFAULT NULL COMMENT '视频评论数量',
  `video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
  `video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
  PRIMARY KEY (`id`),
  KEY `idx_bilibili_vi_video_i_31c36e` (`video_id`),
  KEY `idx_bilibili_vi_create__73e0ec` (`create_time`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B站视频';
 -- ----------------------------
 -- Table structure for bilibili_video_comment
 -- ----------------------------
 DROP TABLE IF EXISTS `bilibili_video_comment`;
 CREATE TABLE `bilibili_video_comment` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `comment_id` varchar(64) NOT NULL COMMENT '评论ID',
  `video_id` varchar(64) NOT NULL COMMENT '视频ID',
  `content` longtext COMMENT '评论内容',
  `create_time` bigint NOT NULL COMMENT '评论时间戳',
  `sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
  PRIMARY KEY (`id`),
  KEY `idx_bilibili_vi_comment_41c34e` (`comment_id`),
  KEY `idx_bilibili_vi_video_i_f22873` (`video_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站视频评论';
 -- ----------------------------
 -- Table structure for douyin_aweme
 -- ----------------------------
 DROP TABLE IF EXISTS `douyin_aweme`;
 CREATE TABLE `douyin_aweme` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
  `sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
  `short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
  `user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
  `ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
  `aweme_type` varchar(16) NOT NULL COMMENT '视频类型',
  `title` varchar(500) DEFAULT NULL COMMENT '视频标题',
  `desc` longtext COMMENT '视频描述',
  `create_time` bigint NOT NULL COMMENT '视频发布时间戳',
  `liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
  `comment_count` varchar(16) DEFAULT NULL COMMENT '视频评论数',
  `share_count` varchar(16) DEFAULT NULL COMMENT '视频分享数',
  `collected_count` varchar(16) DEFAULT NULL COMMENT '视频收藏数',
  `aweme_url` varchar(255) DEFAULT NULL COMMENT '视频详情页URL',
  PRIMARY KEY (`id`),
  KEY `idx_douyin_awem_aweme_i_6f7bc6` (`aweme_id`),
  KEY `idx_douyin_awem_create__299dfe` (`create_time`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频';
 -- ----------------------------
 -- Table structure for douyin_aweme_comment
 -- ----------------------------
 DROP TABLE IF EXISTS `douyin_aweme_comment`;
 CREATE TABLE `douyin_aweme_comment` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
  `sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
  `short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
  `user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
  `ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `comment_id` varchar(64) NOT NULL COMMENT '评论ID',
  `aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
  `content` longtext COMMENT '评论内容',
  `create_time` bigint NOT NULL COMMENT '评论时间戳',
  `sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
  PRIMARY KEY (`id`),
  KEY `idx_douyin_awem_comment_fcd7e4` (`comment_id`),
  KEY `idx_douyin_awem_aweme_i_c50049` (`aweme_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频评论';
 -- ----------------------------
 -- Table structure for dy_creator
 -- ----------------------------
 DROP TABLE IF EXISTS `dy_creator`;
 CREATE TABLE `dy_creator` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(128) NOT NULL COMMENT '用户ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `desc` longtext COMMENT '用户描述',
  `gender` varchar(1) DEFAULT NULL COMMENT '性别',
  `follows` varchar(16) DEFAULT NULL COMMENT '关注数',
  `fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
  `interaction` varchar(16) DEFAULT NULL COMMENT '获赞数',
  `videos_count` varchar(16) DEFAULT NULL COMMENT '作品数',
  PRIMARY KEY (`id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音博主信息';
 -- ----------------------------
 -- Table structure for kuaishou_video
 -- ----------------------------
 DROP TABLE IF EXISTS `kuaishou_video`;
 CREATE TABLE `kuaishou_video` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `video_id` varchar(64) NOT NULL COMMENT '视频ID',
  `video_type` varchar(16) NOT NULL COMMENT '视频类型',
  `title` varchar(500) DEFAULT NULL COMMENT '视频标题',
  `desc` longtext COMMENT '视频描述',
  `create_time` bigint NOT NULL COMMENT '视频发布时间戳',
  `liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
  `viewd_count` varchar(16) DEFAULT NULL COMMENT '视频浏览数量',
  `video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
  `video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
  `video_play_url` varchar(512) DEFAULT NULL COMMENT '视频播放 URL',
  PRIMARY KEY (`id`),
  KEY `idx_kuaishou_vi_video_i_c5c6a6` (`video_id`),
  KEY `idx_kuaishou_vi_create__a10dee` (`create_time`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频';
 -- ----------------------------
 -- Table structure for kuaishou_video_comment
 -- ----------------------------
 DROP TABLE IF EXISTS `kuaishou_video_comment`;
 CREATE TABLE `kuaishou_video_comment` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `comment_id` varchar(64) NOT NULL COMMENT '评论ID',
  `video_id` varchar(64) NOT NULL COMMENT '视频ID',
  `content` longtext COMMENT '评论内容',
  `create_time` bigint NOT NULL COMMENT '评论时间戳',
  `sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
  PRIMARY KEY (`id`),
  KEY `idx_kuaishou_vi_comment_ed48fa` (`comment_id`),
  KEY `idx_kuaishou_vi_video_i_e50914` (`video_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频评论';
 -- ----------------------------
 -- Table structure for weibo_note
 -- ----------------------------
 DROP TABLE IF EXISTS `weibo_note`;
 CREATE TABLE `weibo_note` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
  `avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
  `gender` varchar(12) DEFAULT NULL COMMENT '用户性别',
  `profile_url` varchar(255) DEFAULT NULL COMMENT '用户主页地址',
  `ip_location` varchar(32) DEFAULT NULL COMMENT '发布微博的地理信息',
  `add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
  `last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
  `note_id` varchar(64) NOT NULL COMMENT '帖子ID',
  `content` longtext COMMENT '帖子正文内容',
  `create_time` bigint NOT NULL COMMENT '帖子发布时间戳',
  `create_date_time` varchar(32) NOT NULL COMMENT '帖子发布日期时间',
  `liked_count` varchar(16) DEFAULT NULL COMMENT '帖子点赞数',
  `comments_count` varchar(16) DEFAULT NULL COMMENT '帖子评论数量',
  `shared_count` varchar(16) DEFAULT NULL COMMENT '帖子转发数量',
  `note_url` varchar(512) DEFAULT NULL COMMENT '帖子详情URL',
  PRIMARY KEY (`id`),
  KEY `idx_weibo_note_note_id_f95b1a` (`note_id`),
  KEY `idx_weibo_note_create__692709` (`create_time`),
  KEY `idx_weibo_note_create__d05ed2` (`create_date_time`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博帖子';
 -- ----------------------------
 -- Table structure for weibo_note_comment
 -- ----------------------------
 DROP TABLE IF EXISTS `weibo_note_comment`;
 CREATE TABLE `weibo_note_comment` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT 'Auto-increment ID',
  `user_id` varchar(64) DEFAULT NULL COMMENT 'User ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT 'User nickname',
  `avatar` varchar(255) DEFAULT NULL COMMENT 'User avatar URL',
  `gender` varchar(12) DEFAULT NULL COMMENT 'User gender',
  `profile_url` varchar(255) DEFAULT NULL COMMENT 'User profile URL',
  `ip_location` varchar(32) DEFAULT NULL COMMENT 'Location from which the Weibo post was published',
  `add_ts` bigint NOT NULL COMMENT 'Record creation timestamp',
  `last_modify_ts` bigint NOT NULL COMMENT 'Record last-modified timestamp',
  `comment_id` varchar(64) NOT NULL COMMENT 'Comment ID',
  `note_id` varchar(64) NOT NULL COMMENT 'Post ID',
  `content` longtext COMMENT 'Comment content',
  `create_time` bigint NOT NULL COMMENT 'Comment timestamp',
  `create_date_time` varchar(32) NOT NULL COMMENT 'Comment date and time',
  `comment_like_count` varchar(16) NOT NULL COMMENT 'Comment like count',
  `sub_comment_count` varchar(16) NOT NULL COMMENT 'Comment reply count',
  PRIMARY KEY (`id`),
  KEY `idx_weibo_note__comment_c7611c` (`comment_id`),
  KEY `idx_weibo_note__note_id_24f108` (`note_id`),
  KEY `idx_weibo_note__create__667fe3` (`create_date_time`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Weibo post comments';
 -- ----------------------------
 -- Table structure for xhs_creator
 -- ----------------------------
 DROP TABLE IF EXISTS `xhs_creator`;
 CREATE TABLE `xhs_creator` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT 'Auto-increment ID',
  `user_id` varchar(64) NOT NULL COMMENT 'User ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT 'User nickname',
  `avatar` varchar(255) DEFAULT NULL COMMENT 'User avatar URL',
  `ip_location` varchar(255) DEFAULT NULL COMMENT 'IP location at time of comment',
  `add_ts` bigint NOT NULL COMMENT 'Record creation timestamp',
  `last_modify_ts` bigint NOT NULL COMMENT 'Record last-modified timestamp',
  `desc` longtext COMMENT 'User description',
  `gender` varchar(1) DEFAULT NULL COMMENT 'Gender',
  `follows` varchar(16) DEFAULT NULL COMMENT 'Following count',
  `fans` varchar(16) DEFAULT NULL COMMENT 'Follower count',
  `interaction` varchar(16) DEFAULT NULL COMMENT 'Likes and favorites received',
  `tag_list` longtext COMMENT 'Tag list',
  PRIMARY KEY (`id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Xiaohongshu creators';
 -- ----------------------------
 -- Table structure for xhs_note
 -- ----------------------------
 DROP TABLE IF EXISTS `xhs_note`;
 CREATE TABLE `xhs_note` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT 'Auto-increment ID',
  `user_id` varchar(64) NOT NULL COMMENT 'User ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT 'User nickname',
  `avatar` varchar(255) DEFAULT NULL COMMENT 'User avatar URL',
  `ip_location` varchar(255) DEFAULT NULL COMMENT 'IP location at time of comment',
  `add_ts` bigint NOT NULL COMMENT 'Record creation timestamp',
  `last_modify_ts` bigint NOT NULL COMMENT 'Record last-modified timestamp',
  `note_id` varchar(64) NOT NULL COMMENT 'Note ID',
  `type` varchar(16) DEFAULT NULL COMMENT 'Note type (normal | video)',
  `title` varchar(255) DEFAULT NULL COMMENT 'Note title',
  `desc` longtext COMMENT 'Note description',
  `video_url` longtext COMMENT 'Video URL',
  `time` bigint NOT NULL COMMENT 'Note publish timestamp',
  `last_update_time` bigint NOT NULL COMMENT 'Note last-update timestamp',
  `liked_count` varchar(16) DEFAULT NULL COMMENT 'Note like count',
  `collected_count` varchar(16) DEFAULT NULL COMMENT 'Note favorite count',
  `comment_count` varchar(16) DEFAULT NULL COMMENT 'Note comment count',
  `share_count` varchar(16) DEFAULT NULL COMMENT 'Note share count',
  `image_list` longtext COMMENT 'Note cover image list',
  `tag_list` longtext COMMENT 'Tag list',
  `note_url` varchar(255) DEFAULT NULL COMMENT 'Note detail page URL',
  PRIMARY KEY (`id`),
  KEY `idx_xhs_note_note_id_209457` (`note_id`),
  KEY `idx_xhs_note_time_eaa910` (`time`)
 ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Xiaohongshu notes';
 -- ----------------------------
 -- Table structure for xhs_note_comment
 -- ----------------------------
 DROP TABLE IF EXISTS `xhs_note_comment`;
 CREATE TABLE `xhs_note_comment` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT 'Auto-increment ID',
  `user_id` varchar(64) NOT NULL COMMENT 'User ID',
  `nickname` varchar(64) DEFAULT NULL COMMENT 'User nickname',
  `avatar` varchar(255) DEFAULT NULL COMMENT 'User avatar URL',
  `ip_location` varchar(255) DEFAULT NULL COMMENT 'IP location at time of comment',
  `add_ts` bigint NOT NULL COMMENT 'Record creation timestamp',
  `last_modify_ts` bigint NOT NULL COMMENT 'Record last-modified timestamp',
  `comment_id` varchar(64) NOT NULL COMMENT 'Comment ID',
  `create_time` bigint NOT NULL COMMENT 'Comment timestamp',
  `note_id` varchar(64) NOT NULL COMMENT 'Note ID',
  `content` longtext NOT NULL COMMENT 'Comment content',
  `sub_comment_count` int NOT NULL COMMENT 'Number of sub-comments',
  `pictures` varchar(512) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_xhs_note_co_comment_8e8349` (`comment_id`),
  KEY `idx_xhs_note_co_create__204f8d` (`create_time`)
 ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Xiaohongshu note comments';
 -- ----------------------------
 -- Alter the xhs, douyin and bilibili comment tables to support parent_comment_id
 -- ----------------------------
 ALTER TABLE `xhs_note_comment`
 ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT 'Parent comment ID';
 ALTER TABLE `douyin_aweme_comment`
 ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT 'Parent comment ID';
 ALTER TABLE `bilibili_video_comment`
 ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT 'Parent comment ID';
 SET FOREIGN_KEY_CHECKS = 1;
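Reviewer sketch (not part of the commit): the `parent_comment_id` column added above lets replies be looked up by their parent. The example below is a minimal illustration that uses an in-memory SQLite database as a stand-in for MySQL and a trimmed-down, hypothetical version of `xhs_note_comment`. It assumes top-level comments keep the column's `NULL` default; the Bilibili store in this commit writes `"0"` for top-level comments instead, so the filter would differ there. The helper names `replies_of` and `top_level_comments` are invented for the illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; only the columns needed for the
# illustration are kept from xhs_note_comment.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE xhs_note_comment (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        comment_id TEXT NOT NULL,
        note_id TEXT NOT NULL,
        content TEXT NOT NULL,
        parent_comment_id TEXT DEFAULT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO xhs_note_comment (comment_id, note_id, content, parent_comment_id) "
    "VALUES (?, ?, ?, ?)",
    [
        ("c1", "n1", "top-level comment", None),
        ("c2", "n1", "first reply", "c1"),
        ("c3", "n1", "second reply", "c1"),
        ("c4", "n1", "another top-level comment", None),
    ],
)


def replies_of(comment_id: str) -> list:
    """Return the comment_ids of the direct replies to one comment."""
    cursor = conn.execute(
        "SELECT comment_id FROM xhs_note_comment "
        "WHERE parent_comment_id = ? ORDER BY id",
        (comment_id,),
    )
    return [row[0] for row in cursor.fetchall()]


def top_level_comments(note_id: str) -> list:
    """Return the comment_ids on a note that have no parent (NULL parent)."""
    cursor = conn.execute(
        "SELECT comment_id FROM xhs_note_comment "
        "WHERE note_id = ? AND parent_comment_id IS NULL ORDER BY id",
        (note_id,),
    )
    return [row[0] for row in cursor.fetchall()]
```

If reply lookups become frequent, an index on `parent_comment_id` would be the natural follow-up; the commit itself does not add one.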
--- a/static/images/11群二维码.JPG
+++ b/static/images/11群二维码.JPG
--- a/static/images/IP_提取图.png
+++ b/static/images/IP_提取图.png
--- a/static/images/img.png
+++ b/static/images/img.png
--- a/static/images/img_1.png
+++ b/static/images/img_1.png
--- a/static/images/img_2.png
+++ b/static/images/img_2.png
--- a/static/images/img_3.png
+++ b/static/images/img_3.png
--- a/static/images/img_4.png
+++ b/static/images/img_4.png
--- a/static/images/relakkes_weichat.JPG
+++ b/static/images/relakkes_weichat.JPG
--- a/static/images/wechat_pay.jpeg
+++ b/static/images/wechat_pay.jpeg
--- a/static/images/xingqiu.jpg
+++ b/static/images/xingqiu.jpg
--- a/static/images/zfb_pay.png
+++ b/static/images/zfb_pay.png
--- a/static/images/代理IP
+++ b/static/images/代理IP
--- a/static/images/修改代理密钥.png
+++ b/static/images/修改代理密钥.png
--- a/store/__init__.py
+++ b/store/__init__.py
@ -0,0 +1,4 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/1/14 17:29
 # @Desc    :
--- a/store/bilibili/__init__.py
+++ b/store/bilibili/__init__.py
@ -0,0 +1,82 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/1/14 19:34
 # @Desc    :
 from typing import Dict, List
 import config
 from base.base_crawler import AbstractStore
 from tools import utils
 from .bilibili_store_impl import *
 class BiliStoreFactory:
    STORES = {
        "csv": BiliCsvStoreImplement,
        "db": BiliDbStoreImplement,
        "json": BiliJsonStoreImplement
    }
    @staticmethod
    def create_store() -> AbstractStore:
        store_class = BiliStoreFactory.STORES.get(config.SAVE_DATA_OPTION)
        if not store_class:
            raise ValueError(
                "[BiliStoreFactory.create_store] Invalid save option; only csv, db, or json are supported.")
        return store_class()
 async def update_bilibili_video(video_item: Dict):
    video_item_view: Dict = video_item.get("View")
    video_user_info: Dict = video_item_view.get("owner")
    video_item_stat: Dict = video_item_view.get("stat")
    video_id = str(video_item_view.get("aid"))
    save_content_item = {
        "video_id": video_id,
        "video_type": "video",
        "title": video_item_view.get("title", "")[:500],
        "desc": video_item_view.get("desc", "")[:500],
        "create_time": video_item_view.get("pubdate"),
        "user_id": str(video_user_info.get("mid")),
        "nickname": video_user_info.get("name"),
        "avatar": video_user_info.get("face", ""),
        "liked_count": str(video_item_stat.get("like", "")),
        "video_play_count": str(video_item_stat.get("view", "")),
        "video_danmaku": str(video_item_stat.get("danmaku", "")),
        "video_comment": str(video_item_stat.get("reply", "")),
        "last_modify_ts": utils.get_current_timestamp(),
        "video_url": f"https://www.bilibili.com/video/av{video_id}",
        "video_cover_url": video_item_view.get("pic", ""),
    }
    utils.logger.info(
        f"[store.bilibili.update_bilibili_video] bilibili video id:{video_id}, title:{save_content_item.get('title')}")
    await BiliStoreFactory.create_store().store_content(content_item=save_content_item)
 async def batch_update_bilibili_video_comments(video_id: str, comments: List[Dict]):
    if not comments:
        return
    for comment_item in comments:
        await update_bilibili_video_comment(video_id, comment_item)
 async def update_bilibili_video_comment(video_id: str, comment_item: Dict):
    comment_id = str(comment_item.get("rpid"))
    parent_comment_id = str(comment_item.get("parent", 0))
    content: Dict = comment_item.get("content")
    user_info: Dict = comment_item.get("member")
    save_comment_item = {
        "comment_id": comment_id,
        "parent_comment_id": parent_comment_id,
        "create_time": comment_item.get("ctime"),
        "video_id": str(video_id),
        "content": content.get("message"),
        "user_id": user_info.get("mid"),
        "nickname": user_info.get("uname"),
        "avatar": user_info.get("avatar"),
        "sub_comment_count": str(comment_item.get("rcount", 0)),
        "last_modify_ts": utils.get_current_timestamp(),
    }
    utils.logger.info(
        f"[store.bilibili.update_bilibili_video_comment] Bilibili video comment: {comment_id}, content: {save_comment_item.get('content')}")
    await BiliStoreFactory.create_store().store_comment(comment_item=save_comment_item)
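Reviewer sketch (not part of the commit): `BiliStoreFactory` above maps a configured option string to a store class and instantiates it. The self-contained example below shows the same lookup-then-instantiate pattern. All `Demo*` names are hypothetical stand-ins for `AbstractStore` and its implementations, and the option is passed as a parameter rather than read from `config.SAVE_DATA_OPTION` so the sketch runs on its own.

```python
from typing import Dict, Type


class DemoStore:
    """Hypothetical stand-in for AbstractStore; not a class from the repository."""
    name = "base"


class DemoCsvStore(DemoStore):
    name = "csv"


class DemoDbStore(DemoStore):
    name = "db"


class DemoJsonStore(DemoStore):
    name = "json"


class DemoStoreFactory:
    # Registry from option string to store class, mirroring STORES above.
    STORES: Dict[str, Type[DemoStore]] = {
        "csv": DemoCsvStore,
        "db": DemoDbStore,
        "json": DemoJsonStore,
    }

    @staticmethod
    def create_store(save_option: str) -> DemoStore:
        """Look up the class for save_option and return a new instance."""
        store_class = DemoStoreFactory.STORES.get(save_option)
        if not store_class:
            supported = ", ".join(sorted(DemoStoreFactory.STORES))
            raise ValueError(
                f"Invalid save option {save_option!r}; supported: {supported}")
        return store_class()
```

Under this layout, supporting another backend means adding one class and one registry entry; callers such as `update_bilibili_video` would stay unchanged.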
--- a/store/bilibili/bilibili_store_impl.py
+++ b/store/bilibili/bilibili_store_impl.py
@ -0,0 +1,206 @@
 # -*- coding: utf-8 -*-
 # @Author  : relakkes@gmail.com
 # @Time    : 2024/1/14 19:34
 # @Desc    : Bilibili storage implementation classes
 import asyncio
 import csv
 import json
 import os
 import pathlib
 from typing import Dict
 import aiofiles
 import config
 from base.base_crawler import AbstractStore
 from tools import utils, words
 from var import crawler_type_var
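 def calculate_number_of_files(file_store_path: str) -> int:
    """Compute the numeric prefix for the next data file, so that each run writes to a new file.
    Args:
        file_store_path: directory in which the data files are stored
    Returns:
        the next file number (1 if the directory is missing, empty, or contains an entry without a numeric prefix)
    """
    if not os.path.exists(file_store_path):
        return 1
    try:
        return max(int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)) + 1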
    except ValueError:
        return 1
 class BiliCsvStoreImplement(AbstractStore):
    csv_store_path: str = "data/bilibili"
    file_count: int = calculate_number_of_files(csv_store_path)
    def make_save_file_name(self, store_type: str) -> str:
        """
        make save file name by store type
        Args:
            store_type: contents or comments
        Returns: eg: data/bilibili/1_search_comments_20240114.csv ...
        """
        return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
    async def save_data_to_csv(self, save_item: Dict, store_type: str):
        """
        Below is a simple way to save it in CSV format.
        Args:
            save_item:  save content dict info
            store_type: which kind of data to save (contents | comments)
        Returns: no returns
        """
        pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
        save_file_name = self.make_save_file_name(store_type=store_type)
        async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
            writer = csv.writer(f)
            if await f.tell() == 0:
                await writer.writerow(save_item.keys())
            await writer.writerow(save_item.values())
    async def store_content(self, content_item: Dict):
        """
        Bilibili content CSV storage implementation
        Args:
            content_item: note item dict
        Returns:
        """
        await self.save_data_to_csv(save_item=content_item, store_type="contents")
    async def store_comment(self, comment_item: Dict):
        """
        Bilibili comment CSV storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.save_data_to_csv(save_item=comment_item, store_type="comments")
 class BiliDbStoreImplement(AbstractStore):
    async def store_content(self, content_item: Dict):
        """
        Bilibili content DB storage implementation
        Args:
            content_item: content item dict
        Returns:
        """
        from .bilibili_store_sql import (add_new_content,
                                         query_content_by_content_id,
                                         update_content_by_content_id)
        video_id = content_item.get("video_id")
        video_detail: Dict = await query_content_by_content_id(content_id=video_id)
        if not video_detail:
            content_item["add_ts"] = utils.get_current_timestamp()
            await add_new_content(content_item)
        else:
            await update_content_by_content_id(video_id, content_item=content_item)
    async def store_comment(self, comment_item: Dict):
        """
        Bilibili comment DB storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        from .bilibili_store_sql import (add_new_comment,
                                         query_comment_by_comment_id,
                                         update_comment_by_comment_id)
        comment_id = comment_item.get("comment_id")
        comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
        if not comment_detail:
            comment_item["add_ts"] = utils.get_current_timestamp()
            await add_new_comment(comment_item)
        else:
            await update_comment_by_comment_id(comment_id, comment_item=comment_item)
 class BiliJsonStoreImplement(AbstractStore):
    json_store_path: str = "data/bilibili/json"
    words_store_path: str = "data/bilibili/words"
    lock = asyncio.Lock()
    file_count: int = calculate_number_of_files(json_store_path)
    WordCloud = words.AsyncWordCloudGenerator()
    def make_save_file_name(self, store_type: str) -> (str,str):
        """
        make save file name by store type
        Args:
            store_type: which kind of data to save (contents | comments)
        Returns:
        """
        return (
            f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
            f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
        )
    async def save_data_to_json(self, save_item: Dict, store_type: str):
        """
        Below is a simple way to save it in json format.
        Args:
            save_item: save content dict info
            store_type: which kind of data to save (contents | comments)
        Returns:
        """
        pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
        pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
        save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
        save_data = []
        async with self.lock:
            if os.path.exists(save_file_name):
                async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
                    save_data = json.loads(await file.read())
            save_data.append(save_item)
            async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
                await file.write(json.dumps(save_data, ensure_ascii=False))
            if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
                try:
                    await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
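                except Exception as e:
                    # Word cloud generation is best-effort; log the failure instead of hiding it.
                    utils.logger.warning(
                        f"[BiliJsonStoreImplement.save_data_to_json] word cloud generation failed: {e}")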
    async def store_content(self, content_item: Dict):
        """
        Bilibili content JSON storage implementation
        Args:
            content_item: content item dict
        Returns:
        """
        await self.save_data_to_json(content_item, "contents")
    async def store_comment(self, comment_item: Dict):
        """
        Bilibili comment JSON storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.save_data_to_json(comment_item, "comments")
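Reviewer sketch (not part of the commit): `save_data_to_json` above appends by reading the whole JSON file, adding one item, and writing the file back, with an `asyncio.Lock` so concurrent tasks cannot interleave the read and the write. The example below reproduces that read-modify-write pattern using only the standard library. It swaps `aiofiles` for blocking file I/O run through `asyncio.to_thread`, creates the lock inside the running event loop, and uses hypothetical helper names (`append_item`, `demo`) and a temporary directory in place of `data/bilibili/json`.

```python
import asyncio
import json
import os
import tempfile
from typing import Dict, List


def _read_items(path: str) -> List[Dict]:
    """Load the JSON list stored at path, or an empty list if the file is absent."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def _write_items(path: str, items: List[Dict]) -> None:
    """Overwrite path with the full list, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False)


async def append_item(lock: asyncio.Lock, path: str, item: Dict) -> None:
    """Append one item; the lock keeps each read-modify-write cycle atomic."""
    async with lock:
        items = await asyncio.to_thread(_read_items, path)
        items.append(item)
        await asyncio.to_thread(_write_items, path, items)


async def demo(count: int = 20) -> int:
    """Run `count` concurrent appends and return how many items were stored."""
    lock = asyncio.Lock()
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "comments.json")
        await asyncio.gather(
            *(append_item(lock, path, {"comment_id": str(i)}) for i in range(count))
        )
        return len(_read_items(path))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Without the lock, two tasks could read the same file contents and one append would overwrite the other. Rewriting the whole file on every append also makes each write proportional to the file size, which is worth keeping in mind for large crawls.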
@ -0,0 +1,2 @@
 from .base_config import *
 from .db_config import *
@ -0,0 +1,2 @@
 # -*- coding: utf-8 -*-
 from .core import KuaishouCrawler
@ -0,0 +1,2 @@
 from .core import XiaoHongShuCrawler
 from .field import *