| name | description | compatibility |
|---|---|---|
| web-scraper | Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages. | Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs. |
Web Scraper
Fetch, search, and extract content from websites.
When to use this skill
- User asks to fetch or read a webpage / URL
- User wants to search the internet for information
- User needs to extract links, tables, or structured data from a website
- User asks to crawl a JavaScript-rendered (dynamic) page
- User wants web content converted to clean Markdown for analysis
Scripts overview
| Script | Purpose | Dependencies |
|---|---|---|
| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | requests, beautifulsoup4, readability-lxml, html2text |
| `search_web.py` | Search the web via DuckDuckGo | ddgs |
| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | crawl4ai |
| `extract_links.py` | Extract and categorize all links from a page | requests, beautifulsoup4 |
Steps
1. Install dependencies (first time only)
For lightweight scraping (static pages, search, link extraction):
pip install requests beautifulsoup4 readability-lxml html2text ddgs
For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
pip install crawl4ai
crawl4ai-setup
Note: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install it if you actually need dynamic page support.
CRITICAL (Dependency Error Recovery): If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the commands above, then re-run the EXACT SAME script command that failed. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
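The recovery rule can be sketched as a thin wrapper that re-runs the same command rather than replacing it. This is only an illustration: the names `run_with_recovery`, `looks_like_missing_dependency`, and `LIGHTWEIGHT_INSTALL` are hypothetical, and calling pip through `sys.executable -m pip` is an assumption made so that the install targets the interpreter that runs the scripts.

```python
import subprocess
import sys

# Mirrors the lightweight install command from step 1. Using
# `sys.executable -m pip` instead of a bare `pip` is an assumption.
LIGHTWEIGHT_INSTALL = [
    sys.executable, "-m", "pip", "install",
    "requests", "beautifulsoup4", "readability-lxml", "html2text", "ddgs",
]


def looks_like_missing_dependency(stderr):
    """Return True if the error output mentions an import failure."""
    markers = ("ImportError", "ModuleNotFoundError", "No module named")
    return any(marker in stderr for marker in markers)


def run_with_recovery(cmd, install_cmd=LIGHTWEIGHT_INSTALL, run=subprocess.run):
    """Run cmd; on an import failure, install dependencies and re-run cmd once."""
    result = run(cmd, capture_output=True, text=True)
    if result.returncode != 0 and looks_like_missing_dependency(result.stderr):
        run(install_cmd, capture_output=True, text=True)
        # Re-run the exact same command, unchanged.
        result = run(cmd, capture_output=True, text=True)
    return result
```

A call might look like `run_with_recovery([sys.executable, "scripts/fetch_page.py", "https://example.com/article"])`. For dynamic mode, a different `install_cmd` would be needed, and `crawl4ai-setup` still has to be run separately.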
2. Fetch a web page (static — recommended first choice)
Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.
python scripts/fetch_page.py "URL"
Options:
- `--raw`: Output full page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"`: Extract only elements matching the CSS selector (e.g. `.article-body`, `table`, `#content`)
- `--save OUTPUT_PATH`: Also save output to a file
- `--max-length N`: Truncate output to N characters (default: no limit)
Examples:
# Fetch an article
python scripts/fetch_page.py "https://example.com/article"
# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"
# Fetch raw full-page markdown, limit to 5000 chars
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
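When the script is invoked from Python rather than a shell, passing the arguments as a list avoids quoting problems with URLs that contain characters such as `&` or `?`. The helper below is a minimal sketch: `build_fetch_command` is a hypothetical name, and it uses only the flags listed above.

```python
import sys


def build_fetch_command(url, raw=False, selector=None, save=None, max_length=None):
    """Build an argv list for scripts/fetch_page.py from the documented options."""
    cmd = [sys.executable, "scripts/fetch_page.py", url]
    if raw:
        cmd.append("--raw")
    if selector is not None:
        cmd += ["--selector", selector]
    if save is not None:
        cmd += ["--save", str(save)]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    return cmd


if __name__ == "__main__":
    # Equivalent of the third example: raw Markdown limited to 5000 characters.
    print(build_fetch_command("https://example.com", raw=True, max_length=5000))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`, which passes each argument to the script verbatim.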
3. Search the web
Search using DuckDuckGo (no API key required).
python scripts/search_web.py "search query"
Options:
- `--max-results N`: Number of results to return (default: 10)
- `--region REGION`: Region code, e.g. `cn-zh`, `us-en`, `jp-jp` (default: `wt-wt` for worldwide)
- `--news`: Search news instead of general web
Examples:
# General search
python scripts/search_web.py "Python web scraping best practices 2025"
# News search, Chinese region, 5 results
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
4. Crawl a dynamic / JavaScript-rendered page
Use this only when fetch_page.py returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).
python scripts/crawl_dynamic.py "URL"
Options:
- `--wait N`: Wait N seconds after page load for JS to finish (default: 3)
- `--selector "CSS_SELECTOR"`: Wait for a specific element to appear before extracting
- `--scroll`: Scroll to bottom of page to trigger lazy loading
- `--save OUTPUT_PATH`: Also save output to a file
- `--max-length N`: Truncate output to N characters
5. Extract links from a page
Extract all links with their text labels, categorized by type (internal, external, resource).
python scripts/extract_links.py "URL"
Options:
- `--filter PATTERN`: Only show links matching a regex pattern (applied to URL)
- `--external-only`: Only show external links
- `--json`: Output as JSON instead of Markdown
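To make the three categories and the `--filter` behaviour concrete, here is a rough sketch of how such a classification could work. It is not the logic of `extract_links.py`, whose rules are not shown here. The extension list, the same-host test for "internal", and the function names are all assumptions.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical set of file extensions treated as "resource" links.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def categorize_link(page_url, href):
    """Resolve href against page_url and label it internal, external, or resource."""
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)
    if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
        return absolute, "resource"
    if parsed.netloc == urlparse(page_url).netloc:
        return absolute, "internal"
    return absolute, "external"


def filter_links(urls, pattern):
    """Keep only URLs where the regex pattern matches somewhere in the URL."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]
```

Under these assumptions, a relative link such as `guide.html` on `https://example.com/docs/` resolves to the same host and is labelled internal, while a link to another host is labelled external.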
Decision guide: which script to use
- Start with `fetch_page.py`: it handles 90% of websites (articles, docs, blogs, wikis).
- If `fetch_page.py` returns empty/garbled content → try `crawl_dynamic.py` (the page likely needs JavaScript).
- Need to find URLs first? → Use `search_web.py` to discover relevant pages.
- Need to navigate a site structure? → Use `extract_links.py` to map out links, then fetch individual pages.
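The static-first, dynamic-fallback rule can be sketched as follows. The function name `fetch_with_fallback` and the `min_chars` threshold are hypothetical; a length check is only a crude stand-in for "empty" and cannot detect garbled output. A non-zero exit from `fetch_page.py` is raised rather than treated as a reason to fall back, so that dependency errors are still handled by the recovery rule in step 1.

```python
import subprocess
import sys


def fetch_with_fallback(url, min_chars=200, run=subprocess.run):
    """Try fetch_page.py first; use crawl_dynamic.py if the output looks empty."""
    static = run(
        [sys.executable, "scripts/fetch_page.py", url],
        capture_output=True, text=True,
    )
    if static.returncode != 0:
        # Script failure (e.g. a missing dependency) is not a signal to switch scripts.
        raise RuntimeError(static.stderr)
    if len(static.stdout.strip()) >= min_chars:
        return "fetch_page.py", static.stdout
    # Too little content: the page likely needs JavaScript rendering.
    dynamic = run(
        [sys.executable, "scripts/crawl_dynamic.py", url],
        capture_output=True, text=True,
    )
    return "crawl_dynamic.py", dynamic.stdout
```

The returned tuple names the script that produced the content, which makes it easy to report to the user which path was taken.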
Common workflows
Research a topic
- `search_web.py "topic"` → get relevant URLs
- `fetch_page.py "best_url"` → read the content
- Repeat for multiple sources, then synthesize
Scrape structured data from a page
- `fetch_page.py "url" --selector "table"` → extract tables
- Or `fetch_page.py "url" --selector ".product-card"` → extract specific elements
Crawl a modern web app (SPA)
- `crawl_dynamic.py "url" --wait 5 --scroll` → full JS-rendered content
Edge cases
- Paywalled sites: May return partial content or login pages. Inform the user.
- Rate limiting / CAPTCHAs: If requests fail with 403/429, wait and retry or inform the user.
- Very large pages: Use `--max-length` to truncate output and avoid overwhelming the context window.
- Encoding issues: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
- Robots.txt: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
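Since the scripts do not consult robots.txt, a caller who wants to respect it could run a check beforehand. The sketch below uses the standard library's `urllib.robotparser`. The helper name `is_allowed` is hypothetical, and it takes the robots.txt text as a string so the logic can be tried without a network request.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/private/page"))
    print(is_allowed(rules, "https://example.com/public"))
```

For a live site, `RobotFileParser` can also load the file itself via `set_url(...)` followed by `read()`, at the cost of one extra request per host.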
Scripts
- fetch_page.py — Fetch and extract readable content as Markdown
- search_web.py — Search the web via DuckDuckGo
- crawl_dynamic.py — Crawl JavaScript-rendered pages
- extract_links.py — Extract and categorize page links