---
name: web-scraper
description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs.
---
# Web Scraper
Fetch, search, and extract content from websites.
## When to use this skill
- User asks to fetch or read a webpage / URL
- User wants to search the internet for information
- User needs to extract links, tables, or structured data from a website
- User asks to crawl a JavaScript-rendered (dynamic) page
- User wants web content converted to clean Markdown for analysis
## Scripts overview
| Script | Purpose | Dependencies |
|---|---|---|
| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | `requests`, `beautifulsoup4`, `readability-lxml`, `html2text` |
| `search_web.py` | Search the web via DuckDuckGo | `ddgs` |
| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | `crawl4ai` |
| `extract_links.py` | Extract and categorize all links from a page | `requests`, `beautifulsoup4` |
## Steps
### 1. Install dependencies (first time only)
For lightweight scraping (static pages, search, link extraction):
```bash
pip install requests beautifulsoup4 readability-lxml html2text ddgs
```
For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
```bash
pip install crawl4ai
crawl4ai-setup
```
> **Note**: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.
>
> **CRITICAL — Dependency Error Recovery**: If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the command above, then **re-run the EXACT SAME script command that failed**. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
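
Before re-running a failed script, it can help to see which packages are actually missing. The following is a minimal sketch, not part of the skill's scripts: `PIP_TO_MODULE` and `missing_packages` are hypothetical names, and the mapping assumes the usual import names for the packages listed above (`beautifulsoup4` imports as `bs4`, `readability-lxml` as `readability`).

```python
import importlib.util

# Maps each pip package named in this skill to the module it is imported as.
# The mapping is an assumption based on the packages' usual import names.
PIP_TO_MODULE = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "readability-lxml": "readability",
    "html2text": "html2text",
    "ddgs": "ddgs",
    "crawl4ai": "crawl4ai",
}


def missing_packages(pip_to_module=PIP_TO_MODULE):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in pip_to_module.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print the install command instead of running it, so nothing changes silently.
        print("pip install " + " ".join(missing))
    else:
        print("All dependencies are importable.")
```

Once the printed `pip install` command has been run, re-run the exact script command that failed.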
### 2. Fetch a web page (static — recommended first choice)
Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.
```bash
python scripts/fetch_page.py "URL"
```
Options:
- `--raw` — Output full page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"` — Extract only elements matching the CSS selector (e.g. `".article-body"`, `"table"`, `"#content"`)
- `--save OUTPUT_PATH` — Also save output to a file
- `--max-length N` — Truncate output to N characters (default: no limit)
Examples:
```bash
# Fetch an article
python scripts/fetch_page.py "https://example.com/article"
# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"
# Fetch raw full-page markdown, limit to 5000 chars
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
```
### 3. Search the web
Search using DuckDuckGo (no API key required).
```bash
python scripts/search_web.py "search query"
```
Options:
- `--max-results N` — Number of results to return (default: 10)
- `--region REGION` — Region code, e.g. `cn-zh`, `us-en`, `jp-jp` (default: `wt-wt` for worldwide)
- `--news` — Search news instead of general web
Examples:
```bash
# General search
python scripts/search_web.py "Python web scraping best practices 2025"
# News search, Chinese region, 5 results ("AI 最新进展" means "latest AI developments")
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
```
### 4. Crawl a dynamic / JavaScript-rendered page
Use this only when `fetch_page.py` returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).
```bash
python scripts/crawl_dynamic.py "URL"
```
Options:
- `--wait N` — Wait N seconds after page load for JS to finish (default: 3)
- `--selector "CSS_SELECTOR"` — Wait for a specific element to appear before extracting
- `--scroll` — Scroll to bottom of page to trigger lazy loading
- `--save OUTPUT_PATH` — Also save output to a file
- `--max-length N` — Truncate output to N characters
### 5. Extract links from a page
Extract all links with their text labels, categorized by type (internal, external, resource).
```bash
python scripts/extract_links.py "URL"
```
Options:
- `--filter PATTERN` — Only show links matching a regex pattern (applied to URL)
- `--external-only` — Only show external links
- `--json` — Output as JSON instead of Markdown
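
To illustrate what "categorized by type" can mean, here is a rough standard-library sketch. It is not the actual logic of `extract_links.py`: the `RESOURCE_EXTENSIONS` list, the class and function names, and the rule "same host means internal" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical list of file extensions treated as "resource" links.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def categorize(absolute_url, base_url):
    """Classify an absolute URL as 'resource', 'internal', or 'external'."""
    parsed = urlparse(absolute_url)
    if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
        return "resource"
    if parsed.netloc == urlparse(base_url).netloc:
        return "internal"
    return "external"


def extract_links(html, base_url):
    """Return a dict of category -> list of (absolute_url, link_text)."""
    collector = LinkCollector()
    collector.feed(html)
    result = {"internal": [], "external": [], "resource": []}
    for href, text in collector.links:
        absolute = urljoin(base_url, href)  # resolve relative links
        result[categorize(absolute, base_url)].append((absolute, text))
    return result
```

In this sketch a link's extension is checked before its host, so a PDF on the same site counts as a resource, not as an internal link. The real script may order these checks differently.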
## Decision guide: which script to use
1. **Start with `fetch_page.py`**: it handles most static websites (articles, docs, blogs, wikis).
2. If `fetch_page.py` returns empty/garbled content → try **`crawl_dynamic.py`** (the page likely needs JavaScript).
3. Need to find URLs first? → Use **`search_web.py`** to discover relevant pages.
4. Need to navigate a site structure? → Use **`extract_links.py`** to map out links, then fetch individual pages.
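
The first two rules of the guide can be sketched as a small fallback wrapper. This is an illustration under stated assumptions: the scripts are assumed to print their result to stdout, the 200-character threshold is an arbitrary hypothetical value, and `looks_incomplete`, `run_script`, and `fetch_with_fallback` are made-up helper names.

```python
import subprocess
import sys

MIN_USEFUL_CHARS = 200  # hypothetical threshold; tune for your pages


def looks_incomplete(markdown, min_chars=MIN_USEFUL_CHARS):
    """Heuristic: treat very short output as a sign the page needs JavaScript."""
    return len(markdown.strip()) < min_chars


def run_script(script, url):
    """Run one of the skill's scripts and return its stdout as text."""
    completed = subprocess.run(
        [sys.executable, f"scripts/{script}", url],
        capture_output=True,
        text=True,
        check=False,
    )
    return completed.stdout


def fetch_with_fallback(url, run=run_script):
    """Try the static fetcher first; fall back to the headless browser."""
    content = run("fetch_page.py", url)
    if looks_incomplete(content):
        content = run("crawl_dynamic.py", url)
    return content
```

The `run` parameter exists only so the fallback logic can be exercised without network access. A length check will not catch garbled content, so a human or model judgment on the output is still needed.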
## Common workflows
### Research a topic
1. `search_web.py "topic"` → get relevant URLs
2. `fetch_page.py "best_url"` → read the content
3. Repeat for multiple sources, then synthesize
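
Going from step 1 to step 2 means picking URLs out of the search output. The sketch below assumes only that the output of `search_web.py` contains plain `http(s)` URLs somewhere in its text; its exact format is not specified here. `extract_urls` and `fetch_commands` are hypothetical helpers.

```python
import re

# Matches http(s) URLs up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(search_output, limit=3):
    """Return up to `limit` unique URLs in order of first appearance."""
    seen = []
    for match in URL_PATTERN.findall(search_output):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen[:limit]


def fetch_commands(search_output, limit=3):
    """Build one fetch_page.py command (as an argv list) per selected URL."""
    return [
        ["python", "scripts/fetch_page.py", url]
        for url in extract_urls(search_output, limit)
    ]
```

Each returned argv list can be passed to `subprocess.run` or joined and executed in a shell, one source at a time.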
### Scrape structured data from a page
1. `fetch_page.py "url" --selector "table"` → extract tables
2. Or `fetch_page.py "url" --selector ".product-card"` → extract specific elements
### Crawl a modern web app (SPA)
1. `crawl_dynamic.py "url" --wait 5 --scroll` → full JS-rendered content
## Edge cases
- **Paywalled sites**: May return partial content or login pages. Inform the user.
- **Rate limiting / CAPTCHAs**: If requests fail with 403/429, wait and retry or inform the user.
- **Very large pages**: Use `--max-length` to truncate output and avoid overwhelming the context window.
- **Encoding issues**: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
- **Robots.txt**: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
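
Two of these edge cases lend themselves to small helpers: waiting before retrying on 403/429, and checking robots.txt yourself since the scripts do not. The following is a sketch and not part of the skill. `fetch_with_backoff` takes any hypothetical `fetch(url)` callable returning `(status, body)`, and `allowed_by_robots` expects the robots.txt text to have been downloaded already.

```python
import time
from urllib.robotparser import RobotFileParser

RETRYABLE_STATUSES = {403, 429}


def fetch_with_backoff(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); retry on 403/429 with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    # Still blocked after the last attempt: hand the status back so the
    # caller can inform the user.
    return status, body


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `allowed_by_robots("User-agent: *\nDisallow: /private/\n", "https://example.com/private/page")` returns `False`. If the final status is still 403 or 429, stop retrying and tell the user.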
## Scripts
- [fetch_page.py](scripts/fetch_page.py) — Fetch and extract readable content as Markdown
- [search_web.py](scripts/search_web.py) — Search the web via DuckDuckGo
- [crawl_dynamic.py](scripts/crawl_dynamic.py) — Crawl JavaScript-rendered pages
- [extract_links.py](scripts/extract_links.py) — Extract and categorize page links