AI Newsletter Digest improvements: fixed QP soft line break decoding, URL extraction, and content cleaning
157
skills/aidotnet-web-scraper/SKILL.md
Normal file
@@ -0,0 +1,157 @@
---
name: web-scraper
description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs.
---

# Web Scraper

Fetch, search, and extract content from websites.

## When to use this skill

- User asks to fetch or read a webpage / URL
- User wants to search the internet for information
- User needs to extract links, tables, or structured data from a website
- User asks to crawl a JavaScript-rendered (dynamic) page
- User wants web content converted to clean Markdown for analysis

## Scripts overview

| Script | Purpose | Dependencies |
|---|---|---|
| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | `requests`, `beautifulsoup4`, `readability-lxml`, `html2text` |
| `search_web.py` | Search the web via DuckDuckGo | `ddgs` |
| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | `crawl4ai` |
| `extract_links.py` | Extract and categorize all links from a page | `requests`, `beautifulsoup4` |

## Steps

### 1. Install dependencies (first time only)

For lightweight scraping (static pages, search, link extraction):
```bash
pip install requests beautifulsoup4 readability-lxml html2text ddgs
```

For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
```bash
pip install crawl4ai
crawl4ai-setup
```

> **Note**: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.

> **CRITICAL — Dependency Error Recovery**: If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the command above, then **re-run the EXACT SAME script command that failed**. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.

### 2. Fetch a web page (static — recommended first choice)

Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.

```bash
python scripts/fetch_page.py "URL"
```

Options:
- `--raw` — Output full page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"` — Extract only elements matching the CSS selector (e.g. `".article-body"`, `"table"`, `"#content"`)
- `--save OUTPUT_PATH` — Also save output to a file
- `--max-length N` — Truncate output to N characters (default: no limit)

Examples:
```bash
# Fetch an article
python scripts/fetch_page.py "https://example.com/article"

# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"

# Fetch raw full-page markdown, limit to 5000 chars
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
```
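
When the script is driven from another Python program rather than a shell, the options above map onto an argument list. The sketch below is illustrative only: `build_fetch_command` and `run_fetch` are hypothetical helper names that do not exist in this skill, and the default `scripts/fetch_page.py` path assumes the working directory is the skill root.

```python
import subprocess
import sys


def build_fetch_command(url, raw=False, selector=None, save=None,
                        max_length=None, script="scripts/fetch_page.py"):
    """Translate keyword options into a fetch_page.py argument list.

    Hypothetical helper: only the flags documented above are emitted.
    """
    cmd = [sys.executable, script, url]
    if raw:
        cmd.append("--raw")
    if selector:
        cmd += ["--selector", selector]
    if save:
        cmd += ["--save", save]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    return cmd


def run_fetch(url, **options):
    """Run fetch_page.py and return its stdout (the Markdown).

    Raises subprocess.CalledProcessError when the script exits non-zero.
    """
    result = subprocess.run(
        build_fetch_command(url, **options),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs and CSS selectors that contain spaces, `&`, or quotes.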

### 3. Search the web

Search using DuckDuckGo (no API key required).

```bash
python scripts/search_web.py "search query"
```

Options:
- `--max-results N` — Number of results to return (default: 10)
- `--region REGION` — Region code, e.g. `cn-zh`, `us-en`, `jp-jp` (default: `wt-wt` for worldwide)
- `--news` — Search news instead of general web

Examples:
```bash
# General search
python scripts/search_web.py "Python web scraping best practices 2025"

# News search, Chinese region, 5 results
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
```

### 4. Crawl a dynamic / JavaScript-rendered page

Use this only when `fetch_page.py` returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).

```bash
python scripts/crawl_dynamic.py "URL"
```

Options:
- `--wait N` — Wait N seconds after page load for JS to finish (default: 3)
- `--selector "CSS_SELECTOR"` — Wait for a specific element to appear before extracting
- `--scroll` — Scroll to bottom of page to trigger lazy loading
- `--save OUTPUT_PATH` — Also save output to a file
- `--max-length N` — Truncate output to N characters

### 5. Extract links from a page

Extract all links with their text labels, categorized by type (internal, external, resource).

```bash
python scripts/extract_links.py "URL"
```

Options:
- `--filter PATTERN` — Only show links matching a regex pattern (applied to URL)
- `--external-only` — Only show external links
- `--json` — Output as JSON instead of Markdown
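
With `--json`, the script prints a list of records, each carrying `url`, `text`, and `type` keys. A minimal sketch of consuming that output follows; `group_links_by_type` is a hypothetical helper, and the sample URLs are made up for illustration.

```python
import json
from collections import defaultdict


def group_links_by_type(json_text):
    """Group extract_links.py --json records by their "type" field."""
    grouped = defaultdict(list)
    for record in json.loads(json_text):
        grouped[record["type"]].append(record["url"])
    return dict(grouped)


# Illustrative sample in the shape the script emits; not real output.
sample = """[
  {"url": "https://example.com/docs", "text": "Docs", "type": "internal"},
  {"url": "https://other.example.org/", "text": "Partner", "type": "external"},
  {"url": "https://example.com/report.pdf", "text": "Report", "type": "resource"}
]"""

groups = group_links_by_type(sample)
print(groups["resource"])  # prints ['https://example.com/report.pdf']
```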

## Decision guide: which script to use

1. **Start with `fetch_page.py`**: it handles most static sites (articles, docs, blogs, wikis).
2. If `fetch_page.py` returns empty/garbled content → try **`crawl_dynamic.py`** (the page likely needs JavaScript).
3. Need to find URLs first? → Use **`search_web.py`** to discover relevant pages.
4. Need to navigate a site structure? → Use **`extract_links.py`** to map out links, then fetch individual pages.
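
The guide above can be expressed as a small routing function. This is a sketch under stated assumptions, not logic taken from the scripts: `needs_dynamic_crawl` and `pick_script` are hypothetical names, and the 200-character threshold for "empty" output is an arbitrary choice to tune per site.

```python
def needs_dynamic_crawl(static_markdown, min_chars=200):
    """Return True when fetch_page.py output looks too thin to be the real page.

    The threshold is an assumption, not a value used by the scripts.
    """
    return len(static_markdown.strip()) < min_chars


def pick_script(goal, static_markdown=None):
    """Map a goal ('search', 'links', or 'read') to the script to run next.

    For 'read', pass the output of a previous fetch_page.py run (if any)
    to decide whether a headless-browser crawl is worth trying.
    """
    if goal == "search":
        return "search_web.py"
    if goal == "links":
        return "extract_links.py"
    if static_markdown is not None and needs_dynamic_crawl(static_markdown):
        return "crawl_dynamic.py"
    return "fetch_page.py"
```

A length check cannot detect garbled content; that case still needs a human or model to read the output and decide.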

## Common workflows

### Research a topic
1. `search_web.py "topic"` → get relevant URLs
2. `fetch_page.py "best_url"` → read the content
3. Repeat for multiple sources, then synthesize
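
The "repeat for multiple sources" step can be sketched as a loop that tolerates individual failures. Everything here is hypothetical: `collect_sources` is not part of the skill, and `fetch` stands for any callable that returns a page's Markdown, for instance a small wrapper that runs `scripts/fetch_page.py` through `subprocess.run`.

```python
def collect_sources(urls, fetch, max_sources=3, max_chars=5000):
    """Fetch up to max_sources pages, skipping failures, and keep a trimmed copy.

    `fetch` is an injected callable (url -> Markdown text); trimming to
    max_chars keeps the combined material small enough to synthesize.
    """
    collected = []
    for url in urls:
        if len(collected) >= max_sources:
            break
        try:
            text = fetch(url)
        except Exception as exc:  # one bad page should not abort the run
            print(f"skipped {url}: {exc}")
            continue
        collected.append({"url": url, "text": text[:max_chars]})
    return collected
```

Injecting `fetch` keeps the loop easy to exercise without network access and lets the same code fall back to a dynamic crawl for specific URLs.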

### Scrape structured data from a page
1. `fetch_page.py "url" --selector "table"` → extract tables
2. Or `fetch_page.py "url" --selector ".product-card"` → extract specific elements

### Crawl a modern web app (SPA)
1. `crawl_dynamic.py "url" --wait 5 --scroll` → full JS-rendered content

## Edge cases

- **Paywalled sites**: May return partial content or login pages. Inform the user.
- **Rate limiting / CAPTCHAs**: If requests fail with 403/429, wait and retry or inform the user.
- **Very large pages**: Use `--max-length` to truncate output and avoid overwhelming the context window.
- **Encoding issues**: Output is written as UTF-8. When a response declares a non-UTF-8 charset, `fetch_page.py` and `extract_links.py` switch to the encoding detected by `requests` (`apparent_encoding`). Unusual encodings may still need manual adjustment.
- **Robots.txt**: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
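
Because the scripts neither check robots.txt nor retry on their own, a caller may want to do both. The sketch below uses only the standard library; `is_allowed` and `backoff_delays` are hypothetical helper names, and the retry count and base delay are arbitrary assumptions rather than values from the scripts.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delays(retries=3, base_seconds=2.0):
    """Exponential wait times to sleep between retries after HTTP 403/429."""
    return [base_seconds * (2 ** attempt) for attempt in range(retries)]


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/private/page"))  # prints False
print(is_allowed(rules, "https://example.com/blog/post"))     # prints True
print(backoff_delays())                                       # prints [2.0, 4.0, 8.0]
```

In practice the rules text would come from the site's `/robots.txt`; it is passed in as a string here so the check can be run offline.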

## Scripts

- [fetch_page.py](scripts/fetch_page.py) — Fetch and extract readable content as Markdown
- [search_web.py](scripts/search_web.py) — Search the web via DuckDuckGo
- [crawl_dynamic.py](scripts/crawl_dynamic.py) — Crawl JavaScript-rendered pages
- [extract_links.py](scripts/extract_links.py) — Extract and categorize page links
Binary file not shown.
254
skills/aidotnet-web-scraper/scripts/crawl_dynamic.py
Normal file
@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""Crawl JavaScript-rendered (dynamic) web pages using Crawl4AI.

Uses a headless Chromium browser to render pages that require JavaScript,
then extracts clean Markdown content. Use this when fetch_page.py returns
empty or incomplete content (SPAs, React/Vue apps, etc.).

Dependencies: pip install crawl4ai && crawl4ai-setup
"""

import sys
import argparse
import asyncio


def setup_encoding():
    """Setup proper encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that Crawl4AI is installed."""
    try:
        import crawl4ai  # noqa: F401
    except ImportError:
        print("Error: crawl4ai not installed.", file=sys.stderr)
        print("Install with:", file=sys.stderr)
        print("  pip install crawl4ai", file=sys.stderr)
        print("  crawl4ai-setup", file=sys.stderr)
        sys.exit(1)


async def crawl_page(url, wait_seconds=3, css_selector=None, scroll=False):
    """Crawl a page with headless browser and return Markdown content."""
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
    from crawl4ai.content_filter_strategy import PruningContentFilter
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

    browser_conf = BrowserConfig(
        headless=True,
        verbose=False,
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
        page_timeout=60000,  # 60s
        wait_until="networkidle",
    )

    # Add wait time if specified
    if wait_seconds > 0:
        run_conf.delay_before_return_html = wait_seconds

    # Wait for specific CSS selector
    if css_selector:
        run_conf.wait_for = f"css:{css_selector}"

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(url=url, config=run_conf)

        if not result.success:
            return None, result.error_message or "Unknown error"

        # Get the best available markdown
        md = ""
        if result.markdown:
            if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
                md = result.markdown.fit_markdown
            elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown:
                md = result.markdown.raw_markdown
            elif isinstance(result.markdown, str):
                md = result.markdown

        title = ""
        if hasattr(result, 'metadata') and result.metadata:
            title = result.metadata.get('title', '')

        return {
            "title": title,
            "url": result.url or url,
            "markdown": md,
            "status_code": getattr(result, 'status_code', None),
        }, None


async def crawl_with_scroll(url, wait_seconds=3, css_selector=None):
    """Crawl with infinite scroll support."""
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
    from crawl4ai.content_filter_strategy import PruningContentFilter

    browser_conf = BrowserConfig(
        headless=True,
        verbose=False,
    )

    # JavaScript to scroll to bottom
    scroll_js = """
    async () => {
        await new Promise((resolve) => {
            let totalHeight = 0;
            const distance = 500;
            const timer = setInterval(() => {
                const scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 300);
            // Safety timeout
            setTimeout(() => { clearInterval(timer); resolve(); }, 15000);
        });
    }
    """

    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
        page_timeout=60000,
        js_code=scroll_js,
        wait_until="networkidle",
    )

    if wait_seconds > 0:
        run_conf.delay_before_return_html = wait_seconds

    if css_selector:
        run_conf.wait_for = f"css:{css_selector}"

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(url=url, config=run_conf)

        if not result.success:
            return None, result.error_message or "Unknown error"

        md = ""
        if result.markdown:
            if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
                md = result.markdown.fit_markdown
            elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown:
                md = result.markdown.raw_markdown
            elif isinstance(result.markdown, str):
                md = result.markdown

        title = ""
        if hasattr(result, 'metadata') and result.metadata:
            title = result.metadata.get('title', '')

        return {
            "title": title,
            "url": result.url or url,
            "markdown": md,
            "status_code": getattr(result, 'status_code', None),
        }, None


def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Crawl JavaScript-rendered pages with headless browser"
    )
    parser.add_argument("url", help="URL to crawl")
    parser.add_argument("--wait", type=int, default=3,
                        help="Seconds to wait after page load (default: 3)")
    parser.add_argument("--selector", type=str, default=None,
                        help="CSS selector to wait for before extracting")
    parser.add_argument("--scroll", action="store_true",
                        help="Scroll to bottom to trigger lazy loading")
    parser.add_argument("--save", type=str, default=None,
                        help="Also save output to this file path")
    parser.add_argument("--max-length", type=int, default=None,
                        help="Truncate output to N characters")

    args = parser.parse_args()

    url = args.url.strip()
    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    print(f"Crawling (dynamic): {url}", file=sys.stderr)
    print(f"Options: wait={args.wait}s, selector={args.selector}, scroll={args.scroll}", file=sys.stderr)

    # Run async crawl
    if args.scroll:
        data, error = asyncio.run(crawl_with_scroll(url, args.wait, args.selector))
    else:
        data, error = asyncio.run(crawl_page(url, args.wait, args.selector))

    if error:
        print(f"Error: crawl failed: {error}", file=sys.stderr)
        sys.exit(1)

    if not data or not data["markdown"]:
        print("Warning: no content extracted from page", file=sys.stderr)
        print("[No content could be extracted from this page]")
        sys.exit(0)

    # Build output
    parts = []
    if data["title"]:
        parts.append(f"# {data['title']}\n")
    parts.append(f"**Source**: {data['url']}")
    if data.get("status_code"):
        parts.append(f"**Status**: {data['status_code']}")
    parts.append("\n---\n")
    parts.append(data["markdown"])

    output = "\n".join(parts)

    # Truncate if requested
    if args.max_length and len(output) > args.max_length:
        output = output[:args.max_length] + f"\n\n[... truncated at {args.max_length} characters, total {len(output)}]"

    print(output)

    content_len = len(data["markdown"])
    print(f"\nExtracted: {content_len} characters (dynamic crawl)", file=sys.stderr)

    # Save to file if requested
    if args.save:
        try:
            with open(args.save, "w", encoding="utf-8") as f:
                f.write(output)
            print(f"Saved to: {args.save}", file=sys.stderr)
        except Exception as e:
            print(f"Error saving file: {e}", file=sys.stderr)


if __name__ == "__main__":
    main()
236
skills/aidotnet-web-scraper/scripts/extract_links.py
Normal file
@@ -0,0 +1,236 @@
#!/usr/bin/env python3
"""Extract and categorize all links from a web page.

Fetches the page and extracts all <a> tags, categorizing them as
internal, external, or resource links. Useful for site navigation
and discovery before deeper scraping.

Dependencies: pip install requests beautifulsoup4
"""

import sys
import argparse
import json
import re
from urllib.parse import urlparse, urljoin


def setup_encoding():
    """Setup proper encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that required packages are installed."""
    missing = []
    try:
        import requests  # noqa: F401
    except ImportError:
        missing.append("requests")
    try:
        from bs4 import BeautifulSoup  # noqa: F401
    except ImportError:
        missing.append("beautifulsoup4")
    if missing:
        print(f"Error: missing dependencies: {', '.join(missing)}", file=sys.stderr)
        print(f"Install with: pip install {' '.join(missing)}", file=sys.stderr)
        sys.exit(1)


RESOURCE_EXTENSIONS = {
    '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx',
    '.zip', '.rar', '.tar', '.gz', '.7z',
    '.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico',
    '.mp3', '.mp4', '.avi', '.mov', '.webm',
    '.css', '.js', '.woff', '.woff2', '.ttf', '.eot',
}


def classify_link(href, base_domain):
    """Classify a link as internal, external, or resource."""
    parsed = urlparse(href)

    # Check for resource files
    path_lower = parsed.path.lower()
    for ext in RESOURCE_EXTENSIONS:
        if path_lower.endswith(ext):
            return "resource"

    # Check domain
    link_domain = parsed.netloc.lower()
    if not link_domain or link_domain == base_domain:
        return "internal"

    # Check for common CDN / same-org subdomains.
    # Note: comparing only the last two labels also matches unrelated sites
    # under multi-part suffixes such as "co.uk".
    base_parts = base_domain.split(".")
    link_parts = link_domain.split(".")
    if len(base_parts) >= 2 and len(link_parts) >= 2:
        if base_parts[-2:] == link_parts[-2:]:
            return "internal"

    return "external"


def extract_links(html, base_url):
    """Extract all links from HTML."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    base_domain = urlparse(base_url).netloc.lower()
    links = []
    seen = set()

    for a_tag in soup.find_all("a", href=True):
        href = a_tag["href"].strip()

        # Skip anchors, javascript:, mailto:, tel:
        if not href or href.startswith(("#", "javascript:", "mailto:", "tel:")):
            continue

        # Resolve relative URLs
        full_url = urljoin(base_url, href)

        # Deduplicate
        if full_url in seen:
            continue
        seen.add(full_url)

        # Extract link text
        text = a_tag.get_text(strip=True) or ""
        text = re.sub(r'\s+', ' ', text)  # normalize whitespace
        if len(text) > 100:
            text = text[:100] + "..."

        link_type = classify_link(full_url, base_domain)

        links.append({
            "url": full_url,
            "text": text,
            "type": link_type,
        })

    return links


def format_markdown(links, url, filter_pattern=None, external_only=False):
    """Format links as Markdown."""
    # Apply filters
    filtered = links
    if external_only:
        filtered = [link for link in filtered if link["type"] == "external"]
    if filter_pattern:
        try:
            pattern = re.compile(filter_pattern, re.IGNORECASE)
            filtered = [link for link in filtered if pattern.search(link["url"])]
        except re.error as e:
            print(f"Warning: invalid regex pattern '{filter_pattern}': {e}", file=sys.stderr)

    # Group by type
    internal = [link for link in filtered if link["type"] == "internal"]
    external = [link for link in filtered if link["type"] == "external"]
    resources = [link for link in filtered if link["type"] == "resource"]

    parts = [f"# Links from {url}\n"]
    parts.append(f"Total: **{len(filtered)}** links ({len(internal)} internal, {len(external)} external, {len(resources)} resource)\n")

    if internal:
        parts.append("## Internal Links\n")
        for lk in internal:
            text = f" — {lk['text']}" if lk['text'] else ""
            parts.append(f"- {lk['url']}{text}")
        parts.append("")

    if external:
        parts.append("## External Links\n")
        for lk in external:
            text = f" — {lk['text']}" if lk['text'] else ""
            parts.append(f"- {lk['url']}{text}")
        parts.append("")

    if resources:
        parts.append("## Resource Links\n")
        for lk in resources:
            text = f" — {lk['text']}" if lk['text'] else ""
            parts.append(f"- {lk['url']}{text}")
        parts.append("")

    return "\n".join(parts)


def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Extract and categorize links from a web page"
    )
    parser.add_argument("url", help="URL to extract links from")
    parser.add_argument("--filter", type=str, default=None,
                        help="Regex pattern to filter URLs")
    parser.add_argument("--external-only", action="store_true",
                        help="Only show external links")
    parser.add_argument("--json", action="store_true",
                        help="Output as JSON instead of Markdown")
    parser.add_argument("--timeout", type=int, default=30,
                        help="Request timeout in seconds (default: 30)")

    args = parser.parse_args()

    import requests

    url = args.url.strip()
    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    print(f"Extracting links from: {url}", file=sys.stderr)

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
    }

    try:
        resp = requests.get(url, headers=headers, timeout=args.timeout, allow_redirects=True)
        resp.raise_for_status()
        if resp.encoding and resp.encoding.lower() != 'utf-8':
            resp.encoding = resp.apparent_encoding or resp.encoding
        html = resp.text
        final_url = resp.url
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

    links = extract_links(html, final_url)
    print(f"Found {len(links)} unique links", file=sys.stderr)

    if args.json:
        # Apply filters for JSON output too
        filtered = links
        if args.external_only:
            filtered = [lk for lk in filtered if lk["type"] == "external"]
        if args.filter:
            try:
                pattern = re.compile(args.filter, re.IGNORECASE)
                filtered = [lk for lk in filtered if pattern.search(lk["url"])]
            except re.error as e:
                # Match the Markdown path: warn and fall back to unfiltered output
                print(f"Warning: invalid regex pattern '{args.filter}': {e}", file=sys.stderr)
        print(json.dumps(filtered, indent=2, ensure_ascii=False))
    else:
        print(format_markdown(links, final_url, args.filter, args.external_only))


if __name__ == "__main__":
    main()
284
skills/aidotnet-web-scraper/scripts/fetch_page.py
Normal file
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""Fetch a web page and extract readable content as clean Markdown.

Uses requests + BeautifulSoup + readability-lxml + html2text for lightweight,
fast extraction without a headless browser. Works well for articles, docs,
blogs, wikis, and most static websites.

Dependencies: pip install requests beautifulsoup4 readability-lxml html2text
"""

import sys
import argparse


def setup_encoding():
    """Setup proper encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that required packages are installed."""
    missing = []
    try:
        import requests  # noqa: F401
    except ImportError:
        missing.append("requests")
    try:
        from bs4 import BeautifulSoup  # noqa: F401
    except ImportError:
        missing.append("beautifulsoup4")
    try:
        from readability import Document  # noqa: F401
    except ImportError:
        missing.append("readability-lxml")
    try:
        import html2text  # noqa: F401
    except ImportError:
        missing.append("html2text")

    if missing:
        print(f"Error: missing dependencies: {', '.join(missing)}", file=sys.stderr)
        print(f"Install with: pip install {' '.join(missing)}", file=sys.stderr)
        sys.exit(1)


def fetch_url(url, timeout=30):
    """Fetch URL content with proper headers."""
    import requests

    # Accept-Encoding is left to requests, which advertises only the
    # compression schemes it can decode. Hard-coding "br" risks a garbled
    # body when no Brotli decoder is installed.
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
    }

    try:
        resp = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
        resp.raise_for_status()

        # Detect encoding
        if resp.encoding and resp.encoding.lower() != 'utf-8':
            resp.encoding = resp.apparent_encoding or resp.encoding

        return resp.text, resp.url, resp.status_code
    except requests.exceptions.Timeout:
        print(f"Error: request timed out after {timeout}s", file=sys.stderr)
        sys.exit(1)
    except requests.exceptions.ConnectionError as e:
        print(f"Error: connection failed: {e}", file=sys.stderr)
        sys.exit(1)
    except requests.exceptions.HTTPError as e:
        print(f"Error: HTTP {e.response.status_code}: {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


def extract_with_readability(html, url):
    """Extract main article content using readability-lxml."""
    from readability import Document

    doc = Document(html, url=url)
    title = doc.short_title()
    content_html = doc.summary()
    return title, content_html


def extract_with_selector(html, selector):
    """Extract content matching a CSS selector."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    elements = soup.select(selector)
    if not elements:
        return None

    # Combine all matching elements
    parts = []
    for el in elements:
        parts.append(str(el))
    return "\n".join(parts)


def html_to_markdown(html, base_url=None):
    """Convert HTML to clean Markdown."""
    import re
    import html2text

    converter = html2text.HTML2Text()
    converter.body_width = 0  # Don't wrap lines
    converter.ignore_images = False
    converter.ignore_links = False
    converter.ignore_emphasis = False
    converter.protect_links = True
    converter.unicode_snob = True
    converter.mark_code = True
    converter.wrap_links = False
    converter.single_line_break = False

    if base_url:
        converter.baseurl = base_url

    md = converter.handle(html)

    # Clean up excessive blank lines
    md = re.sub(r'\n{3,}', '\n\n', md)
    return md.strip()


def extract_metadata(html):
    """Extract page metadata (title, description, etc.)."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    meta = {}

    # Title
    title_tag = soup.find("title")
    if title_tag:
        meta["title"] = title_tag.get_text(strip=True)

    # Meta description
    desc_tag = soup.find("meta", attrs={"name": "description"})
    if desc_tag and desc_tag.get("content"):
        meta["description"] = desc_tag["content"].strip()

    # OG tags
    for prop in ["og:title", "og:description", "og:type", "og:site_name"]:
        tag = soup.find("meta", attrs={"property": prop})
        if tag and tag.get("content"):
            meta[prop.replace("og:", "og_")] = tag["content"].strip()

    # Author
    author_tag = soup.find("meta", attrs={"name": "author"})
    if author_tag and author_tag.get("content"):
        meta["author"] = author_tag["content"].strip()

    # Published date
    for attr in ["article:published_time", "datePublished", "date"]:
        date_tag = soup.find("meta", attrs={"property": attr}) or soup.find("meta", attrs={"name": attr})
        if date_tag and date_tag.get("content"):
            meta["published"] = date_tag["content"].strip()
            break

    return meta
|
||||
|
||||
|
||||
def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Fetch a web page and extract content as Markdown"
    )
    parser.add_argument("url", help="URL to fetch")
    parser.add_argument("--raw", action="store_true",
                        help="Output full page Markdown (no readability extraction)")
    parser.add_argument("--selector", type=str, default=None,
                        help="CSS selector to extract specific elements")
    parser.add_argument("--save", type=str, default=None,
                        help="Also save output to this file path")
    parser.add_argument("--max-length", type=int, default=None,
                        help="Truncate output to N characters")
    parser.add_argument("--timeout", type=int, default=30,
                        help="Request timeout in seconds (default: 30)")
    parser.add_argument("--no-metadata", action="store_true",
                        help="Skip metadata header in output")

    args = parser.parse_args()

    # Normalize URL
    url = args.url.strip()
    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    print(f"Fetching: {url}", file=sys.stderr)

    # Fetch
    html, final_url, status = fetch_url(url, timeout=args.timeout)
    print(f"Status: {status}, Size: {len(html)} bytes", file=sys.stderr)

    if final_url != url:
        print(f"Redirected to: {final_url}", file=sys.stderr)

    # Extract metadata
    meta = extract_metadata(html) if not args.no_metadata else {}

    # Extract content
    if args.selector:
        # CSS selector mode
        selected_html = extract_with_selector(html, args.selector)
        if not selected_html:
            print(f"Warning: no elements matched selector '{args.selector}'", file=sys.stderr)
            print(f"[No elements matched CSS selector: {args.selector}]")
            sys.exit(0)
        title = meta.get("title", "")
        content_md = html_to_markdown(selected_html, base_url=final_url)
    elif args.raw:
        # Raw full-page mode
        title = meta.get("title", "")
        content_md = html_to_markdown(html, base_url=final_url)
    else:
        # Readability extraction mode (default)
        title, article_html = extract_with_readability(html, final_url)
        content_md = html_to_markdown(article_html, base_url=final_url)

    # Build output
    parts = []

    if not args.no_metadata and meta:
        parts.append(f"# {title or meta.get('title', 'Untitled')}")
        parts.append(f"\n**Source**: {final_url}")
        if meta.get("author"):
            parts.append(f"**Author**: {meta['author']}")
        if meta.get("published"):
            parts.append(f"**Published**: {meta['published']}")
        if meta.get("description"):
            parts.append(f"**Description**: {meta['description']}")
        parts.append("\n---\n")
    elif title and not args.no_metadata:
        parts.append(f"# {title}\n")

    parts.append(content_md)

    output = "\n".join(parts)

    # Truncate if requested
    if args.max_length and len(output) > args.max_length:
        output = output[:args.max_length] + f"\n\n[... truncated at {args.max_length} characters, total {len(output)}]"

    # Print to stdout
    print(output)

    content_length = len(content_md)
    print(f"\nExtracted: {content_length} characters", file=sys.stderr)

    # Save to file if requested
    if args.save:
        try:
            with open(args.save, "w", encoding="utf-8") as f:
                f.write(output)
            print(f"Saved to: {args.save}", file=sys.stderr)
        except Exception as e:
            print(f"Error saving file: {e}", file=sys.stderr)


if __name__ == "__main__":
    main()
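The URL normalization and output truncation steps in `main()` above can be pulled out and tried in isolation. The following is a minimal sketch, not part of the commit: the helper names `normalize_url` and `truncate_output` are hypothetical and only mirror the inline logic shown in `main()`.

```python
from typing import Optional


def normalize_url(raw: str) -> str:
    """Strip whitespace and prepend https:// when no http(s) scheme is given.

    Hypothetical helper; mirrors the "Normalize URL" step in main().
    """
    url = raw.strip()
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return url


def truncate_output(output: str, max_length: Optional[int]) -> str:
    """Cut output to max_length characters and append a truncation notice.

    Hypothetical helper; mirrors the "Truncate if requested" step in main().
    A max_length of None or 0 leaves the text untouched.
    """
    if max_length and len(output) > max_length:
        # len(output) is the length of the full text, since this expression is
        # evaluated before the truncated string is returned.
        return (
            output[:max_length]
            + f"\n\n[... truncated at {max_length} characters, total {len(output)}]"
        )
    return output
```

Note that the truncation notice is appended after the cut, so the final string is longer than `max_length`; callers with a hard size budget would need to account for that.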
155
skills/aidotnet-web-scraper/scripts/search_web.py
Normal file
@@ -0,0 +1,155 @@
#!/usr/bin/env python3
"""Search the web using DuckDuckGo and return structured results.

No API key required. Returns results with title, URL, and snippet.

Dependencies: pip install ddgs
"""

import sys
import argparse
import json


def setup_encoding():
    """Setup proper encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that required packages are installed."""
    try:
        from ddgs import DDGS  # noqa: F401
    except ImportError:
        try:
            from duckduckgo_search import DDGS  # noqa: F401
        except ImportError:
            print("Error: ddgs not installed.", file=sys.stderr)
            print("Install with: pip install ddgs", file=sys.stderr)
            sys.exit(1)


def _get_ddgs_class():
    """Import DDGS from ddgs (new) or duckduckgo_search (legacy)."""
    try:
        from ddgs import DDGS
        return DDGS
    except ImportError:
        from duckduckgo_search import DDGS
        return DDGS


def search_text(query, max_results=10, region="wt-wt"):
    """Perform a text search."""
    DDGS = _get_ddgs_class()
    ddgs = DDGS()
    try:
        # New ddgs package: positional 'query' arg
        results = list(ddgs.text(query, region=region, max_results=max_results))
    except TypeError:
        # Legacy duckduckgo_search: 'keywords' kwarg + context manager
        with DDGS() as d:
            results = list(d.text(keywords=query, region=region, max_results=max_results))
    return results


def search_news(query, max_results=10, region="wt-wt"):
    """Perform a news search."""
    DDGS = _get_ddgs_class()
    ddgs = DDGS()
    try:
        results = list(ddgs.news(query, region=region, max_results=max_results))
    except TypeError:
        with DDGS() as d:
            results = list(d.news(keywords=query, region=region, max_results=max_results))
    return results


def format_results_markdown(results, query, is_news=False):
    """Format search results as Markdown."""
    search_type = "News" if is_news else "Web"
    parts = [f"# {search_type} Search Results: {query}\n"]
    parts.append(f"Found **{len(results)}** results.\n")

    for i, r in enumerate(results, 1):
        title = r.get("title", "Untitled")
        url = r.get("href") or r.get("url") or r.get("link", "")
        body = r.get("body") or r.get("snippet", "")
        date = r.get("date", "")

        parts.append(f"## {i}. {title}")
        parts.append(f"**URL**: {url}")
        if date:
            parts.append(f"**Date**: {date}")
        if body:
            parts.append(f"\n{body}")
        parts.append("")  # blank line

    return "\n".join(parts)


def format_results_json(results):
    """Format search results as JSON."""
    return json.dumps(results, indent=2, ensure_ascii=False)


def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Search the web via DuckDuckGo"
    )
    parser.add_argument("query", help="Search query")
    parser.add_argument("--max-results", type=int, default=10,
                        help="Number of results (default: 10)")
    parser.add_argument("--region", type=str, default="wt-wt",
                        help="Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt)")
    parser.add_argument("--news", action="store_true",
                        help="Search news instead of general web")
    parser.add_argument("--json", action="store_true",
                        help="Output as JSON instead of Markdown")

    args = parser.parse_args()

    query = args.query.strip()
    if not query:
        print("Error: empty query", file=sys.stderr)
        sys.exit(1)

    print(f"Searching: {query} (region={args.region}, max={args.max_results})", file=sys.stderr)

    try:
        if args.news:
            results = search_news(query, args.max_results, args.region)
        else:
            results = search_text(query, args.max_results, args.region)
    except Exception as e:
        print(f"Error: search failed: {e}", file=sys.stderr)
        sys.exit(1)

    if not results:
        print(f"No results found for: {query}")
        sys.exit(0)

    print(f"Got {len(results)} results", file=sys.stderr)

    if args.json:
        print(format_results_json(results))
    else:
        print(format_results_markdown(results, query, is_news=args.news))


if __name__ == "__main__":
    main()
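`format_results_markdown()` above tolerates result dictionaries whose URL and snippet fields arrive under different key names (`href`/`url`/`link`, `body`/`snippet`). The sketch below isolates that fallback so it can be checked without a network call. The helper name `normalize_result` and the sample records are made up for illustration; which keys a given `ddgs` or `duckduckgo_search` release actually returns is not verified here.

```python
def normalize_result(r: dict) -> dict:
    """Map one raw search result onto a fixed set of keys.

    Hypothetical helper; uses the same fallback order as
    format_results_markdown() in search_web.py.
    """
    return {
        "title": r.get("title", "Untitled"),
        "url": r.get("href") or r.get("url") or r.get("link", ""),
        "snippet": r.get("body") or r.get("snippet", ""),
        "date": r.get("date", ""),
    }


if __name__ == "__main__":
    # Made-up sample records, not real search output.
    samples = [
        {"title": "First", "href": "https://a.example", "body": "text search shape"},
        {"url": "https://b.example", "snippet": "other shape", "date": "2024-01-01"},
    ]
    for item in samples:
        print(normalize_result(item))
```

Normalizing once, before formatting, would also give the `--json` output stable key names, at the cost of dropping any extra fields the backend returns.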