AI Newsletter Digest improvements: fixed QP soft line break decoding, URL extraction, and content cleaning

Krilly
2026-03-04 13:29:22 +00:00
parent 29a98137a7
commit 57dd294675
13706 changed files with 2114953 additions and 237629 deletions


@@ -0,0 +1,157 @@
---
name: web-scraper
description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs (the legacy duckduckgo-search package also works).
---
# Web Scraper
Fetch, search, and extract content from websites.
## When to use this skill
- User asks to fetch or read a webpage / URL
- User wants to search the internet for information
- User needs to extract links, tables, or structured data from a website
- User asks to crawl a JavaScript-rendered (dynamic) page
- User wants web content converted to clean Markdown for analysis
## Scripts overview
| Script | Purpose | Dependencies |
|---|---|---|
| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | `requests`, `beautifulsoup4`, `readability-lxml`, `html2text` |
| `search_web.py` | Search the web via DuckDuckGo | `ddgs` |
| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | `crawl4ai` |
| `extract_links.py` | Extract and categorize all links from a page | `requests`, `beautifulsoup4` |
## Steps
### 1. Install dependencies (first time only)
For lightweight scraping (static pages, search, link extraction):
```bash
pip install requests beautifulsoup4 readability-lxml html2text ddgs
```
For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
```bash
pip install crawl4ai
crawl4ai-setup
```
> **Note**: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install it if you actually need dynamic page support.

> **CRITICAL — Dependency Error Recovery**: If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the matching command above, then **re-run the EXACT SAME script command that failed**. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
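
Before re-running a failed command, it can help to see which packages are actually missing. The following is a minimal sketch, not part of the skill's scripts: the helper name `missing_packages` and the `LIGHTWEIGHT_DEPS` mapping are hypothetical, and the mapping only reflects the pip and module names used elsewhere in this document.

```python
import importlib.util

# pip package name -> importable module name (the two differ for some packages).
# Note: search_web.py also accepts the legacy duckduckgo_search module,
# which this simple mapping does not account for.
LIGHTWEIGHT_DEPS = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "readability-lxml": "readability",
    "html2text": "html2text",
    "ddgs": "ddgs",
}


def missing_packages(deps):
    """Return the pip names whose modules cannot be found by this interpreter."""
    return [
        pip_name
        for pip_name, module in deps.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(LIGHTWEIGHT_DEPS)
    if missing:
        print("pip install " + " ".join(missing))
    else:
        print("All lightweight dependencies are importable.")
```

The check only looks the modules up and never imports them, so it stays fast and has no side effects.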
### 2. Fetch a web page (static — recommended first choice)
Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.
```bash
python scripts/fetch_page.py "URL"
```
Options:
- `--raw` — Output full page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"` — Extract only elements matching the CSS selector (e.g. `".article-body"`, `"table"`, `"#content"`)
- `--save OUTPUT_PATH` — Also save output to a file
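- `--max-length N` — Truncate output to N characters (default: no limit)
- `--timeout N` — Request timeout in seconds (default: 30)
- `--no-metadata` — Skip the metadata header (title, source, author, date) in the output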
Examples:
```bash
# Fetch an article
python scripts/fetch_page.py "https://example.com/article"
# Extract only tables
python scripts/fetch_page.py "https://example.com/data" --selector "table"
# Fetch raw full-page markdown, limit to 5000 chars
python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
```
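
To make the effect of `--max-length` concrete, here is a small sketch of the truncation rule as it appears in the scripts. The standalone function name `truncate` is hypothetical; the scripts apply the same logic inline.

```python
def truncate(output, max_length=None):
    """Cut output to max_length characters and append a truncation notice."""
    if max_length and len(output) > max_length:
        notice = f"\n\n[... truncated at {max_length} characters, total {len(output)}]"
        return output[:max_length] + notice
    return output


print(truncate("abcdef", 3))
```

The notice is appended after the cut, so the final text is slightly longer than N characters. Leave some headroom when sizing N against a context budget.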
### 3. Search the web
Search using DuckDuckGo (no API key required).
```bash
python scripts/search_web.py "search query"
```
Options:
- `--max-results N` — Number of results to return (default: 10)
- `--region REGION` — Region code, e.g. `cn-zh`, `us-en`, `jp-jp` (default: `wt-wt` for worldwide)
- `--news` — Search news instead of general web
- `--json` — Output as JSON instead of Markdown
Examples:
```bash
# General search
python scripts/search_web.py "Python web scraping best practices 2025"
# News search, Chinese region, 5 results
python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
```
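
Result dictionaries do not always use the same key names, and `search_web.py` tolerates several alternatives when it formats them. The sketch below mirrors that lookup order. The function name `normalize_result` and the sample dictionaries are hypothetical and only illustrate the shape of the data.

```python
def normalize_result(result):
    """Map a raw result dict onto fixed title/url/snippet/date keys."""
    return {
        "title": result.get("title", "Untitled"),
        # The URL may be stored under "href", "url", or "link".
        "url": result.get("href") or result.get("url") or result.get("link", ""),
        # The summary text may be stored under "body" or "snippet".
        "snippet": result.get("body") or result.get("snippet", ""),
        "date": result.get("date", ""),
    }


# Hypothetical sample data, not real search output.
samples = [
    {"title": "A page", "href": "https://example.com/a", "body": "Text A"},
    {"title": "A story", "url": "https://example.com/b", "date": "2025-01-01"},
]
for item in samples:
    print(normalize_result(item))
```

Normalizing once keeps later steps, such as picking the best URL to pass to `fetch_page.py`, independent of which key names a given result used.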
### 4. Crawl a dynamic / JavaScript-rendered page
Use this only when `fetch_page.py` returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).
```bash
python scripts/crawl_dynamic.py "URL"
```
Options:
- `--wait N` — Wait N seconds after page load for JS to finish (default: 3)
- `--selector "CSS_SELECTOR"` — Wait for a specific element to appear before extracting
- `--scroll` — Scroll to bottom of page to trigger lazy loading
- `--save OUTPUT_PATH` — Also save output to a file
- `--max-length N` — Truncate output to N characters
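
When the script is driven from another program, the options above can be assembled into an argument list rather than a shell string, which sidesteps quoting problems with URLs and selectors. This is a sketch only: the helper `build_crawl_command` is hypothetical, and the script path assumes the `scripts/` layout used in this document.

```python
import sys


def build_crawl_command(url, wait=3, selector=None, scroll=False,
                        save=None, max_length=None):
    """Build an argv list for crawl_dynamic.py from the documented options."""
    cmd = [sys.executable, "scripts/crawl_dynamic.py", url, "--wait", str(wait)]
    if selector:
        cmd += ["--selector", selector]
    if scroll:
        cmd.append("--scroll")
    if save:
        cmd += ["--save", save]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    return cmd


print(build_crawl_command("https://example.com/app", wait=5, scroll=True))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`. The page content arrives on stdout, while progress messages go to stderr.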
### 5. Extract links from a page
Extract all links with their text labels, categorized by type (internal, external, resource).
```bash
python scripts/extract_links.py "URL"
```
Options:
- `--filter PATTERN` — Only show links matching a regex pattern (applied to URL)
- `--external-only` — Only show external links
- `--json` — Output as JSON instead of Markdown
- `--timeout N` — Request timeout in seconds (default: 30)
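
The categories can be illustrated with a condensed sketch of the classification rule, using only the standard library. The function name `classify` and the shortened extension list are assumptions made for the example. The actual script uses a longer extension list and skips `#`, `javascript:`, `mailto:`, and `tel:` links before classifying.

```python
from urllib.parse import urljoin, urlparse

# Abbreviated list for illustration only.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".css", ".js")


def classify(href, base_url):
    """Resolve href against base_url and label it resource/internal/external."""
    full_url = urljoin(base_url, href)
    parsed = urlparse(full_url)
    # File-like targets count as resources regardless of domain.
    if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
        return full_url, "resource"
    base_domain = urlparse(base_url).netloc.lower()
    link_domain = parsed.netloc.lower()
    # Same host, or same last two domain labels (e.g. cdn.example.com).
    if link_domain == base_domain or link_domain.split(".")[-2:] == base_domain.split(".")[-2:]:
        return full_url, "internal"
    return full_url, "external"


for link in ["/docs/guide", "https://cdn.example.com/a", "https://other.org/x", "report.PDF"]:
    print(classify(link, "https://example.com/files/"))
```

Comparing only the last two domain labels is a heuristic. It treats subdomains as internal, but it also treats unrelated sites under a shared suffix such as `co.uk` as internal.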
## Decision guide: which script to use
1. **Start with `fetch_page.py`** — handles most static websites (articles, docs, blogs, wikis).
2. If `fetch_page.py` returns empty/garbled content → try **`crawl_dynamic.py`** (the page likely needs JavaScript).
3. Need to find URLs first? → Use **`search_web.py`** to discover relevant pages.
4. Need to navigate a site structure? → Use **`extract_links.py`** to map out links, then fetch individual pages.
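
The first two rules of the guide amount to a static-first fallback, sketched below. The function `fetch_with_fallback`, its injected fetcher callables, and the `min_chars` threshold of 200 are all hypothetical. The skill itself leaves the "empty or garbled" judgment to the caller.

```python
def fetch_with_fallback(url, static_fetch, dynamic_fetch, min_chars=200):
    """Try the static fetcher first; use the dynamic one if the result is thin."""
    text = static_fetch(url)
    if text and len(text.strip()) >= min_chars:
        return "static", text
    # Too little content usually means the page needs JavaScript to render.
    return "dynamic", dynamic_fetch(url)


# Stub fetchers standing in for fetch_page.py and crawl_dynamic.py.
def fake_static(url):
    return ""  # simulates an SPA shell with no readable content


def fake_dynamic(url):
    return "# Rendered page\n" + "content " * 50


mode, text = fetch_with_fallback("https://example.com/app", fake_static, fake_dynamic)
print(mode, len(text))
```

Passing the fetchers in as callables keeps the decision logic testable without network access or a browser install.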
## Common workflows
### Research a topic
1. `search_web.py "topic"` → get relevant URLs
2. `fetch_page.py "best_url"` → read the content
3. Repeat for multiple sources, then synthesize
### Scrape structured data from a page
1. `fetch_page.py "url" --selector "table"` → extract tables
2. Or `fetch_page.py "url" --selector ".product-card"` → extract specific elements
### Crawl a modern web app (SPA)
1. `crawl_dynamic.py "url" --wait 5 --scroll` → full JS-rendered content
## Edge cases
- **Paywalled sites**: May return partial content or login pages. Inform the user.
- **Rate limiting / CAPTCHAs**: If requests fail with 403/429, wait and retry or inform the user.
- **Very large pages**: Use `--max-length` to truncate output and avoid overwhelming the context window.
- **Encoding issues**: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
- **Robots.txt**: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
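
Because the scripts skip robots.txt, a caller who wants to honor it can check before fetching. The sketch below uses the standard library's `urllib.robotparser` on robots.txt text that has already been downloaded. The helper name `is_allowed` and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))
```

Parsing text that is already in hand keeps the check separate from network access, so the same function works whether robots.txt came from `fetch_page.py --raw`, a cache, or a test fixture.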
## Scripts
- [fetch_page.py](scripts/fetch_page.py) — Fetch and extract readable content as Markdown
- [search_web.py](scripts/search_web.py) — Search the web via DuckDuckGo
- [crawl_dynamic.py](scripts/crawl_dynamic.py) — Crawl JavaScript-rendered pages
- [extract_links.py](scripts/extract_links.py) — Extract and categorize page links


@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""Crawl JavaScript-rendered (dynamic) web pages using Crawl4AI.
Uses a headless Chromium browser to render pages that require JavaScript,
then extracts clean Markdown content. Use this when fetch_page.py returns
empty or incomplete content (SPAs, React/Vue apps, etc.).
Dependencies: pip install crawl4ai && crawl4ai-setup
"""
import sys
import argparse
import asyncio


def setup_encoding():
    """Set up UTF-8 encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that Crawl4AI is installed."""
    try:
        import crawl4ai  # noqa: F401
    except ImportError:
        print("Error: crawl4ai not installed.", file=sys.stderr)
        print("Install with:", file=sys.stderr)
        print("  pip install crawl4ai", file=sys.stderr)
        print("  crawl4ai-setup", file=sys.stderr)
        sys.exit(1)


async def crawl_page(url, wait_seconds=3, css_selector=None, scroll=False):
    """Crawl a page with a headless browser.

    Returns (data, None) on success or (None, error_message) on failure.
    The scroll argument is unused here; see crawl_with_scroll().
    """
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
    from crawl4ai.content_filter_strategy import PruningContentFilter
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

    browser_conf = BrowserConfig(
        headless=True,
        verbose=False,
    )
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
        page_timeout=60000,  # 60s
        wait_until="networkidle",
    )
    # Add wait time if specified
    if wait_seconds > 0:
        run_conf.delay_before_return_html = wait_seconds
    # Wait for specific CSS selector
    if css_selector:
        run_conf.wait_for = f"css:{css_selector}"

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(url=url, config=run_conf)

    if not result.success:
        return None, result.error_message or "Unknown error"

    # Get the best available markdown
    md = ""
    if result.markdown:
        if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
            md = result.markdown.fit_markdown
        elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown:
            md = result.markdown.raw_markdown
        elif isinstance(result.markdown, str):
            md = result.markdown

    title = ""
    if hasattr(result, 'metadata') and result.metadata:
        title = result.metadata.get('title', '')

    return {
        "title": title,
        "url": result.url or url,
        "markdown": md,
        "status_code": getattr(result, 'status_code', None),
    }, None


async def crawl_with_scroll(url, wait_seconds=3, css_selector=None):
    """Crawl with infinite scroll support."""
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
    from crawl4ai.content_filter_strategy import PruningContentFilter

    browser_conf = BrowserConfig(
        headless=True,
        verbose=False,
    )
    # JavaScript to scroll to bottom
    scroll_js = """
    async () => {
        await new Promise((resolve) => {
            let totalHeight = 0;
            const distance = 500;
            const timer = setInterval(() => {
                const scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 300);
            // Safety timeout
            setTimeout(() => { clearInterval(timer); resolve(); }, 15000);
        });
    }
    """
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
        page_timeout=60000,
        js_code=scroll_js,
        wait_until="networkidle",
    )
    if wait_seconds > 0:
        run_conf.delay_before_return_html = wait_seconds
    if css_selector:
        run_conf.wait_for = f"css:{css_selector}"

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(url=url, config=run_conf)

    if not result.success:
        return None, result.error_message or "Unknown error"

    md = ""
    if result.markdown:
        if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
            md = result.markdown.fit_markdown
        elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown:
            md = result.markdown.raw_markdown
        elif isinstance(result.markdown, str):
            md = result.markdown

    title = ""
    if hasattr(result, 'metadata') and result.metadata:
        title = result.metadata.get('title', '')

    return {
        "title": title,
        "url": result.url or url,
        "markdown": md,
        "status_code": getattr(result, 'status_code', None),
    }, None


def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Crawl JavaScript-rendered pages with headless browser"
    )
    parser.add_argument("url", help="URL to crawl")
    parser.add_argument("--wait", type=int, default=3,
                        help="Seconds to wait after page load (default: 3)")
    parser.add_argument("--selector", type=str, default=None,
                        help="CSS selector to wait for before extracting")
    parser.add_argument("--scroll", action="store_true",
                        help="Scroll to bottom to trigger lazy loading")
    parser.add_argument("--save", type=str, default=None,
                        help="Also save output to this file path")
    parser.add_argument("--max-length", type=int, default=None,
                        help="Truncate output to N characters")
    args = parser.parse_args()

    url = args.url.strip()
    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    print(f"Crawling (dynamic): {url}", file=sys.stderr)
    print(f"Options: wait={args.wait}s, selector={args.selector}, scroll={args.scroll}", file=sys.stderr)

    # Run async crawl
    if args.scroll:
        data, error = asyncio.run(crawl_with_scroll(url, args.wait, args.selector))
    else:
        data, error = asyncio.run(crawl_page(url, args.wait, args.selector))

    if error:
        print(f"Error: crawl failed: {error}", file=sys.stderr)
        sys.exit(1)
    if not data or not data["markdown"]:
        print("Warning: no content extracted from page", file=sys.stderr)
        print("[No content could be extracted from this page]")
        sys.exit(0)

    # Build output
    parts = []
    if data["title"]:
        parts.append(f"# {data['title']}\n")
    parts.append(f"**Source**: {data['url']}")
    if data.get("status_code"):
        parts.append(f"**Status**: {data['status_code']}")
    parts.append("\n---\n")
    parts.append(data["markdown"])
    output = "\n".join(parts)

    # Truncate if requested
    if args.max_length and len(output) > args.max_length:
        output = (
            output[:args.max_length]
            + f"\n\n[... truncated at {args.max_length} characters, total {len(output)}]"
        )

    print(output)
    content_len = len(data["markdown"])
    print(f"\nExtracted: {content_len} characters (dynamic crawl)", file=sys.stderr)

    # Save to file if requested
    if args.save:
        try:
            with open(args.save, "w", encoding="utf-8") as f:
                f.write(output)
            print(f"Saved to: {args.save}", file=sys.stderr)
        except Exception as e:
            print(f"Error saving file: {e}", file=sys.stderr)


if __name__ == "__main__":
    main()


@@ -0,0 +1,236 @@
#!/usr/bin/env python3
"""Extract and categorize all links from a web page.
Fetches the page and extracts all <a> tags, categorizing them as
internal, external, or resource links. Useful for site navigation
and discovery before deeper scraping.
Dependencies: pip install requests beautifulsoup4
"""
import sys
import argparse
import json
import re
from urllib.parse import urlparse, urljoin


def setup_encoding():
    """Set up UTF-8 encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that required packages are installed."""
    missing = []
    try:
        import requests  # noqa: F401
    except ImportError:
        missing.append("requests")
    try:
        from bs4 import BeautifulSoup  # noqa: F401
    except ImportError:
        missing.append("beautifulsoup4")
    if missing:
        print(f"Error: missing dependencies: {', '.join(missing)}", file=sys.stderr)
        print(f"Install with: pip install {' '.join(missing)}", file=sys.stderr)
        sys.exit(1)


RESOURCE_EXTENSIONS = {
    '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx',
    '.zip', '.rar', '.tar', '.gz', '.7z',
    '.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico',
    '.mp3', '.mp4', '.avi', '.mov', '.webm',
    '.css', '.js', '.woff', '.woff2', '.ttf', '.eot',
}


def classify_link(href, base_domain):
    """Classify a link as internal, external, or resource."""
    parsed = urlparse(href)
    # Check for resource files
    path_lower = parsed.path.lower()
    for ext in RESOURCE_EXTENSIONS:
        if path_lower.endswith(ext):
            return "resource"
    # Check domain
    link_domain = parsed.netloc.lower()
    if not link_domain or link_domain == base_domain:
        return "internal"
    # Check for common CDN / same-org subdomains by comparing the last two
    # domain labels. This is a heuristic: it also matches unrelated sites
    # under a shared two-label suffix such as "co.uk".
    base_parts = base_domain.split(".")
    link_parts = link_domain.split(".")
    if len(base_parts) >= 2 and len(link_parts) >= 2:
        if base_parts[-2:] == link_parts[-2:]:
            return "internal"
    return "external"


def extract_links(html, base_url):
    """Extract all links from HTML."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    base_domain = urlparse(base_url).netloc.lower()
    links = []
    seen = set()
    for a_tag in soup.find_all("a", href=True):
        href = a_tag["href"].strip()
        # Skip anchors, javascript:, mailto:, tel:
        if not href or href.startswith(("#", "javascript:", "mailto:", "tel:")):
            continue
        # Resolve relative URLs
        full_url = urljoin(base_url, href)
        # Deduplicate
        if full_url in seen:
            continue
        seen.add(full_url)
        # Extract link text
        text = a_tag.get_text(strip=True) or ""
        text = re.sub(r'\s+', ' ', text)  # normalize whitespace
        if len(text) > 100:
            text = text[:100] + "..."
        link_type = classify_link(full_url, base_domain)
        links.append({
            "url": full_url,
            "text": text,
            "type": link_type,
        })
    return links


def format_markdown(links, url, filter_pattern=None, external_only=False):
    """Format links as Markdown, grouped by type."""
    # Apply filters
    filtered = links
    if external_only:
        filtered = [link for link in filtered if link["type"] == "external"]
    if filter_pattern:
        try:
            pattern = re.compile(filter_pattern, re.IGNORECASE)
            filtered = [link for link in filtered if pattern.search(link["url"])]
        except re.error as e:
            print(f"Warning: invalid regex pattern '{filter_pattern}': {e}", file=sys.stderr)

    # Group by type
    internal = [link for link in filtered if link["type"] == "internal"]
    external = [link for link in filtered if link["type"] == "external"]
    resources = [link for link in filtered if link["type"] == "resource"]

    parts = [f"# Links from {url}\n"]
    parts.append(
        f"Total: **{len(filtered)}** links ({len(internal)} internal, "
        f"{len(external)} external, {len(resources)} resource)\n"
    )
    sections = [
        ("## Internal Links\n", internal),
        ("## External Links\n", external),
        ("## Resource Links\n", resources),
    ]
    for heading, group in sections:
        if not group:
            continue
        parts.append(heading)
        for lk in group:
            # Separate the URL from its label so the two do not run together.
            label = f" - {lk['text']}" if lk['text'] else ""
            parts.append(f"- {lk['url']}{label}")
        parts.append("")
    return "\n".join(parts)


def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Extract and categorize links from a web page"
    )
    parser.add_argument("url", help="URL to extract links from")
    parser.add_argument("--filter", type=str, default=None,
                        help="Regex pattern to filter URLs")
    parser.add_argument("--external-only", action="store_true",
                        help="Only show external links")
    parser.add_argument("--json", action="store_true",
                        help="Output as JSON instead of Markdown")
    parser.add_argument("--timeout", type=int, default=30,
                        help="Request timeout in seconds (default: 30)")
    args = parser.parse_args()

    import requests

    url = args.url.strip()
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    print(f"Extracting links from: {url}", file=sys.stderr)

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
    }
    try:
        resp = requests.get(url, headers=headers, timeout=args.timeout, allow_redirects=True)
        resp.raise_for_status()
        if resp.encoding and resp.encoding.lower() != 'utf-8':
            resp.encoding = resp.apparent_encoding or resp.encoding
        html = resp.text
        final_url = resp.url
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

    links = extract_links(html, final_url)
    print(f"Found {len(links)} unique links", file=sys.stderr)

    if args.json:
        # Apply filters for JSON output too
        filtered = links
        if args.external_only:
            filtered = [lk for lk in filtered if lk["type"] == "external"]
        if args.filter:
            try:
                pattern = re.compile(args.filter, re.IGNORECASE)
                filtered = [lk for lk in filtered if pattern.search(lk["url"])]
            except re.error as e:
                # Warn instead of silently ignoring, matching the Markdown path.
                print(f"Warning: invalid regex pattern '{args.filter}': {e}", file=sys.stderr)
        print(json.dumps(filtered, indent=2, ensure_ascii=False))
    else:
        print(format_markdown(links, final_url, args.filter, args.external_only))


if __name__ == "__main__":
    main()


@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""Fetch a web page and extract readable content as clean Markdown.
Uses requests + BeautifulSoup + readability-lxml + html2text for lightweight,
fast extraction without a headless browser. Works well for articles, docs,
blogs, wikis, and most static websites.
Dependencies: pip install requests beautifulsoup4 readability-lxml html2text
"""
import sys
import argparse


def setup_encoding():
    """Set up UTF-8 encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that required packages are installed."""
    missing = []
    try:
        import requests  # noqa: F401
    except ImportError:
        missing.append("requests")
    try:
        from bs4 import BeautifulSoup  # noqa: F401
    except ImportError:
        missing.append("beautifulsoup4")
    try:
        from readability import Document  # noqa: F401
    except ImportError:
        missing.append("readability-lxml")
    try:
        import html2text  # noqa: F401
    except ImportError:
        missing.append("html2text")
    if missing:
        print(f"Error: missing dependencies: {', '.join(missing)}", file=sys.stderr)
        print(f"Install with: pip install {' '.join(missing)}", file=sys.stderr)
        sys.exit(1)


def fetch_url(url, timeout=30):
    """Fetch URL content with browser-like headers."""
    import requests

    # Accept-Encoding is deliberately left to requests, which only advertises
    # encodings it can decode. Advertising "br" without a Brotli decoder
    # installed can produce an undecodable response body.
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
    }
    try:
        resp = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
        resp.raise_for_status()
        # Detect encoding
        if resp.encoding and resp.encoding.lower() != 'utf-8':
            resp.encoding = resp.apparent_encoding or resp.encoding
        return resp.text, resp.url, resp.status_code
    except requests.exceptions.Timeout:
        print(f"Error: request timed out after {timeout}s", file=sys.stderr)
        sys.exit(1)
    except requests.exceptions.ConnectionError as e:
        print(f"Error: connection failed: {e}", file=sys.stderr)
        sys.exit(1)
    except requests.exceptions.HTTPError as e:
        print(f"Error: HTTP {e.response.status_code}: {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


def extract_with_readability(html, url):
    """Extract main article content using readability-lxml."""
    from readability import Document

    doc = Document(html, url=url)
    title = doc.short_title()
    content_html = doc.summary()
    return title, content_html


def extract_with_selector(html, selector):
    """Extract content matching a CSS selector."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    elements = soup.select(selector)
    if not elements:
        return None
    # Combine all matching elements
    return "\n".join(str(el) for el in elements)


def html_to_markdown(html, base_url=None):
    """Convert HTML to clean Markdown."""
    import re
    import html2text

    converter = html2text.HTML2Text()
    converter.body_width = 0  # Don't wrap lines
    converter.ignore_images = False
    converter.ignore_links = False
    converter.ignore_emphasis = False
    converter.protect_links = True
    converter.unicode_snob = True
    converter.mark_code = True
    converter.wrap_links = False
    converter.single_line_break = False
    if base_url:
        converter.baseurl = base_url
    md = converter.handle(html)
    # Clean up excessive blank lines
    md = re.sub(r'\n{3,}', '\n\n', md)
    return md.strip()


def extract_metadata(html):
    """Extract page metadata (title, description, etc.)."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    meta = {}
    # Title
    title_tag = soup.find("title")
    if title_tag:
        meta["title"] = title_tag.get_text(strip=True)
    # Meta description
    desc_tag = soup.find("meta", attrs={"name": "description"})
    if desc_tag and desc_tag.get("content"):
        meta["description"] = desc_tag["content"].strip()
    # OG tags
    for prop in ["og:title", "og:description", "og:type", "og:site_name"]:
        tag = soup.find("meta", attrs={"property": prop})
        if tag and tag.get("content"):
            meta[prop.replace("og:", "og_")] = tag["content"].strip()
    # Author
    author_tag = soup.find("meta", attrs={"name": "author"})
    if author_tag and author_tag.get("content"):
        meta["author"] = author_tag["content"].strip()
    # Published date
    for attr in ["article:published_time", "datePublished", "date"]:
        date_tag = soup.find("meta", attrs={"property": attr}) or soup.find("meta", attrs={"name": attr})
        if date_tag and date_tag.get("content"):
            meta["published"] = date_tag["content"].strip()
            break
    return meta


def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Fetch a web page and extract content as Markdown"
    )
    parser.add_argument("url", help="URL to fetch")
    parser.add_argument("--raw", action="store_true",
                        help="Output full page Markdown (no readability extraction)")
    parser.add_argument("--selector", type=str, default=None,
                        help="CSS selector to extract specific elements")
    parser.add_argument("--save", type=str, default=None,
                        help="Also save output to this file path")
    parser.add_argument("--max-length", type=int, default=None,
                        help="Truncate output to N characters")
    parser.add_argument("--timeout", type=int, default=30,
                        help="Request timeout in seconds (default: 30)")
    parser.add_argument("--no-metadata", action="store_true",
                        help="Skip metadata header in output")
    args = parser.parse_args()

    # Normalize URL
    url = args.url.strip()
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    print(f"Fetching: {url}", file=sys.stderr)

    # Fetch
    html, final_url, status = fetch_url(url, timeout=args.timeout)
    # len() of the decoded text counts characters, not bytes.
    print(f"Status: {status}, Size: {len(html)} characters", file=sys.stderr)
    if final_url != url:
        print(f"Redirected to: {final_url}", file=sys.stderr)

    # Extract metadata
    meta = extract_metadata(html) if not args.no_metadata else {}

    # Extract content
    if args.selector:
        # CSS selector mode
        selected_html = extract_with_selector(html, args.selector)
        if not selected_html:
            print(f"Warning: no elements matched selector '{args.selector}'", file=sys.stderr)
            print(f"[No elements matched CSS selector: {args.selector}]")
            sys.exit(0)
        title = meta.get("title", "")
        content_md = html_to_markdown(selected_html, base_url=final_url)
    elif args.raw:
        # Raw full-page mode
        title = meta.get("title", "")
        content_md = html_to_markdown(html, base_url=final_url)
    else:
        # Readability extraction mode (default)
        title, article_html = extract_with_readability(html, final_url)
        content_md = html_to_markdown(article_html, base_url=final_url)

    # Build output
    parts = []
    if not args.no_metadata and meta:
        parts.append(f"# {title or meta.get('title', 'Untitled')}")
        parts.append(f"\n**Source**: {final_url}")
        if meta.get("author"):
            parts.append(f"**Author**: {meta['author']}")
        if meta.get("published"):
            parts.append(f"**Published**: {meta['published']}")
        if meta.get("description"):
            parts.append(f"**Description**: {meta['description']}")
        parts.append("\n---\n")
    elif title and not args.no_metadata:
        parts.append(f"# {title}\n")
    parts.append(content_md)
    output = "\n".join(parts)

    # Truncate if requested
    if args.max_length and len(output) > args.max_length:
        output = (
            output[:args.max_length]
            + f"\n\n[... truncated at {args.max_length} characters, total {len(output)}]"
        )

    # Print to stdout
    print(output)
    content_length = len(content_md)
    print(f"\nExtracted: {content_length} characters", file=sys.stderr)

    # Save to file if requested
    if args.save:
        try:
            with open(args.save, "w", encoding="utf-8") as f:
                f.write(output)
            print(f"Saved to: {args.save}", file=sys.stderr)
        except Exception as e:
            print(f"Error saving file: {e}", file=sys.stderr)


if __name__ == "__main__":
    main()


@@ -0,0 +1,155 @@
#!/usr/bin/env python3
"""Search the web using DuckDuckGo and return structured results.
No API key required. Returns results with title, URL, and snippet.
Dependencies: pip install ddgs
"""
import sys
import argparse
import json


def setup_encoding():
    """Set up UTF-8 encoding for Windows console output."""
    if sys.platform == "win32":
        import io
        try:
            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
        except (AttributeError, io.UnsupportedOperation):
            sys.stdout = io.TextIOWrapper(
                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )
            sys.stderr = io.TextIOWrapper(
                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
            )


def check_dependencies():
    """Check that required packages are installed."""
    try:
        from ddgs import DDGS  # noqa: F401
    except ImportError:
        try:
            from duckduckgo_search import DDGS  # noqa: F401
        except ImportError:
            print("Error: ddgs not installed.", file=sys.stderr)
            print("Install with: pip install ddgs", file=sys.stderr)
            sys.exit(1)


def _get_ddgs_class():
    """Import DDGS from ddgs (new) or duckduckgo_search (legacy)."""
    try:
        from ddgs import DDGS
        return DDGS
    except ImportError:
        from duckduckgo_search import DDGS
        return DDGS


def search_text(query, max_results=10, region="wt-wt"):
    """Perform a text search."""
    DDGS = _get_ddgs_class()
    ddgs = DDGS()
    try:
        # New ddgs package: positional 'query' arg
        results = list(ddgs.text(query, region=region, max_results=max_results))
    except TypeError:
        # Legacy duckduckgo_search: 'keywords' kwarg + context manager
        with DDGS() as d:
            results = list(d.text(keywords=query, region=region, max_results=max_results))
    return results


def search_news(query, max_results=10, region="wt-wt"):
    """Perform a news search."""
    DDGS = _get_ddgs_class()
    ddgs = DDGS()
    try:
        results = list(ddgs.news(query, region=region, max_results=max_results))
    except TypeError:
        with DDGS() as d:
            results = list(d.news(keywords=query, region=region, max_results=max_results))
    return results


def format_results_markdown(results, query, is_news=False):
    """Format search results as Markdown."""
    search_type = "News" if is_news else "Web"
    parts = [f"# {search_type} Search Results: {query}\n"]
    parts.append(f"Found **{len(results)}** results.\n")
    for i, r in enumerate(results, 1):
        title = r.get("title", "Untitled")
        url = r.get("href") or r.get("url") or r.get("link", "")
        body = r.get("body") or r.get("snippet", "")
        date = r.get("date", "")
        parts.append(f"## {i}. {title}")
        parts.append(f"**URL**: {url}")
        if date:
            parts.append(f"**Date**: {date}")
        if body:
            parts.append(f"\n{body}")
        parts.append("")  # blank line
    return "\n".join(parts)


def format_results_json(results):
    """Format search results as JSON."""
    return json.dumps(results, indent=2, ensure_ascii=False)


def main():
    setup_encoding()
    check_dependencies()

    parser = argparse.ArgumentParser(
        description="Search the web via DuckDuckGo"
    )
    parser.add_argument("query", help="Search query")
    parser.add_argument("--max-results", type=int, default=10,
                        help="Number of results (default: 10)")
    parser.add_argument("--region", type=str, default="wt-wt",
                        help="Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt)")
    parser.add_argument("--news", action="store_true",
                        help="Search news instead of general web")
    parser.add_argument("--json", action="store_true",
                        help="Output as JSON instead of Markdown")
    args = parser.parse_args()

    query = args.query.strip()
    if not query:
        print("Error: empty query", file=sys.stderr)
        sys.exit(1)
    print(f"Searching: {query} (region={args.region}, max={args.max_results})", file=sys.stderr)

    try:
        if args.news:
            results = search_news(query, args.max_results, args.region)
        else:
            results = search_text(query, args.max_results, args.region)
    except Exception as e:
        print(f"Error: search failed: {e}", file=sys.stderr)
        sys.exit(1)

    if not results:
        print(f"No results found for: {query}")
        sys.exit(0)
    print(f"Got {len(results)} results", file=sys.stderr)

    if args.json:
        print(format_results_json(results))
    else:
        print(format_results_markdown(results, query, is_news=args.news))


if __name__ == "__main__":
    main()