AI Newsletter Digest improvements: fixed QP soft line break decoding, URL extraction, and content cleaning

2026-03-04 13:29:22 +00:00
parent 29a98137a7
commit 57dd294675
13706 changed files with 2114953 additions and 237629 deletions
--- a/skills/aidotnet-web-scraper/SKILL.md
+++ b/skills/aidotnet-web-scraper/SKILL.md
@@ -0,0 +1,157 @@
+---
+name: web-scraper
+description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
+compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs.
+---
+
+# Web Scraper
+
+Fetch, search, and extract content from websites.
+
+## When to use this skill
+
+- User asks to fetch or read a webpage / URL
+- User wants to search the internet for information
+- User needs to extract links, tables, or structured data from a website
+- User asks to crawl a JavaScript-rendered (dynamic) page
+- User wants web content converted to clean Markdown for analysis
+
+## Scripts overview
+
+| Script | Purpose | Dependencies |
+|---|---|---|
+| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | `requests`, `beautifulsoup4`, `readability-lxml`, `html2text` |
+| `search_web.py` | Search the web via DuckDuckGo | `ddgs` |
+| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | `crawl4ai` |
+| `extract_links.py` | Extract and categorize all links from a page | `requests`, `beautifulsoup4` |
+
+## Steps
+
+### 1. Install dependencies (first time only)
+
+For lightweight scraping (static pages, search, link extraction):
+```bash
+pip install requests beautifulsoup4 readability-lxml html2text ddgs
+```
+
+For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
+```bash
+pip install crawl4ai
+crawl4ai-setup
+```
+
+> **Note**: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.
+
+> **CRITICAL — Dependency Error Recovery**: If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the command above, then **re-run the EXACT SAME script command that failed**. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
+
+### 2. Fetch a web page (static — recommended first choice)
+
+Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.
+
+```bash
+python scripts/fetch_page.py "URL"
+```
+
+Options:
+- `--raw` — Output full page Markdown instead of extracted article content
+- `--selector "CSS_SELECTOR"` — Extract only elements matching the CSS selector (e.g. `".article-body"`, `"table"`, `"#content"`)
+- `--save OUTPUT_PATH` — Also save output to a file
+- `--max-length N` — Truncate output to N characters (default: no limit)
+
+Examples:
+```bash
+# Fetch an article
+python scripts/fetch_page.py "https://example.com/article"
+
+# Extract only tables
+python scripts/fetch_page.py "https://example.com/data" --selector "table"
+
+# Fetch raw full-page markdown, limit to 5000 chars
+python scripts/fetch_page.py "https://example.com" --raw --max-length 5000
+```
+
+### 3. Search the web
+
+Search using DuckDuckGo (no API key required).
+
+```bash
+python scripts/search_web.py "search query"
+```
+
+Options:
+- `--max-results N` — Number of results to return (default: 10)
+- `--region REGION` — Region code, e.g. `cn-zh`, `us-en`, `jp-jp` (default: `wt-wt` for worldwide)
+- `--news` — Search news instead of general web
+
+Examples:
+```bash
+# General search
+python scripts/search_web.py "Python web scraping best practices 2025"
+
+# News search, Chinese region, 5 results
+python scripts/search_web.py "AI 最新进展" --news --region cn-zh --max-results 5
+```
+
+### 4. Crawl a dynamic / JavaScript-rendered page
+
+Use this only when `fetch_page.py` returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).
+
+```bash
+python scripts/crawl_dynamic.py "URL"
+```
+
+Options:
+- `--wait N` — Wait N seconds after page load for JS to finish (default: 3)
+- `--selector "CSS_SELECTOR"` — Wait for a specific element to appear before extracting
+- `--scroll` — Scroll to bottom of page to trigger lazy loading
+- `--save OUTPUT_PATH` — Also save output to a file
+- `--max-length N` — Truncate output to N characters
+
+### 5. Extract links from a page
+
+Extract all links with their text labels, categorized by type (internal, external, resource).
+
+```bash
+python scripts/extract_links.py "URL"
+```
+
+Options:
+- `--filter PATTERN` — Only show links matching a regex pattern (applied to URL)
+- `--external-only` — Only show external links
+- `--json` — Output as JSON instead of Markdown
+
+## Decision guide: which script to use
+
+1. **Start with `fetch_page.py`**: it handles most static sites (articles, docs, blogs, wikis).
+2. If `fetch_page.py` returns empty/garbled content → try **`crawl_dynamic.py`** (the page likely needs JavaScript).
+3. Need to find URLs first? → Use **`search_web.py`** to discover relevant pages.
+4. Need to navigate a site structure? → Use **`extract_links.py`** to map out links, then fetch individual pages.
+
+## Common workflows
+
+### Research a topic
+1. `search_web.py "topic"` → get relevant URLs
+2. `fetch_page.py "best_url"` → read the content
+3. Repeat for multiple sources, then synthesize
+
+### Scrape structured data from a page
+1. `fetch_page.py "url" --selector "table"` → extract tables
+2. Or `fetch_page.py "url" --selector ".product-card"` → extract specific elements
+
+### Crawl a modern web app (SPA)
+1. `crawl_dynamic.py "url" --wait 5 --scroll` → full JS-rendered content
+
+## Edge cases
+
+- **Paywalled sites**: May return partial content or login pages. Inform the user.
+- **Rate limiting / CAPTCHAs**: If requests fail with 403/429, wait and retry or inform the user.
+- **Very large pages**: Use `--max-length` to truncate output and avoid overwhelming the context window.
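+- **Encoding issues**: Output is UTF-8. When a response declares a non-UTF-8 charset, `fetch_page.py` and `extract_links.py` fall back to the encoding that `requests` detects from the body. Exotic encodings may still need manual adjustment.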
+- **Robots.txt**: These scripts do not check robots.txt. Use responsibly and respect website terms of service.
+
+## Scripts
+
+- [fetch_page.py](scripts/fetch_page.py) — Fetch and extract readable content as Markdown
+- [search_web.py](scripts/search_web.py) — Search the web via DuckDuckGo
+- [crawl_dynamic.py](scripts/crawl_dynamic.py) — Crawl JavaScript-rendered pages
+- [extract_links.py](scripts/extract_links.py) — Extract and categorize page links
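A minimal sketch, not part of this commit, of how the decision guide in SKILL.md above could be automated: try the static fetcher first and fall back to the headless crawler when the output looks empty. It assumes the scripts live under `scripts/` relative to the working directory. The names `run_script`, `looks_empty`, and `fetch_with_fallback`, and the 200-character threshold, are hypothetical choices for illustration. The two placeholder strings are the ones `fetch_page.py` and `crawl_dynamic.py` print when nothing is extracted.

```python
import subprocess
import sys

# Placeholder lines the scripts print to stdout when nothing was extracted.
EMPTY_MARKERS = ("[No content could be extracted", "[No elements matched")


def run_script(script, url, *flags):
    """Run one of the skill's scripts and return (exit_code, stdout)."""
    proc = subprocess.run(
        [sys.executable, f"scripts/{script}", url, *flags],
        capture_output=True,
        encoding="utf-8",
    )
    return proc.returncode, proc.stdout


def looks_empty(text, min_chars=200):
    """Heuristic (assumed threshold): too short, or a placeholder line."""
    body = text.strip()
    return len(body) < min_chars or body.startswith(EMPTY_MARKERS)


def fetch_with_fallback(url, run=run_script):
    """Try the static fetcher first, then the headless-browser crawler."""
    code, out = run("fetch_page.py", url)
    if code == 0 and not looks_empty(out):
        return "static", out
    code, out = run("crawl_dynamic.py", url, "--wait", "5", "--scroll")
    if code == 0 and not looks_empty(out):
        return "dynamic", out
    return "failed", out
```

The `run` parameter exists only so the fallback logic can be exercised without network access. Because `fetch_page.py` prepends a metadata header, a length threshold alone can miss pages whose body is empty, so treat the heuristic as a starting point.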
--- a/skills/aidotnet-web-scraper/scripts/pycache/fetch_page.cpython-313.pyc
+++ b/skills/aidotnet-web-scraper/scripts/pycache/fetch_page.cpython-313.pyc
--- a/skills/aidotnet-web-scraper/scripts/crawl_dynamic.py
+++ b/skills/aidotnet-web-scraper/scripts/crawl_dynamic.py
@@ -0,0 +1,254 @@
+#!/usr/bin/env python3
+"""Crawl JavaScript-rendered (dynamic) web pages using Crawl4AI.
+
+Uses a headless Chromium browser to render pages that require JavaScript,
+then extracts clean Markdown content. Use this when fetch_page.py returns
+empty or incomplete content (SPAs, React/Vue apps, etc.).
+
+Dependencies: pip install crawl4ai && crawl4ai-setup
+"""
+
+import sys
+import argparse
+import asyncio
+
+
+def setup_encoding():
+    """Setup proper encoding for Windows console output."""
+    if sys.platform == "win32":
+        import io
+        try:
+            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
+            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
+        except (AttributeError, io.UnsupportedOperation):
+            sys.stdout = io.TextIOWrapper(
+                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+            sys.stderr = io.TextIOWrapper(
+                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+
+
+def check_dependencies():
+    """Check that Crawl4AI is installed."""
+    try:
+        import crawl4ai  # noqa: F401
+    except ImportError:
+        print("Error: crawl4ai not installed.", file=sys.stderr)
+        print("Install with:", file=sys.stderr)
+        print("  pip install crawl4ai", file=sys.stderr)
+        print("  crawl4ai-setup", file=sys.stderr)
+        sys.exit(1)
+
+
+async def crawl_page(url, wait_seconds=3, css_selector=None):
+    """Crawl a page with headless browser and return Markdown content."""
+    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+    from crawl4ai.content_filter_strategy import PruningContentFilter
+    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+    browser_conf = BrowserConfig(
+        headless=True,
+        verbose=False,
+    )
+
+    md_generator = DefaultMarkdownGenerator(
+        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
+    )
+
+    run_conf = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        markdown_generator=md_generator,
+        page_timeout=60000,  # 60s
+        wait_until="networkidle",
+    )
+
+    # Add wait time if specified
+    if wait_seconds > 0:
+        run_conf.delay_before_return_html = wait_seconds
+
+    # Wait for specific CSS selector
+    if css_selector:
+        run_conf.wait_for = f"css:{css_selector}"
+
+    async with AsyncWebCrawler(config=browser_conf) as crawler:
+        result = await crawler.arun(url=url, config=run_conf)
+
+        if not result.success:
+            return None, result.error_message or "Unknown error"
+
+        # Get the best available markdown
+        md = ""
+        if result.markdown:
+            if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
+                md = result.markdown.fit_markdown
+            elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown:
+                md = result.markdown.raw_markdown
+            elif isinstance(result.markdown, str):
+                md = result.markdown
+
+        title = ""
+        if hasattr(result, 'metadata') and result.metadata:
+            title = result.metadata.get('title', '')
+
+        return {
+            "title": title,
+            "url": result.url or url,
+            "markdown": md,
+            "status_code": getattr(result, 'status_code', None),
+        }, None
+
+
+async def crawl_with_scroll(url, wait_seconds=3, css_selector=None):
+    """Crawl with infinite scroll support."""
+    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+    from crawl4ai.content_filter_strategy import PruningContentFilter
+
+    browser_conf = BrowserConfig(
+        headless=True,
+        verbose=False,
+    )
+
+    # JavaScript to scroll to bottom (immediately invoked, so it runs even when evaluated as a statement)
+    scroll_js = """
+    (async () => {
+        await new Promise((resolve) => {
+            let totalHeight = 0;
+            const distance = 500;
+            const timer = setInterval(() => {
+                const scrollHeight = document.body.scrollHeight;
+                window.scrollBy(0, distance);
+                totalHeight += distance;
+                if (totalHeight >= scrollHeight) {
+                    clearInterval(timer);
+                    resolve();
+                }
+            }, 300);
+            // Safety timeout
+            setTimeout(() => { clearInterval(timer); resolve(); }, 15000);
+        });
+    })()
+    """
+
+    md_generator = DefaultMarkdownGenerator(
+        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
+    )
+
+    run_conf = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        markdown_generator=md_generator,
+        page_timeout=60000,
+        js_code=scroll_js,
+        wait_until="networkidle",
+    )
+
+    if wait_seconds > 0:
+        run_conf.delay_before_return_html = wait_seconds
+
+    if css_selector:
+        run_conf.wait_for = f"css:{css_selector}"
+
+    async with AsyncWebCrawler(config=browser_conf) as crawler:
+        result = await crawler.arun(url=url, config=run_conf)
+
+        if not result.success:
+            return None, result.error_message or "Unknown error"
+
+        md = ""
+        if result.markdown:
+            if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
+                md = result.markdown.fit_markdown
+            elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown:
+                md = result.markdown.raw_markdown
+            elif isinstance(result.markdown, str):
+                md = result.markdown
+
+        title = ""
+        if hasattr(result, 'metadata') and result.metadata:
+            title = result.metadata.get('title', '')
+
+        return {
+            "title": title,
+            "url": result.url or url,
+            "markdown": md,
+            "status_code": getattr(result, 'status_code', None),
+        }, None
+
+
+def main():
+    setup_encoding()
+    check_dependencies()
+
+    parser = argparse.ArgumentParser(
+        description="Crawl JavaScript-rendered pages with headless browser"
+    )
+    parser.add_argument("url", help="URL to crawl")
+    parser.add_argument("--wait", type=int, default=3,
+                        help="Seconds to wait after page load (default: 3)")
+    parser.add_argument("--selector", type=str, default=None,
+                        help="CSS selector to wait for before extracting")
+    parser.add_argument("--scroll", action="store_true",
+                        help="Scroll to bottom to trigger lazy loading")
+    parser.add_argument("--save", type=str, default=None,
+                        help="Also save output to this file path")
+    parser.add_argument("--max-length", type=int, default=None,
+                        help="Truncate output to N characters")
+
+    args = parser.parse_args()
+
+    url = args.url.strip()
+    if not url.startswith(("http://", "https://")):
+        url = "https://" + url
+
+    print(f"Crawling (dynamic): {url}", file=sys.stderr)
+    print(f"Options: wait={args.wait}s, selector={args.selector}, scroll={args.scroll}", file=sys.stderr)
+
+    # Run async crawl
+    if args.scroll:
+        data, error = asyncio.run(crawl_with_scroll(url, args.wait, args.selector))
+    else:
+        data, error = asyncio.run(crawl_page(url, args.wait, args.selector))
+
+    if error:
+        print(f"Error: crawl failed: {error}", file=sys.stderr)
+        sys.exit(1)
+
+    if not data or not data["markdown"]:
+        print("Warning: no content extracted from page", file=sys.stderr)
+        print("[No content could be extracted from this page]")
+        sys.exit(0)
+
+    # Build output
+    parts = []
+    if data["title"]:
+        parts.append(f"# {data['title']}\n")
+    parts.append(f"**Source**: {data['url']}")
+    if data.get("status_code"):
+        parts.append(f"**Status**: {data['status_code']}")
+    parts.append("\n---\n")
+    parts.append(data["markdown"])
+
+    output = "\n".join(parts)
+
+    # Truncate if requested
+    if args.max_length and len(output) > args.max_length:
+        output = output[:args.max_length] + f"\n\n[... truncated at {args.max_length} characters, total {len(output)}]"
+
+    print(output)
+
+    content_len = len(data["markdown"])
+    print(f"\nExtracted: {content_len} characters (dynamic crawl)", file=sys.stderr)
+
+    # Save to file if requested
+    if args.save:
+        try:
+            with open(args.save, "w", encoding="utf-8") as f:
+                f.write(output)
+            print(f"Saved to: {args.save}", file=sys.stderr)
+        except Exception as e:
+            print(f"Error saving file: {e}", file=sys.stderr)
+
+
+if __name__ == "__main__":
+    main()
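`crawl_page` and `crawl_with_scroll` above repeat the same fallback chain for choosing which Markdown to return. A minimal sketch of a shared helper follows, assuming only what the diff shows about the result object (optional `fit_markdown` and `raw_markdown` attributes, or a plain string). `pick_markdown` is a hypothetical name, and `SimpleNamespace` stands in for the Crawl4AI result in the usage lines.

```python
from types import SimpleNamespace


def pick_markdown(markdown):
    """Return filtered Markdown if present, else raw Markdown, else the string itself."""
    if not markdown:
        return ""
    # Attribute checks come first, matching the order used in the scripts.
    for attr in ("fit_markdown", "raw_markdown"):
        value = getattr(markdown, attr, None)
        if value:
            return value
    return markdown if isinstance(markdown, str) else ""


# Stand-ins for a crawl result's `markdown` field (hypothetical test data).
filtered = SimpleNamespace(fit_markdown="fit text", raw_markdown="raw text")
unfiltered = SimpleNamespace(fit_markdown="", raw_markdown="raw text")
print(pick_markdown(filtered))
print(pick_markdown(unfiltered))
print(pick_markdown("plain string"))
```

Both crawl functions could then call `pick_markdown(result.markdown)`, which keeps the fallback order in one place.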
--- a/skills/aidotnet-web-scraper/scripts/extract_links.py
+++ b/skills/aidotnet-web-scraper/scripts/extract_links.py
@@ -0,0 +1,236 @@
+#!/usr/bin/env python3
+"""Extract and categorize all links from a web page.
+
+Fetches the page and extracts all <a> tags, categorizing them as
+internal, external, or resource links. Useful for site navigation
+and discovery before deeper scraping.
+
+Dependencies: pip install requests beautifulsoup4
+"""
+
+import sys
+import argparse
+import json
+import re
+from urllib.parse import urlparse, urljoin
+
+
+def setup_encoding():
+    """Setup proper encoding for Windows console output."""
+    if sys.platform == "win32":
+        import io
+        try:
+            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
+            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
+        except (AttributeError, io.UnsupportedOperation):
+            sys.stdout = io.TextIOWrapper(
+                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+            sys.stderr = io.TextIOWrapper(
+                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+
+
+def check_dependencies():
+    """Check that required packages are installed."""
+    missing = []
+    try:
+        import requests  # noqa: F401
+    except ImportError:
+        missing.append("requests")
+    try:
+        from bs4 import BeautifulSoup  # noqa: F401
+    except ImportError:
+        missing.append("beautifulsoup4")
+    if missing:
+        print(f"Error: missing dependencies: {', '.join(missing)}", file=sys.stderr)
+        print(f"Install with: pip install {' '.join(missing)}", file=sys.stderr)
+        sys.exit(1)
+
+
+RESOURCE_EXTENSIONS = {
+    '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx',
+    '.zip', '.rar', '.tar', '.gz', '.7z',
+    '.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico',
+    '.mp3', '.mp4', '.avi', '.mov', '.webm',
+    '.css', '.js', '.woff', '.woff2', '.ttf', '.eot',
+}
+
+
+def classify_link(href, base_domain):
+    """Classify a link as internal, external, or resource."""
+    parsed = urlparse(href)
+
+    # Check for resource files
+    path_lower = parsed.path.lower()
+    for ext in RESOURCE_EXTENSIONS:
+        if path_lower.endswith(ext):
+            return "resource"
+
+    # Check domain
+    link_domain = parsed.netloc.lower()
+    if not link_domain or link_domain == base_domain:
+        return "internal"
+
+    # Check for common CDN / same-org subdomains
+    base_parts = base_domain.split(".")
+    link_parts = link_domain.split(".")
+    if len(base_parts) >= 2 and len(link_parts) >= 2:
+        if base_parts[-2:] == link_parts[-2:]:
+            return "internal"
+
+    return "external"
+
+
+def extract_links(html, base_url):
+    """Extract all links from HTML."""
+    from bs4 import BeautifulSoup
+
+    soup = BeautifulSoup(html, "html.parser")
+    base_domain = urlparse(base_url).netloc.lower()
+    links = []
+    seen = set()
+
+    for a_tag in soup.find_all("a", href=True):
+        href = a_tag["href"].strip()
+
+        # Skip anchors, javascript:, mailto:, tel:
+        if not href or href.lower().startswith(("#", "javascript:", "mailto:", "tel:")):
+            continue
+
+        # Resolve relative URLs
+        full_url = urljoin(base_url, href)
+
+        # Deduplicate
+        if full_url in seen:
+            continue
+        seen.add(full_url)
+
+        # Extract link text
+        text = a_tag.get_text(strip=True) or ""
+        text = re.sub(r'\s+', ' ', text)  # normalize whitespace
+        if len(text) > 100:
+            text = text[:100] + "..."
+
+        link_type = classify_link(full_url, base_domain)
+
+        links.append({
+            "url": full_url,
+            "text": text,
+            "type": link_type,
+        })
+
+    return links
+
+
+def format_markdown(links, url, filter_pattern=None, external_only=False):
+    """Format links as Markdown."""
+    # Apply filters
+    filtered = links
+    if external_only:
+        filtered = [link for link in filtered if link["type"] == "external"]
+    if filter_pattern:
+        try:
+            pattern = re.compile(filter_pattern, re.IGNORECASE)
+            filtered = [link for link in filtered if pattern.search(link["url"])]
+        except re.error as e:
+            print(f"Warning: invalid regex pattern '{filter_pattern}': {e}", file=sys.stderr)
+
+    # Group by type
+    internal = [link for link in filtered if link["type"] == "internal"]
+    external = [link for link in filtered if link["type"] == "external"]
+    resources = [link for link in filtered if link["type"] == "resource"]
+
+    parts = [f"# Links from {url}\n"]
+    parts.append(f"Total: **{len(filtered)}** links ({len(internal)} internal, {len(external)} external, {len(resources)} resource)\n")
+
+    if internal:
+        parts.append("## Internal Links\n")
+        for lk in internal:
+            text = f" — {lk['text']}" if lk['text'] else ""
+            parts.append(f"- {lk['url']}{text}")
+        parts.append("")
+
+    if external:
+        parts.append("## External Links\n")
+        for lk in external:
+            text = f" — {lk['text']}" if lk['text'] else ""
+            parts.append(f"- {lk['url']}{text}")
+        parts.append("")
+
+    if resources:
+        parts.append("## Resource Links\n")
+        for lk in resources:
+            text = f" — {lk['text']}" if lk['text'] else ""
+            parts.append(f"- {lk['url']}{text}")
+        parts.append("")
+
+    return "\n".join(parts)
+
+
+def main():
+    setup_encoding()
+    check_dependencies()
+
+    parser = argparse.ArgumentParser(
+        description="Extract and categorize links from a web page"
+    )
+    parser.add_argument("url", help="URL to extract links from")
+    parser.add_argument("--filter", type=str, default=None,
+                        help="Regex pattern to filter URLs")
+    parser.add_argument("--external-only", action="store_true",
+                        help="Only show external links")
+    parser.add_argument("--json", action="store_true",
+                        help="Output as JSON instead of Markdown")
+    parser.add_argument("--timeout", type=int, default=30,
+                        help="Request timeout in seconds (default: 30)")
+
+    args = parser.parse_args()
+
+    import requests
+
+    url = args.url.strip()
+    if not url.startswith(("http://", "https://")):
+        url = "https://" + url
+
+    print(f"Extracting links from: {url}", file=sys.stderr)
+
+    headers = {
+        "User-Agent": (
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+        ),
+    }
+
+    try:
+        resp = requests.get(url, headers=headers, timeout=args.timeout, allow_redirects=True)
+        resp.raise_for_status()
+        if resp.encoding and resp.encoding.lower() != 'utf-8':
+            resp.encoding = resp.apparent_encoding or resp.encoding
+        html = resp.text
+        final_url = resp.url
+    except requests.exceptions.RequestException as e:
+        print(f"Error: {e}", file=sys.stderr)
+        sys.exit(1)
+
+    links = extract_links(html, final_url)
+    print(f"Found {len(links)} unique links", file=sys.stderr)
+
+    if args.json:
+        # Apply filters for JSON output too
+        filtered = links
+        if args.external_only:
+            filtered = [lk for lk in filtered if lk["type"] == "external"]
+        if args.filter:
+            try:
+                pattern = re.compile(args.filter, re.IGNORECASE)
+                filtered = [lk for lk in filtered if pattern.search(lk["url"])]
+            except re.error as e:
+                print(f"Warning: invalid regex pattern '{args.filter}': {e}", file=sys.stderr)
+        print(json.dumps(filtered, indent=2, ensure_ascii=False))
+    else:
+        print(format_markdown(links, final_url, args.filter, args.external_only))
+
+
+if __name__ == "__main__":
+    main()
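`classify_link` above treats two hosts as the same site when their last two labels match, which also groups unrelated sites under suffixes such as `co.uk`. A minimal sketch of one way to tighten this follows, assuming a small hand-written suffix set. `MULTI_PART_SUFFIXES`, `registrable_domain`, and `same_site` are hypothetical names, and the set is illustrative, not exhaustive. A complete solution would consult the Public Suffix List.

```python
# Illustrative only: a real implementation would load the Public Suffix List.
MULTI_PART_SUFFIXES = {"co.uk", "com.au", "co.jp", "com.cn", "com.br"}


def registrable_domain(host):
    """Return the part of a hostname that identifies the owning site."""
    parts = host.lower().split(".")
    # Keep three labels when the last two form a known multi-part suffix.
    if len(parts) >= 3 and ".".join(parts[-2:]) in MULTI_PART_SUFFIXES:
        return ".".join(parts[-3:])
    return ".".join(parts[-2:])


def same_site(host_a, host_b):
    """True when both hosts share a registrable domain."""
    return registrable_domain(host_a) == registrable_domain(host_b)


print(same_site("docs.example.com", "cdn.example.com"))
print(same_site("shop.example.co.uk", "other.co.uk"))
```

The first call compares two subdomains of one site. The second compares two different sites that happen to share the `co.uk` suffix, which the two-label comparison in `classify_link` would report as internal. Hosts that include a port are not handled here.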
--- a/skills/aidotnet-web-scraper/scripts/fetch_page.py
+++ b/skills/aidotnet-web-scraper/scripts/fetch_page.py
@@ -0,0 +1,284 @@
+#!/usr/bin/env python3
+"""Fetch a web page and extract readable content as clean Markdown.
+
+Uses requests + BeautifulSoup + readability-lxml + html2text for lightweight,
+fast extraction without a headless browser. Works well for articles, docs,
+blogs, wikis, and most static websites.
+
+Dependencies: pip install requests beautifulsoup4 readability-lxml html2text
+"""
+
+import sys
+import argparse
+
+
+def setup_encoding():
+    """Setup proper encoding for Windows console output."""
+    if sys.platform == "win32":
+        import io
+        try:
+            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
+            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
+        except (AttributeError, io.UnsupportedOperation):
+            sys.stdout = io.TextIOWrapper(
+                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+            sys.stderr = io.TextIOWrapper(
+                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+
+
+def check_dependencies():
+    """Check that required packages are installed."""
+    missing = []
+    try:
+        import requests  # noqa: F401
+    except ImportError:
+        missing.append("requests")
+    try:
+        from bs4 import BeautifulSoup  # noqa: F401
+    except ImportError:
+        missing.append("beautifulsoup4")
+    try:
+        from readability import Document  # noqa: F401
+    except ImportError:
+        missing.append("readability-lxml")
+    try:
+        import html2text  # noqa: F401
+    except ImportError:
+        missing.append("html2text")
+
+    if missing:
+        print(f"Error: missing dependencies: {', '.join(missing)}", file=sys.stderr)
+        print(f"Install with: pip install {' '.join(missing)}", file=sys.stderr)
+        sys.exit(1)
+
+
+def fetch_url(url, timeout=30):
+    """Fetch URL content with proper headers."""
+    import requests
+
+    headers = {
+        "User-Agent": (
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+        ),
+        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
+        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
+        "Accept-Encoding": "gzip, deflate",  # no "br": decoding Brotli needs the optional brotli package
+    }
+
+    try:
+        resp = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
+        resp.raise_for_status()
+
+        # Detect encoding
+        if resp.encoding and resp.encoding.lower() != 'utf-8':
+            resp.encoding = resp.apparent_encoding or resp.encoding
+
+        return resp.text, resp.url, resp.status_code
+    except requests.exceptions.Timeout:
+        print(f"Error: request timed out after {timeout}s", file=sys.stderr)
+        sys.exit(1)
+    except requests.exceptions.ConnectionError as e:
+        print(f"Error: connection failed: {e}", file=sys.stderr)
+        sys.exit(1)
+    except requests.exceptions.HTTPError as e:
+        print(f"Error: HTTP {e.response.status_code}: {e}", file=sys.stderr)
+        sys.exit(1)
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)
+        sys.exit(1)
+
+
+def extract_with_readability(html, url):
+    """Extract main article content using readability-lxml."""
+    from readability import Document
+
+    doc = Document(html, url=url)
+    title = doc.short_title()
+    content_html = doc.summary()
+    return title, content_html
+
+
+def extract_with_selector(html, selector):
+    """Extract content matching a CSS selector."""
+    from bs4 import BeautifulSoup
+
+    soup = BeautifulSoup(html, "html.parser")
+    elements = soup.select(selector)
+    if not elements:
+        return None
+
+    # Combine all matching elements
+    parts = []
+    for el in elements:
+        parts.append(str(el))
+    return "\n".join(parts)
+
+
+def html_to_markdown(html, base_url=None):
+    """Convert HTML to clean Markdown."""
+    import html2text
+
+    converter = html2text.HTML2Text()
+    converter.body_width = 0  # Don't wrap lines
+    converter.ignore_images = False
+    converter.ignore_links = False
+    converter.ignore_emphasis = False
+    converter.protect_links = True
+    converter.unicode_snob = True
+    converter.mark_code = True
+    converter.wrap_links = False
+    converter.single_line_break = False
+
+    if base_url:
+        converter.baseurl = base_url
+
+    md = converter.handle(html)
+
+    # Clean up excessive blank lines
+    import re
+    md = re.sub(r'\n{3,}', '\n\n', md)
+    return md.strip()
+
+
+def extract_metadata(html):
+    """Extract page metadata (title, description, etc.)."""
+    from bs4 import BeautifulSoup
+
+    soup = BeautifulSoup(html, "html.parser")
+    meta = {}
+
+    # Title
+    title_tag = soup.find("title")
+    if title_tag:
+        meta["title"] = title_tag.get_text(strip=True)
+
+    # Meta description
+    desc_tag = soup.find("meta", attrs={"name": "description"})
+    if desc_tag and desc_tag.get("content"):
+        meta["description"] = desc_tag["content"].strip()
+
+    # OG tags
+    for prop in ["og:title", "og:description", "og:type", "og:site_name"]:
+        tag = soup.find("meta", attrs={"property": prop})
+        if tag and tag.get("content"):
+            meta[prop.replace("og:", "og_")] = tag["content"].strip()
+
+    # Author
+    author_tag = soup.find("meta", attrs={"name": "author"})
+    if author_tag and author_tag.get("content"):
+        meta["author"] = author_tag["content"].strip()
+
+    # Published date
+    for attr in ["article:published_time", "datePublished", "date"]:
+        date_tag = soup.find("meta", attrs={"property": attr}) or soup.find("meta", attrs={"name": attr})
+        if date_tag and date_tag.get("content"):
+            meta["published"] = date_tag["content"].strip()
+            break
+
+    return meta
+
+
+def main():
+    setup_encoding()
+    check_dependencies()
+
+    parser = argparse.ArgumentParser(
+        description="Fetch a web page and extract content as Markdown"
+    )
+    parser.add_argument("url", help="URL to fetch")
+    parser.add_argument("--raw", action="store_true",
+                        help="Output full page Markdown (no readability extraction)")
+    parser.add_argument("--selector", type=str, default=None,
+                        help="CSS selector to extract specific elements")
+    parser.add_argument("--save", type=str, default=None,
+                        help="Also save output to this file path")
+    parser.add_argument("--max-length", type=int, default=None,
+                        help="Truncate output to N characters")
+    parser.add_argument("--timeout", type=int, default=30,
+                        help="Request timeout in seconds (default: 30)")
+    parser.add_argument("--no-metadata", action="store_true",
+                        help="Skip metadata header in output")
+
+    args = parser.parse_args()
+
+    # Normalize URL
+    url = args.url.strip()
+    if not url.startswith(("http://", "https://")):
+        url = "https://" + url
+
+    print(f"Fetching: {url}", file=sys.stderr)
+
+    # Fetch
+    html, final_url, status = fetch_url(url, timeout=args.timeout)
+    print(f"Status: {status}, Size: {len(html)} characters", file=sys.stderr)
+
+    if final_url != url:
+        print(f"Redirected to: {final_url}", file=sys.stderr)
+
+    # Extract metadata
+    meta = extract_metadata(html) if not args.no_metadata else {}
+
+    # Extract content
+    if args.selector:
+        # CSS selector mode
+        selected_html = extract_with_selector(html, args.selector)
+        if not selected_html:
+            print(f"Warning: no elements matched selector '{args.selector}'", file=sys.stderr)
+            print(f"[No elements matched CSS selector: {args.selector}]")
+            sys.exit(0)
+        title = meta.get("title", "")
+        content_md = html_to_markdown(selected_html, base_url=final_url)
+    elif args.raw:
+        # Raw full-page mode
+        title = meta.get("title", "")
+        content_md = html_to_markdown(html, base_url=final_url)
+    else:
+        # Readability extraction mode (default)
+        title, article_html = extract_with_readability(html, final_url)
+        content_md = html_to_markdown(article_html, base_url=final_url)
+
+    # Build output
+    parts = []
+
+    if not args.no_metadata and meta:
+        parts.append(f"# {title or meta.get('title') or 'Untitled'}")
+        parts.append(f"\n**Source**: {final_url}")
+        if meta.get("author"):
+            parts.append(f"**Author**: {meta['author']}")
+        if meta.get("published"):
+            parts.append(f"**Published**: {meta['published']}")
+        if meta.get("description"):
+            parts.append(f"**Description**: {meta['description']}")
+        parts.append("\n---\n")
+    elif title and not args.no_metadata:
+        parts.append(f"# {title}\n")
+
+    parts.append(content_md)
+
+    output = "\n".join(parts)
+
+    # Truncate if requested
+    if args.max_length and len(output) > args.max_length:
+        output = output[:args.max_length] + f"\n\n[... truncated at {args.max_length} characters, total {len(output)}]"
+
+    # Print to stdout
+    print(output)
+
+    content_length = len(content_md)
+    print(f"\nExtracted: {content_length} characters", file=sys.stderr)
+
+    # Save to file if requested
+    if args.save:
+        try:
+            with open(args.save, "w", encoding="utf-8") as f:
+                f.write(output)
+            print(f"Saved to: {args.save}", file=sys.stderr)
+        except Exception as e:
+            print(f"Error saving file: {e}", file=sys.stderr)
+
+
+if __name__ == "__main__":
+    main()
--- a/skills/aidotnet-web-scraper/scripts/search_web.py
+++ b/skills/aidotnet-web-scraper/scripts/search_web.py
@@ -0,0 +1,155 @@
+#!/usr/bin/env python3
+"""Search the web using DuckDuckGo and return structured results.
+
+No API key required. Returns results with title, URL, and snippet.
+
+Dependencies: pip install ddgs
+"""
+
+import sys
+import argparse
+import json
+
+
+def setup_encoding():
+    """Setup proper encoding for Windows console output."""
+    if sys.platform == "win32":
+        import io
+        try:
+            sys.stdout.reconfigure(encoding='utf-8', errors='replace')
+            sys.stderr.reconfigure(encoding='utf-8', errors='replace')
+        except (AttributeError, io.UnsupportedOperation):
+            sys.stdout = io.TextIOWrapper(
+                sys.stdout.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+            sys.stderr = io.TextIOWrapper(
+                sys.stderr.buffer, encoding='utf-8', errors='replace', line_buffering=True
+            )
+
+
+def check_dependencies():
+    """Check that required packages are installed."""
+    try:
+        from ddgs import DDGS  # noqa: F401
+    except ImportError:
+        try:
+            from duckduckgo_search import DDGS  # noqa: F401
+        except ImportError:
+            print("Error: ddgs not installed.", file=sys.stderr)
+            print("Install with: pip install ddgs", file=sys.stderr)
+            sys.exit(1)
+
+
+def _get_ddgs_class():
+    """Import DDGS from ddgs (new) or duckduckgo_search (legacy)."""
+    try:
+        from ddgs import DDGS
+        return DDGS
+    except ImportError:
+        from duckduckgo_search import DDGS
+        return DDGS
+
+
+def search_text(query, max_results=10, region="wt-wt"):
+    """Run a DuckDuckGo text search and return the results as a list of dicts."""
+    DDGS = _get_ddgs_class()
+    ddgs = DDGS()
+    try:
+        # New ddgs package: positional 'query' arg
+        results = list(ddgs.text(query, region=region, max_results=max_results))
+    except TypeError:
+        # Legacy duckduckgo_search: 'keywords' kwarg + context manager
+        with DDGS() as d:
+            results = list(d.text(keywords=query, region=region, max_results=max_results))
+    return results
+
+
+def search_news(query, max_results=10, region="wt-wt"):
+    """Run a DuckDuckGo news search and return the results as a list of dicts."""
+    DDGS = _get_ddgs_class()
+    ddgs = DDGS()
+    try:
+        results = list(ddgs.news(query, region=region, max_results=max_results))
+    except TypeError:
+        with DDGS() as d:
+            results = list(d.news(keywords=query, region=region, max_results=max_results))
+    return results
+
+
+def format_results_markdown(results, query, is_news=False):
+    """Format search results as Markdown."""
+    search_type = "News" if is_news else "Web"
+    parts = [f"# {search_type} Search Results: {query}\n"]
+    parts.append(f"Found **{len(results)}** results.\n")
+
+    for i, r in enumerate(results, 1):
+        title = r.get("title", "Untitled")
+        url = r.get("href") or r.get("url") or r.get("link", "")
+        body = r.get("body") or r.get("snippet", "")
+        date = r.get("date", "")
+
+        parts.append(f"## {i}. {title}")
+        parts.append(f"**URL**: {url}")
+        if date:
+            parts.append(f"**Date**: {date}")
+        if body:
+            parts.append(f"\n{body}")
+        parts.append("")  # blank line
+
+    return "\n".join(parts)
+
+
+def format_results_json(results):
+    """Format search results as JSON."""
+    return json.dumps(results, indent=2, ensure_ascii=False)
+
+
+def main():
+    setup_encoding()
+    check_dependencies()
+
+    parser = argparse.ArgumentParser(
+        description="Search the web via DuckDuckGo"
+    )
+    parser.add_argument("query", help="Search query")
+    parser.add_argument("--max-results", type=int, default=10,
+                        help="Number of results (default: 10)")
+    parser.add_argument("--region", type=str, default="wt-wt",
+                        help="Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt)")
+    parser.add_argument("--news", action="store_true",
+                        help="Search news instead of general web")
+    parser.add_argument("--json", action="store_true",
+                        help="Output as JSON instead of Markdown")
+
+    args = parser.parse_args()
+
+    query = args.query.strip()
+    if not query:
+        print("Error: empty query", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"Searching: {query} (region={args.region}, max={args.max_results})", file=sys.stderr)
+
+    try:
+        if args.news:
+            results = search_news(query, args.max_results, args.region)
+        else:
+            results = search_text(query, args.max_results, args.region)
+    except Exception as e:
+        print(f"Error: search failed: {e}", file=sys.stderr)
+        sys.exit(1)
+
+    if not results:
+        print("[]" if args.json else f"No results found for: {query}")
+        sys.exit(0)
+
+    print(f"Got {len(results)} results", file=sys.stderr)
+
+    if args.json:
+        print(format_results_json(results))
+    else:
+        print(format_results_markdown(results, query, is_news=args.news))
+
+
+if __name__ == "__main__":
+    main()