Files

Krilly 57dd294675 AI Newsletter Digest improvements: fixed QP soft line break decoding, URL extraction, and content cleaning

2026-03-04 13:29:22 +00:00

2.2 KiB

Raw Permalink Blame History

title, published, description, tags, cover_image

title	published	description	tags	cover_image
I Built an AI Doctor for My AI Agent — Here's How It Works	false	A 4-tier self-healing system that uses Claude Code to autonomously diagnose and fix crashes	ai, devops, automation, opensource	https://github.com/Ramsbaby/openclaw-self-healing/raw/main/assets/demo.gif

I Built an AI Doctor for My AI Agent

TL;DR: My AI agent kept crashing at 3am. So I built another AI to fix it.

The Problem

I run OpenClaw, an AI agent framework, on a Mac Mini. It's great until it crashes at 3am and I wake up to a dead assistant.

Traditional watchdogs just restart the process. They don't understand why it crashed.

The Solution: 4-Tier Self-Healing

Level 1: Watchdog (180s)     → Process dead? Restart.
Level 2: Health Check (300s) → HTTP failing? Retry 3x.
Level 3: Claude Doctor (30m) → AI diagnosis + autonomous fix 🧠
Level 4: Discord Alert       → Human escalation

The Interesting Part: Level 3

When Levels 1-2 fail, the system launches Claude Code (Anthropic's CLI) in a tmux PTY session:

tmux new-session -d -s emergency-recovery
tmux send-keys "claude --dangerously-skip-permissions" Enter
tmux send-keys "Gateway is down. Diagnose and fix." Enter

Claude then:

Runs openclaw status
Reads system logs
Identifies root cause (stale PID, port conflict, config error...)
Executes fixes
Verifies recovery

All logged. All auditable.

Security Model

Isolated tmux session (no main session access)
Read-only config access
30-minute hard timeout
Cleanup trap prevents orphan processes
Level 4 watchdog monitors Level 3

The Philosophy

"If we trust AI to write code, why not trust it to fix infrastructure?"

The AI already knows how to diagnose problems — it does it every day when developers ask for help. We just gave it permission to act on its diagnosis.

Results

Recovery time: 25 seconds (vs 8+ hours waiting for human)
False positives: 0 (so far)
My sleep quality: Improved 🛌

Try It

MIT licensed. Bash scripts only. Works on macOS (Linux guide included).

{% github Ramsbaby/openclaw-self-healing %}

What do you think? Would you trust an AI doctor for your infrastructure?

2.2 KiB Raw Permalink Blame History