5.0 KiB
title, published, description, tags, cover_image, canonical_url
| title | published | description | tags | cover_image | canonical_url |
|---|---|---|---|---|---|
| I Built a Self-Healing AI System Where Claude Code Acts as Emergency Doctor | false | 4-tier autonomous recovery for OpenClaw Gateway — featuring the world's first AI-powered diagnosis and repair system | ai, devops, automation, opensource | https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/docs/images/architecture.png | https://github.com/Ramsbaby/openclaw-self-healing |
TL;DR
- Problem: AI agents crash at night, no one's awake to fix them
- Solution: 4-tier self-healing (Watchdog → Health Check → Claude Doctor → Alert)
- Result: Recovery time 30min → 5min, zero manual intervention for 90% of issues
- Unique: Claude Code as autonomous emergency doctor (world's first!)
The Wake-Up Call
"Jarvis, why aren't you responding?"
2 AM. My AI assistant was dead. Again.
The process was alive, but HTTP responses were timing out. Memory looked fine, but API calls were failing. Traditional process monitoring couldn't catch these "zombie" states.
The irony: An AI that runs 24/7 needs someone to watch it 24/7. But I need sleep.
So I built a system where AI heals AI.
Architecture: 4 Levels of Defense
┌─────────────────────────────────────────────────────────┐
│ Level 1: Watchdog (60s interval) │
│ └─ Process dead? → Restart │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Level 2: Health Check (300s interval) │
│ └─ HTTP 200 failing? → 3 retries → Level 3 │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Level 3: Claude Emergency Recovery (30m timeout) 🧠 │
│ ├─ Launch Claude Code in tmux PTY │
│ ├─ Autonomous diagnosis (logs, config, ports) │
│ ├─ Autonomous repair (fix & restart) │
│ └─ Generate recovery report │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Level 4: Human Alert (Discord notification) 🚨 │
│ └─ Only when AI doctor fails │
└─────────────────────────────────────────────────────────┘
The Secret Sauce: Claude as Doctor
Level 3 is where the magic happens. When Levels 1-2 fail, we spawn Claude Code in a tmux PTY session with this prompt:
You are an OpenClaw Gateway emergency doctor.
The gateway has been unresponsive for 5+ minutes.
Diagnose and fix the issue:
1. Check `openclaw status`
2. Analyze recent logs
3. Validate configuration
4. Check for port conflicts
5. Attempt repairs
6. Verify HTTP 200 response
You have 30 minutes. Save humanity.
Claude then autonomously:
- Reads logs and identifies patterns
- Checks configuration for errors
- Restarts services with fixes
- Validates the repair worked
It's like having a senior DevOps engineer on call 24/7.
Real-World Results
| Metric | Before | After |
|---|---|---|
| Avg recovery time | 30 min | 5 min |
| Night incidents resolved | 0% | 90% |
| Manual interventions/week | 5 | 0.5 |
The system has been running in production for 2 weeks. Level 3 (Claude Doctor) has been triggered twice and successfully resolved both issues without human intervention.
Try It Yourself
# One-click install
curl -sSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash
GitHub: https://github.com/Ramsbaby/openclaw-self-healing
ClawHub: clawhub install openclaw-self-healing
What's Next?
- Linux support (currently macOS only)
- Multi-node healing
- Cost optimization (Claude API isn't free!)
Have you built self-healing systems for AI agents? I'd love to hear your approach in the comments!
🦞 Built with love for the OpenClaw community