1.3 KiB
1.3 KiB
Reddit r/selfhosted 게시문 초안
Title
[Project] Self-Healing System for AI Agents - Uses Claude Code as Emergency Doctor
Body
TL;DR: Built a 4-tier watchdog system that can autonomously diagnose and fix crashes using AI.
The Problem
Running AI agents (OpenClaw, similar to n8n but for AI) means dealing with random crashes at 3am. Traditional watchdogs just restart - they don't understand why things broke.
The Solution
4-tier escalation system:
| Level | Trigger | Action |
|---|---|---|
| 1 | Process dead | Restart (180s) |
| 2 | HTTP unhealthy | Retry 3x, then escalate (300s) |
| 3 | L2 failed | Launch Claude Code in tmux, diagnose, fix (30min) |
| 4 | L3 failed | Discord alert to human |
The Interesting Part
Level 3 launches Claude Code (Anthropic's CLI) in a tmux PTY session. It:
- Reads system logs
- Analyzes error patterns
- Identifies root cause
- Executes fixes autonomously
- Reports results
All logged. All auditable. 30-minute timeout for human intervention.
Security
- Isolated tmux session
- Cannot access main credentials
- Read-only config access
- Cleanup trap prevents orphans
Links
- GitHub: https://github.com/Ramsbaby/openclaw-self-healing
- Demo GIF in README
Would love feedback from the self-hosted community. Anyone else running AI agents 24/7?