45 lines
1.3 KiB
Markdown
45 lines
1.3 KiB
Markdown
# Reddit r/selfhosted 게시문 초안
|
|
|
|
## Title
|
|
[Project] Self-Healing System for AI Agents - Uses Claude Code as Emergency Doctor
|
|
|
|
## Body
|
|
**TL;DR:** Built a 4-tier watchdog system that can autonomously diagnose and fix crashes using AI.
|
|
|
|
---
|
|
|
|
## The Problem
|
|
Running AI agents (OpenClaw, similar to n8n but for AI) means dealing with random crashes at 3am. Traditional watchdogs just restart - they don't understand *why* things broke.
|
|
|
|
## The Solution
|
|
4-tier escalation system:
|
|
|
|
| Level | Trigger | Action |
|
|
|-------|---------|--------|
|
|
| 1 | Process dead | Restart (180s) |
|
|
| 2 | HTTP unhealthy | Retry 3x, then escalate (300s) |
|
|
| 3 | L2 failed | Launch Claude Code in tmux, diagnose, fix (30min) |
|
|
| 4 | L3 failed | Discord alert to human |
|
|
|
|
## The Interesting Part
|
|
Level 3 launches Claude Code (Anthropic's CLI) in a tmux PTY session. It:
|
|
- Reads system logs
|
|
- Analyzes error patterns
|
|
- Identifies root cause
|
|
- Executes fixes autonomously
|
|
- Reports results
|
|
|
|
All logged. All auditable. 30-minute timeout for human intervention.
|
|
|
|
## Security
|
|
- Isolated tmux session
|
|
- Cannot access main credentials
|
|
- Read-only config access
|
|
- Cleanup trap prevents orphans
|
|
|
|
## Links
|
|
- GitHub: https://github.com/Ramsbaby/openclaw-self-healing
|
|
- Demo GIF in README
|
|
|
|
Would love feedback from the self-hosted community. Anyone else running AI agents 24/7?
|