Reddit r/selfhosted 게시문 초안

Title

[Project] Self-Healing System for AI Agents - Uses Claude Code as Emergency Doctor

Body

TL;DR: Built a 4-tier watchdog system that can autonomously diagnose and fix crashes using AI.

The Problem

Running AI agents (OpenClaw, similar to n8n but for AI) means dealing with random crashes at 3am. Traditional watchdogs just restart - they don't understand why things broke.

The Solution

4-tier escalation system:

Level	Trigger	Action
1	Process dead	Restart (180s)
2	HTTP unhealthy	Retry 3x, then escalate (300s)
3	L2 failed	Launch Claude Code in tmux, diagnose, fix (30min)
4	L3 failed	Discord alert to human

The Interesting Part

Level 3 launches Claude Code (Anthropic's CLI) in a tmux PTY session. It:

Reads system logs
Analyzes error patterns
Identifies root cause
Executes fixes autonomously
Reports results

All logged. All auditable. 30-minute timeout for human intervention.

Security

Isolated tmux session
Cannot access main credentials
Read-only config access
Cleanup trap prevents orphans

1.3 KiB Raw Blame History