Files
openclaw-backups/skills/openclaw-self-healing/memory/reddit-post-draft.md

1.3 KiB

Reddit r/selfhosted 게시문 초안

Title

[Project] Self-Healing System for AI Agents - Uses Claude Code as Emergency Doctor

Body

TL;DR: Built a 4-tier watchdog system that can autonomously diagnose and fix crashes using AI.


The Problem

Running AI agents (OpenClaw, similar to n8n but for AI) means dealing with random crashes at 3am. Traditional watchdogs just restart - they don't understand why things broke.

The Solution

4-tier escalation system:

Level Trigger Action
1 Process dead Restart (180s)
2 HTTP unhealthy Retry 3x, then escalate (300s)
3 L2 failed Launch Claude Code in tmux, diagnose, fix (30min)
4 L3 failed Discord alert to human

The Interesting Part

Level 3 launches Claude Code (Anthropic's CLI) in a tmux PTY session. It:

  • Reads system logs
  • Analyzes error patterns
  • Identifies root cause
  • Executes fixes autonomously
  • Reports results

All logged. All auditable. 30-minute timeout for human intervention.

Security

  • Isolated tmux session
  • Cannot access main credentials
  • Read-only config access
  • Cleanup trap prevents orphans

Would love feedback from the self-hosted community. Anyone else running AI agents 24/7?