# OpenClaw Self-Healing System
> **"The system that heals itself โ or calls for help when it can't."**
A production-ready, 4-tier autonomous recovery system for [OpenClaw](https://github.com/openclaw/openclaw) Gateway, featuring AI-powered diagnosis and repair via Claude Code.
[](https://github.com/Ramsbaby/openclaw-self-healing/releases/tag/v2.0.1)
[](https://github.com/Ramsbaby/openclaw-self-healing/actions/workflows/shellcheck.yml)
[](https://opensource.org/licenses/MIT)
[](https://www.apple.com/macos/)
[](https://openclaw.ai/)
---
## ๐ฌ Demo

*The 4-tier recovery in action: Watchdog โ Health Check โ Claude Doctor โ Alert*
---
## ๐ Why This Exists
OpenClaw Gateway crashes happen. Health checks fail. Developers wake up to dead agents.
**This system watches your watcher.** When OpenClaw goes down, it:
1. **Restarts it** (Level 1-2, seconds)
2. **Diagnoses the problem** (Level 3, AI-powered)
3. **Fixes the root cause** (Level 3, autonomous)
4. **Alerts you** (Level 4, only if all else fails)
Unlike simple watchdogs that just restart processes, **this system understands _why_ things broke and how to fix them** โ thanks to Claude Code acting as an emergency doctor.
---
## ๐๏ธ Architecture
```
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Level 1: Gateway KeepAlive (instant) โ
โ โโ LaunchAgent: ai.openclaw.gateway โ
โ โโ launchd auto-restart on crash โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ (if Gateway needs monitoring)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Level 2: Watchdog (180s interval) ๐ โ
โ โโ LaunchAgent: ai.openclaw.watchdog + KeepAlive โ
โ โโ PID check + HTTP health check โ
โ โโ Memory monitoring (1.5GB warning, 2GB critical) โ
โ โโ Exponential backoff (10s โ 600s) โ
โ โโ SIGUSR1 graceful restart or launchctl kickstart โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ (if Watchdog hangs/crashes)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Level 3: LaunchAgent Guardian (180s cron) ๐ก๏ธ โ
โ โโ Cron-based (independent from launchd) โ
โ โโ Detects "loaded but not running" state (PID -) โ
โ โโ Auto-kickstart hung services โ
โ โโ Discord alert on recovery โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ (monitoring for escalation)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Level 4: Discord Notification ๐จ โ
โ โโ 3 consecutive failures โ alert โ
โ โโ 15-minute cooldown between alerts โ
โ โโ Detailed failure context + logs โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
```
---
## โจ What Makes This Special
### 1. **AI-Powered Diagnosis** ๐ง
- **Claude Code** as an emergency doctor
- 30-minute autonomous troubleshooting session
- Generates human-readable recovery reports
- **First of its kind** for OpenClaw
### 2. **Production-Tested** โ
- Level 1 verified: Gateway KeepAlive auto-restart
- Level 2 verified: Watchdog v4 + KeepAlive (exponential backoff)
- Level 3 verified: 2026-02-07 20:07 (Guardian PID check โ kickstart recovery)
- Real failures, real logs, **real bug fixes** (v1.1.0)
### 3. **Meta-Level Self-Healing** ๐
- **"AI heals AI"** โ OpenClaw fixes OpenClaw
- Unlike external infrastructure monitors, this targets the agent itself
- Systematic escalation prevents false alarms
### 4. **Persistent Learning** ๐ *(NEW in v2.0)*
- Automatic recovery documentation (`recovery-learnings.md`)
- Cumulative knowledge base: symptom โ root cause โ solution โ prevention
- Claude learns from past incidents (addresses ContextVault feedback)
- Reasoning logs capture decision-making process
### 5. **Enhanced Observability** ๐ *(NEW in v2.0)*
- Metrics dashboard with success rate, avg recovery time
- Trending analysis (7-day window)
- Top symptoms and root causes tracking
- Explainable AI: understand why Claude chose specific fixes
### 6. **Multi-Channel Alerts** ๐ฑ *(NEW in v2.0)*
- Discord webhooks (original)
- Telegram bot support (new alternative)
- Configure one or both notification channels
### 7. **Safe by Design** ๐
- No secrets in code (`.env` for webhooks)
- Lock files prevent race conditions
- Atomic writes for alert tracking
- Automatic log rotation (14-day cleanup)
### 8. **Elegant Simplicity** ๐จ
- 4 bash scripts (~400 lines total)
- 1 LaunchAgent, 1 cron job
- Zero external dependencies (except tmux + Claude CLI + jq)
---
## โก One-Click Install (Recommended)
```bash
curl -sSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash
```
**That's it.** The installer will:
- โ
Check prerequisites (tmux, Claude CLI, OpenClaw)
- โ
Download and install all scripts
- โ
Set up the LaunchAgent
- โ
Configure environment
Custom workspace? Use:
```bash
curl -sSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash -s -- --workspace ~/my-openclaw
```
---
## ๐ Manual Installation (5 minutes)
Click to expand manual installation steps
### Prerequisites
- **macOS** 10.14+ (Catalina or later)
- **OpenClaw** installed and running
- **Homebrew** (for tmux)
- **Claude Code CLI** (`npm install -g @anthropic-ai/claude-code`)
### Installation
```bash
# 1. Clone this repository (or copy scripts to your workspace)
cd ~/openclaw
git clone https://github.com/ramsbaby/openclaw-self-healing.git
cd openclaw-self-healing
# 2. Install dependencies
brew install tmux
npm install -g @anthropic-ai/claude-code
# 3. Copy environment template
cp .env.example ~/.openclaw/.env
# 4. Edit .env with your Discord webhook (optional)
nano ~/.openclaw/.env
# Set DISCORD_WEBHOOK_URL to your webhook URL
# 5. Copy scripts to OpenClaw workspace
cp scripts/*.sh ~/openclaw/scripts/
cp scripts/launchd-guardian.sh ~/.openclaw/scripts/
chmod +x ~/openclaw/scripts/*.sh ~/.openclaw/scripts/*.sh
# 6. Load Watchdog LaunchAgent (v1.1.0+ with KeepAlive)
cp launchagent/ai.openclaw.watchdog.plist ~/Library/LaunchAgents/
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.openclaw.watchdog.plist
# 7. Add Guardian cron (watches the watchdog)
(crontab -l 2>/dev/null; echo "*/3 * * * * /bin/bash ~/.openclaw/scripts/launchd-guardian.sh 2>/dev/null") | crontab -
# 8. Verify installation
launchctl list | grep openclaw.watchdog
# Expected: PID (running) or - (waiting for next interval)
```
### Verification
```bash
# Check Health Check is running
launchctl list | grep openclaw.healthcheck
# View Health Check logs
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log
# Simulate a crash (optional)
kill -9 $(pgrep -f openclaw-gateway)
# Wait 3 minutes, then check if it auto-recovered
curl http://localhost:18789/
```
Made with ๐ฆ and too much coffee by @ramsbaby
"The best system is one that fixes itself before you notice it's broken."