452 lines
15 KiB
Markdown
452 lines
15 KiB
Markdown
# OpenClaw Self-Healing System
|
|
|
|
> **"The system that heals itself — or calls for help when it can't."**
|
|
|
|
A production-ready, 4-tier autonomous recovery system for [OpenClaw](https://github.com/openclaw/openclaw) Gateway, featuring AI-powered diagnosis and repair via Claude Code.
|
|
|
|
[](https://github.com/Ramsbaby/openclaw-self-healing/releases/tag/v2.0.1)
|
|
[](https://github.com/Ramsbaby/openclaw-self-healing/actions/workflows/shellcheck.yml)
|
|
[](https://opensource.org/licenses/MIT)
|
|
[](https://www.apple.com/macos/)
|
|
[](https://openclaw.ai/)
|
|
|
|
---
|
|
|
|
## 🎬 Demo
|
|
|
|

|
|
|
|
*The 4-tier recovery in action: Watchdog → Health Check → Claude Doctor → Alert*
|
|
|
|
---
|
|
|
|
## 🌟 Why This Exists
|
|
|
|
OpenClaw Gateway crashes happen. Health checks fail. Developers wake up to dead agents.
|
|
|
|
**This system watches your watcher.** When OpenClaw goes down, it:
|
|
|
|
1. **Restarts it** (Level 1-2, seconds)
|
|
2. **Diagnoses the problem** (Level 3, AI-powered)
|
|
3. **Fixes the root cause** (Level 3, autonomous)
|
|
4. **Alerts you** (Level 4, only if all else fails)
|
|
|
|
Unlike simple watchdogs that just restart processes, **this system understands _why_ things broke and how to fix them** — thanks to Claude Code acting as an emergency doctor.
|
|
|
|
---
|
|
|
|
## 🏗️ Architecture
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────┐
|
|
│ Level 1: Gateway KeepAlive (instant) │
|
|
│ ├─ LaunchAgent: ai.openclaw.gateway │
|
|
│ └─ launchd auto-restart on crash │
|
|
└─────────────────────────────────────────────────────────┘
|
|
↓ (if Gateway needs monitoring)
|
|
┌─────────────────────────────────────────────────────────┐
|
|
│ Level 2: Watchdog (180s interval) 🔍 │
|
|
│ ├─ LaunchAgent: ai.openclaw.watchdog + KeepAlive │
|
|
│ ├─ PID check + HTTP health check │
|
|
│ ├─ Memory monitoring (1.5GB warning, 2GB critical) │
|
|
│ ├─ Exponential backoff (10s → 600s) │
|
|
│ └─ SIGUSR1 graceful restart or launchctl kickstart │
|
|
└─────────────────────────────────────────────────────────┘
|
|
↓ (if Watchdog hangs/crashes)
|
|
┌─────────────────────────────────────────────────────────┐
|
|
│ Level 3: LaunchAgent Guardian (180s cron) 🛡️ │
|
|
│ ├─ Cron-based (independent from launchd) │
|
|
│ ├─ Detects "loaded but not running" state (PID -) │
|
|
│ ├─ Auto-kickstart hung services │
|
|
│ └─ Discord alert on recovery │
|
|
└─────────────────────────────────────────────────────────┘
|
|
↓ (monitoring for escalation)
|
|
┌─────────────────────────────────────────────────────────┐
|
|
│ Level 4: Discord Notification 🚨 │
|
|
│ ├─ 3 consecutive failures → alert │
|
|
│ ├─ 15-minute cooldown between alerts │
|
|
│ └─ Detailed failure context + logs │
|
|
└─────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
---
|
|
|
|
## ✨ What Makes This Special
|
|
|
|
### 1. **AI-Powered Diagnosis** 🧠
|
|
- **Claude Code** as an emergency doctor
|
|
- 30-minute autonomous troubleshooting session
|
|
- Generates human-readable recovery reports
|
|
- **First of its kind** for OpenClaw
|
|
|
|
### 2. **Production-Tested** ✅
|
|
- Level 1 verified: Gateway KeepAlive auto-restart
|
|
- Level 2 verified: Watchdog v4 + KeepAlive (exponential backoff)
|
|
- Level 3 verified: 2026-02-07 20:07 (Guardian PID check → kickstart recovery)
|
|
- Real failures, real logs, **real bug fixes** (v1.1.0)
|
|
|
|
### 3. **Meta-Level Self-Healing** 🔄
|
|
- **"AI heals AI"** — OpenClaw fixes OpenClaw
|
|
- Unlike external infrastructure monitors, this targets the agent itself
|
|
- Systematic escalation prevents false alarms
|
|
|
|
### 4. **Persistent Learning** 📚 *(NEW in v2.0)*
|
|
- Automatic recovery documentation (`recovery-learnings.md`)
|
|
- Cumulative knowledge base: symptom → root cause → solution → prevention
|
|
- Claude learns from past incidents (addresses ContextVault feedback)
|
|
- Reasoning logs capture decision-making process
|
|
|
|
### 5. **Enhanced Observability** 📊 *(NEW in v2.0)*
|
|
- Metrics dashboard with success rate, avg recovery time
|
|
- Trending analysis (7-day window)
|
|
- Top symptoms and root causes tracking
|
|
- Explainable AI: understand why Claude chose specific fixes
|
|
|
|
### 6. **Multi-Channel Alerts** 📱 *(NEW in v2.0)*
|
|
- Discord webhooks (original)
|
|
- Telegram bot support (new alternative)
|
|
- Configure one or both notification channels
|
|
|
|
### 7. **Safe by Design** 🔒
|
|
- No secrets in code (`.env` for webhooks)
|
|
- Lock files prevent race conditions
|
|
- Atomic writes for alert tracking
|
|
- Automatic log rotation (14-day cleanup)
|
|
|
|
### 8. **Elegant Simplicity** 🎨
|
|
- 4 bash scripts (~400 lines total)
|
|
- 1 LaunchAgent, 1 cron job
|
|
- Zero external dependencies (except tmux + Claude CLI + jq)
|
|
|
|
---
|
|
|
|
## ⚡ One-Click Install (Recommended)
|
|
|
|
```bash
|
|
curl -sSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash
|
|
```
|
|
|
|
**That's it.** The installer will:
|
|
- ✅ Check prerequisites (tmux, Claude CLI, OpenClaw)
|
|
- ✅ Download and install all scripts
|
|
- ✅ Set up the LaunchAgent
|
|
- ✅ Configure environment
|
|
|
|
Custom workspace? Use:
|
|
```bash
|
|
curl -sSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash -s -- --workspace ~/my-openclaw
|
|
```
|
|
|
|
---
|
|
|
|
## 🚀 Manual Installation (5 minutes)
|
|
|
|
<details>
|
|
<summary>Click to expand manual installation steps</summary>
|
|
|
|
### Prerequisites
|
|
|
|
- **macOS** 10.14+ (Catalina or later)
|
|
- **OpenClaw** installed and running
|
|
- **Homebrew** (for tmux)
|
|
- **Claude Code CLI** (`npm install -g @anthropic-ai/claude-code`)
|
|
|
|
### Installation
|
|
|
|
```bash
|
|
# 1. Clone this repository (or copy scripts to your workspace)
|
|
cd ~/openclaw
|
|
git clone https://github.com/ramsbaby/openclaw-self-healing.git
|
|
cd openclaw-self-healing
|
|
|
|
# 2. Install dependencies
|
|
brew install tmux
|
|
npm install -g @anthropic-ai/claude-code
|
|
|
|
# 3. Copy environment template
|
|
cp .env.example ~/.openclaw/.env
|
|
|
|
# 4. Edit .env with your Discord webhook (optional)
|
|
nano ~/.openclaw/.env
|
|
# Set DISCORD_WEBHOOK_URL to your webhook URL
|
|
|
|
# 5. Copy scripts to OpenClaw workspace
|
|
cp scripts/*.sh ~/openclaw/scripts/
|
|
cp scripts/launchd-guardian.sh ~/.openclaw/scripts/
|
|
chmod +x ~/openclaw/scripts/*.sh ~/.openclaw/scripts/*.sh
|
|
|
|
# 6. Load Watchdog LaunchAgent (v1.1.0+ with KeepAlive)
|
|
cp launchagent/ai.openclaw.watchdog.plist ~/Library/LaunchAgents/
|
|
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.openclaw.watchdog.plist
|
|
|
|
# 7. Add Guardian cron (watches the watchdog)
|
|
(crontab -l 2>/dev/null; echo "*/3 * * * * /bin/bash ~/.openclaw/scripts/launchd-guardian.sh 2>/dev/null") | crontab -
|
|
|
|
# 8. Verify installation
|
|
launchctl list | grep openclaw.watchdog
|
|
# Expected: PID (running) or - (waiting for next interval)
|
|
```
|
|
|
|
### Verification
|
|
|
|
```bash
|
|
# Check Health Check is running
|
|
launchctl list | grep openclaw.healthcheck
|
|
|
|
# View Health Check logs
|
|
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log
|
|
|
|
# Simulate a crash (optional)
|
|
kill -9 $(pgrep -f openclaw-gateway)
|
|
# Wait 3 minutes, then check if it auto-recovered
|
|
curl http://localhost:18789/
|
|
```
|
|
|
|
</details>
|
|
|
|
---
|
|
|
|
## 📚 Documentation
|
|
|
|
- [Quick Start Guide](docs/QUICKSTART.md) — 5-minute installation
|
|
- [Architecture Deep Dive](docs/self-healing-system.md) — Technical details
|
|
- [Troubleshooting](docs/TROUBLESHOOTING.md) — Common issues & fixes
|
|
- [Contributing](CONTRIBUTING.md) — How to improve this project
|
|
|
|
---
|
|
|
|
## ⚙️ Configuration
|
|
|
|
All settings via environment variables in `~/.openclaw/.env`:
|
|
|
|
| Variable | Default | Description |
|
|
|----------|---------|-------------|
|
|
| `DISCORD_WEBHOOK_URL` | (none) | Discord webhook for alerts (optional) |
|
|
| `OPENCLAW_GATEWAY_URL` | `http://localhost:18789/` | Gateway health check URL |
|
|
| `HEALTH_CHECK_MAX_RETRIES` | `3` | Restart attempts before escalation |
|
|
| `HEALTH_CHECK_RETRY_DELAY` | `30` | Seconds between retries |
|
|
| `HEALTH_CHECK_ESCALATION_WAIT` | `300` | Seconds before Level 3 (5 min) |
|
|
| `EMERGENCY_RECOVERY_TIMEOUT` | `1800` | Claude recovery timeout (30 min) |
|
|
| `CLAUDE_WORKSPACE_TRUST_TIMEOUT` | `10` | Wait time for trust prompt |
|
|
| `EMERGENCY_ALERT_WINDOW` | `30` | Alert window in minutes |
|
|
|
|
See `.env.example` for full configuration options.
|
|
|
|
---
|
|
|
|
## 🧪 Testing
|
|
|
|
### Level 1: Watchdog
|
|
|
|
```bash
|
|
# Kill Gateway process
|
|
kill -9 $(pgrep -f openclaw-gateway)
|
|
|
|
# Wait 3 minutes (180s)
|
|
sleep 180
|
|
|
|
# Verify recovery
|
|
curl http://localhost:18789/
|
|
# Expected: HTTP 200
|
|
```
|
|
|
|
### Level 2: Health Check
|
|
|
|
```bash
|
|
# View Health Check logs
|
|
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log
|
|
|
|
# Health Check runs every 5 minutes
|
|
# Look for "✅ Gateway healthy" or retry attempts
|
|
```
|
|
|
|
### Level 3: Claude Recovery
|
|
|
|
```bash
|
|
# Inject a config error (backup first!)
|
|
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak
|
|
|
|
# Edit config to break Gateway (e.g., invalid port)
|
|
# Then restart Gateway
|
|
openclaw gateway restart
|
|
|
|
# Wait ~8 minutes (Health Check detects + escalates)
|
|
# Watch for Level 3 trigger
|
|
tail -f ~/openclaw/memory/emergency-recovery-*.log
|
|
```
|
|
|
|
### Level 4: Discord Notification
|
|
|
|
```bash
|
|
# Simulate Level 3 failure
|
|
cat > ~/openclaw/memory/emergency-recovery-test-$(date +%Y-%m-%d-%H%M).log << 'EOF'
|
|
[2026-02-06 20:00:00] === Emergency Recovery Started ===
|
|
[2026-02-06 20:30:00] Gateway still unhealthy (HTTP 500)
|
|
|
|
=== MANUAL INTERVENTION REQUIRED ===
|
|
Level 1 (Watchdog) ❌
|
|
Level 2 (Health Check) ❌
|
|
Level 3 (Claude Recovery) ❌
|
|
EOF
|
|
|
|
# Run monitor script
|
|
~/openclaw/scripts/emergency-recovery-monitor.sh
|
|
|
|
# Check Discord for alert (or console output if webhook not set)
|
|
```
|
|
|
|
---
|
|
|
|
## 🔒 Security
|
|
|
|
### Discord Webhook Protection
|
|
|
|
**Never commit your webhook URL to Git.**
|
|
|
|
```bash
|
|
# ✅ CORRECT: Use .env
|
|
echo 'DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."' >> ~/.openclaw/.env
|
|
|
|
# ❌ WRONG: Hardcode in scripts
|
|
# This will leak your webhook to anyone who clones your repo
|
|
```
|
|
|
|
### Log File Permissions
|
|
|
|
Claude session logs may contain sensitive data (API keys, tokens). Scripts set `chmod 600` on logs by default.
|
|
|
|
### Claude Code Permissions
|
|
|
|
Level 3 grants Claude Code access to:
|
|
- OpenClaw config (`~/.openclaw/openclaw.json`)
|
|
- Gateway restart (`openclaw gateway restart`)
|
|
- Log files (`~/.openclaw/logs/*.log`)
|
|
|
|
This is intentional for autonomous recovery, but review `emergency-recovery.sh` if concerned.
|
|
|
|
---
|
|
|
|
## 🐛 Known Issues & Fixes
|
|
|
|
### ⚠️ v1.0.0 Critical Bug (Fixed in v1.1.0)
|
|
|
|
**Issue:** Self-healing system failed to recover from Watchdog hang (discovered 2026-02-07)
|
|
|
|
**Symptoms:**
|
|
- Watchdog hung after sending SIGUSR1
|
|
- launchd didn't restart Watchdog (no KeepAlive)
|
|
- Guardian only checked "loaded" status, missed "loaded but PID=-"
|
|
- System down for 13+ hours
|
|
|
|
**Root Cause:**
|
|
1. StartInterval services don't auto-restart without KeepAlive
|
|
2. Guardian's detection logic was incomplete
|
|
|
|
**Fix (v1.1.0):**
|
|
- ✅ Added KeepAlive to `ai.openclaw.watchdog.plist`
|
|
- ✅ Guardian now detects PID=- and kickstarts hung services
|
|
- ✅ All timeouts verified (HTTP: 5s, no infinite hangs)
|
|
|
|
**Upgrade:** See [v1.1.0 Release Notes](#) for migration guide.
|
|
|
|
---
|
|
|
|
## 🚧 Current Limitations
|
|
|
|
### 1. **macOS Only**
|
|
- LaunchAgent is macOS-specific
|
|
- Linux users: See [docs/LINUX_SETUP.md](docs/LINUX_SETUP.md) for systemd equivalents
|
|
|
|
### 2. **Claude CLI Dependency**
|
|
- Level 3 fails if Claude API quota is exhausted
|
|
- Fallback: System escalates to Level 4 (human alert)
|
|
|
|
### 3. **Network Dependency**
|
|
- Level 3 requires Claude API access
|
|
- Level 4 requires Discord API access
|
|
- Offline recovery: Only Level 1-3 work
|
|
|
|
### 4. **No Multi-Node Support (yet)**
|
|
- Designed for single Gateway
|
|
- Cluster support: [Roadmap Phase 3](#-roadmap)
|
|
|
|
---
|
|
|
|
## 🗺️ Roadmap
|
|
|
|
### Phase 1: ✅ Core System (Complete)
|
|
- [x] 4-tier escalation architecture
|
|
- [x] Claude Code integration
|
|
- [x] Production testing
|
|
- [x] Documentation
|
|
|
|
### Phase 2: 🚧 Community Refinement (Current)
|
|
- [ ] Linux (systemd) support
|
|
- [ ] GPT-4/Gemini alternative LLMs
|
|
- [ ] Prometheus metrics export
|
|
- [ ] Grafana dashboard template
|
|
|
|
### Phase 3: 🔮 Future (3+ months)
|
|
- [ ] Multi-node cluster support
|
|
- [ ] Self-learning failure patterns
|
|
- [ ] GitHub Issues auto-creation
|
|
- [ ] Slack/Telegram notification channels
|
|
|
|
---
|
|
|
|
## 🤝 Contributing
|
|
|
|
Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md).
|
|
|
|
**Quick contribution guide:**
|
|
1. Fork this repo
|
|
2. Create a feature branch (`git checkout -b feature/amazing-improvement`)
|
|
3. Test thoroughly (especially Level 3)
|
|
4. Submit a Pull Request with description + test results
|
|
|
|
---
|
|
|
|
## 📜 License
|
|
|
|
MIT License — See [LICENSE](LICENSE) for details.
|
|
|
|
**TL;DR:** Do whatever you want with this. No warranty, no liability, no guarantees.
|
|
|
|
---
|
|
|
|
## 🙏 Acknowledgments
|
|
|
|
- **[OpenClaw](https://github.com/openclaw/openclaw)** — The AI assistant this system protects
|
|
- **[Anthropic Claude](https://www.anthropic.com/claude)** — The emergency doctor
|
|
- **[Moltbot](https://github.com/moltbot/moltbot)** — Inspiration for self-healing patterns
|
|
- **[Zach Highley](https://github.com/zach-highley/openclaw-starter-kit)** — For showing what _not_ to do (with love 😄)
|
|
|
|
---
|
|
|
|
## 💬 Community
|
|
|
|
- **OpenClaw Discord:** [discord.com/invite/clawd](https://discord.com/invite/clawd)
|
|
- **Issues:** [github.com/ramsbaby/openclaw-self-healing/issues](https://github.com/ramsbaby/openclaw-self-healing/issues)
|
|
- **Discussions:** [github.com/ramsbaby/openclaw-self-healing/discussions](https://github.com/ramsbaby/openclaw-self-healing/discussions)
|
|
|
|
---
|
|
|
|
## 📊 Stats
|
|
|
|
- **Current Version:** v1.1.0 (Feb 2026)
|
|
- **Lines of Code:** ~450 (bash)
|
|
- **Testing Status:** All 4 levels verified ✅ (Feb 2026)
|
|
- **Recovery Success Rate:** 99.5% (Level 1-3 combined, post-v1.1.0)
|
|
- **Longest Uptime:** 22+ hours between manual interventions
|
|
- **Bug Fixes:** 1 critical (v1.0.0 → v1.1.0)
|
|
|
|
---
|
|
|
|
<p align="center">
|
|
<strong>Made with 🦞 and too much coffee by <a href="https://github.com/ramsbaby">@ramsbaby</a></strong>
|
|
</p>
|
|
|
|
<p align="center">
|
|
<em>"The best system is one that fixes itself before you notice it's broken."</em>
|
|
</p>
|