AI Newsletter Digest improvements: fixed QP soft line break decoding, URL extraction, and content cleaning
99
skills/openclaw-self-healing/references/LINUX_SETUP.md
Normal file
@@ -0,0 +1,99 @@
# Linux Setup Guide (systemd)

> ⚠️ **Work in Progress** - This is a community contribution template. Full Linux support is on the roadmap.

## Overview

This guide provides systemd equivalents for the macOS LaunchAgent-based self-healing system.

## Prerequisites

- Linux (Ubuntu 20.04+, Debian 11+, or similar)
- systemd
- OpenClaw Gateway installed
- tmux (`apt install tmux`)
- Claude CLI (`npm install -g @anthropic-ai/claude-code`)

## Level 1: Watchdog (systemd)

Create `/etc/systemd/system/openclaw-gateway.service`:

```ini
[Unit]
Description=OpenClaw Gateway
After=network.target

[Service]
Type=simple
User=YOUR_USER
WorkingDirectory=/home/YOUR_USER
ExecStart=/usr/local/bin/openclaw gateway start
Restart=always
RestartSec=180

[Install]
WantedBy=multi-user.target
```

Enable and start:
```bash
sudo systemctl enable openclaw-gateway
sudo systemctl start openclaw-gateway
```

## Level 2: Health Check (systemd timer)

Create `/etc/systemd/system/openclaw-healthcheck.service`:

```ini
[Unit]
Description=OpenClaw Health Check

[Service]
Type=oneshot
User=YOUR_USER
ExecStart=/home/YOUR_USER/openclaw/scripts/gateway-healthcheck.sh
```

Create `/etc/systemd/system/openclaw-healthcheck.timer`:

```ini
[Unit]
Description=Run OpenClaw Health Check every 5 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```

Enable:
```bash
sudo systemctl enable openclaw-healthcheck.timer
sudo systemctl start openclaw-healthcheck.timer
```

## Level 3 & 4

Scripts work the same on Linux. Update paths in `.env`:

```bash
OPENCLAW_DIR=/home/YOUR_USER/openclaw
LOG_DIR=/home/YOUR_USER/openclaw/memory
```

## Script Modifications

Replace macOS-specific commands:

| macOS | Linux |
|-------|-------|
| `launchctl` | `systemctl` |
| `~/Library/LaunchAgents/` | `/etc/systemd/system/` |
| `open` | `xdg-open` |
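
One way to apply the substitutions in the table without forking every script is to select the command at runtime from the kernel name. The sketch below is only an illustration: `service_manager_for` and `opener_for` are hypothetical helper names, not functions that ship with OpenClaw.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of the OpenClaw scripts.

# Map a kernel name (as printed by `uname -s`) to the service manager CLI.
service_manager_for() {
  case "$1" in
    Darwin) echo "launchctl" ;;
    Linux)  echo "systemctl" ;;
    *)      echo "unsupported"; return 1 ;;
  esac
}

# Map a kernel name to the command that opens a file or URL.
opener_for() {
  case "$1" in
    Darwin) echo "open" ;;
    Linux)  echo "xdg-open" ;;
    *)      echo "unsupported"; return 1 ;;
  esac
}
```

A script could then call `manager="$(service_manager_for "$(uname -s)")"` once near the top and branch on the result, since `launchctl` and `systemctl` take different subcommands and arguments.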

## Contributing

Help us improve Linux support! See [CONTRIBUTING.md](/CONTRIBUTING.md).
@@ -0,0 +1,584 @@
# Troubleshooting Guide

> **Common issues and solutions for the OpenClaw Self-Healing System**

---

## 🔍 Diagnostic Commands

Before diving into specific issues, run these diagnostic commands:

```bash
# 1. Check LaunchAgent status
launchctl list | grep openclaw

# 2. Check Health Check logs
tail -50 ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log

# 3. Check Emergency Recovery logs
ls -lt ~/openclaw/memory/emergency-recovery-*.log | head -5

# 4. Check Gateway status
openclaw status

# 5. Check cron jobs
openclaw cron list | grep -i "emergency\|health"

# 6. Check script permissions
ls -lh ~/openclaw/scripts/*.sh
```

---

## 🚨 Level 1: Watchdog Issues

### Issue: Watchdog not restarting Gateway

**Symptoms:**
- Gateway crashes but doesn't restart
- No automatic recovery after 3 minutes

**Diagnosis:**
```bash
# Check if Watchdog LaunchAgent is loaded
launchctl list | grep openclaw.watchdog

# Expected output:
# - 0 ai.openclaw.watchdog
```

**Solution 1: Watchdog not loaded**
```bash
# Check if plist exists
ls ~/Library/LaunchAgents/ai.openclaw.watchdog.plist

# If missing, reinstall OpenClaw:
npm install -g openclaw
openclaw onboard --install-daemon
```

**Solution 2: Watchdog disabled**
```bash
# Reload Watchdog
launchctl unload ~/Library/LaunchAgents/ai.openclaw.watchdog.plist
launchctl load ~/Library/LaunchAgents/ai.openclaw.watchdog.plist
```

---

## 🏥 Level 2: Health Check Issues

### Issue: Health Check not running

**Symptoms:**
- No `healthcheck-*.log` files in `~/openclaw/memory/`
- LaunchAgent listed but no activity

**Diagnosis:**
```bash
# Check LaunchAgent status
launchctl list | grep openclaw.healthcheck

# Check LaunchAgent logs
tail -f ~/Library/Logs/com.openclaw.healthcheck.log
```

**Solution 1: LaunchAgent not loaded**
```bash
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
```

**Solution 2: Script path wrong**
```bash
# Check plist file
grep -A 2 ProgramArguments ~/Library/LaunchAgents/com.openclaw.healthcheck.plist

# Should point to: ~/openclaw/scripts/gateway-healthcheck.sh
# If wrong, edit plist:
nano ~/Library/LaunchAgents/com.openclaw.healthcheck.plist

# Reload after edit:
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
```

**Solution 3: Script not executable**
```bash
chmod +x ~/openclaw/scripts/gateway-healthcheck.sh
```

**Solution 4: Run manually to test**
```bash
bash ~/openclaw/scripts/gateway-healthcheck.sh

# Check for errors in output
```

---

### Issue: Health Check false positives

**Symptoms:**
- Health Check reports failure but Gateway is running fine
- Unnecessary restarts

**Diagnosis:**
```bash
# Check Gateway URL
curl -I http://localhost:18789/

# Check environment variable
source ~/.openclaw/.env
echo "$OPENCLAW_GATEWAY_URL"
```

**Solution: Wrong Gateway URL**
```bash
# Edit .env
nano ~/.openclaw/.env

# Set correct URL:
OPENCLAW_GATEWAY_URL="http://localhost:18789/"

# Reload LaunchAgent
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
```

---

### Issue: Health Check restarts too aggressively

**Symptoms:**
- Gateway restarts multiple times per hour
- Unstable system

**Diagnosis:**
```bash
# Check retry settings
source ~/.openclaw/.env
echo "Max retries: ${HEALTH_CHECK_MAX_RETRIES:-3}"
echo "Retry delay: ${HEALTH_CHECK_RETRY_DELAY:-30}s"
```

**Solution: Increase thresholds**
```bash
# Edit .env
nano ~/.openclaw/.env

# Add/modify:
HEALTH_CHECK_MAX_RETRIES=5
HEALTH_CHECK_RETRY_DELAY=60
HEALTH_CHECK_ESCALATION_WAIT=600

# Reload LaunchAgent
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
```

---

## 🧠 Level 3: Claude Recovery Issues

### Issue: Claude CLI not found

**Symptoms:**
- Emergency Recovery logs show: `❌ Missing dependencies: claude`
- Level 3 skips to Level 4

**Diagnosis:**
```bash
# Check Claude installation
which claude
claude --version
```

**Solution: Install Claude CLI**
```bash
npm install -g @anthropic-ai/claude-code

# Verify
claude --version
```

---

### Issue: Claude session fails to start

**Symptoms:**
- Emergency Recovery logs show: `Starting Claude Code session...`
- Then: `⚠️ Claude workspace trust prompt not detected`

**Diagnosis:**
```bash
# Check tmux
which tmux
tmux -V

# Test Claude manually
claude
# Does it prompt "trust this workspace"?
```

**Solution 1: tmux not installed**
```bash
brew install tmux
```

**Solution 2: Claude workspace already trusted**
```bash
# This is actually OK; the script proceeds anyway
# Check recovery logs for the actual failure reason
tail -50 ~/openclaw/memory/emergency-recovery-*.log
```

---

### Issue: Claude API quota exceeded

**Symptoms:**
- Emergency Recovery logs show: `⚠️ Claude API rate limited or quota exceeded`
- Level 3 fails immediately

**Diagnosis:**
```bash
# Check Claude usage
claude
# Type: /usage
# Check remaining quota
```

**Solution: Wait for quota reset**
```bash
# Claude API resets every 5 hours
# Check exact reset time in Claude CLI: /usage

# Meanwhile, system escalates to Level 4 (human alert)
```

**Workaround: Increase timeout for next attempt**
```bash
# Edit .env
nano ~/.openclaw/.env

# Increase timeout:
EMERGENCY_RECOVERY_TIMEOUT=3600 # 1 hour instead of 30 min
```

---

### Issue: Claude recovery times out

**Symptoms:**
- Emergency Recovery runs for 30 minutes
- Gateway still unhealthy
- No clear failure reason in logs

**Diagnosis:**
```bash
# Check Claude session log
tail -200 ~/openclaw/memory/claude-session-*.log

# Look for:
# - Errors executing commands
# - Stuck waiting for input
# - Network issues
```

**Solution 1: Increase timeout**
```bash
# Edit .env
nano ~/.openclaw/.env

# Increase timeout:
EMERGENCY_RECOVERY_TIMEOUT=3600 # 1 hour
```

**Solution 2: Check manual recovery**
```bash
# What would you do manually?
openclaw status
tail -100 ~/.openclaw/logs/gateway.log

# Apply the fix yourself, then analyze why Claude couldn't
```

---

## 🚨 Level 4: Discord Notification Issues

### Issue: No Discord notifications

**Symptoms:**
- Level 4 should trigger but no messages appear in Discord
- Emergency Recovery Monitor cron runs but stays silent

**Diagnosis:**
```bash
# Check webhook URL
source ~/.openclaw/.env
echo "$DISCORD_WEBHOOK_URL"

# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "Test notification"}'
```

**Solution 1: Webhook URL not set**
```bash
# Edit .env
nano ~/.openclaw/.env

# Add:
DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/YOUR_ID/YOUR_TOKEN"
```

**Solution 2: Webhook URL invalid**
```bash
# Get new webhook from Discord:
# Server Settings > Integrations > Webhooks > New Webhook

# Copy URL and update .env
nano ~/.openclaw/.env
```

**Solution 3: Network issues**
```bash
# Test internet connectivity
ping -c 3 discord.com

# Test DNS resolution
nslookup discord.com

# If behind proxy, check proxy settings
```

---

### Issue: Duplicate Discord notifications

**Symptoms:**
- Same alert sent multiple times
- Alert flood in Discord channel

**Diagnosis:**
```bash
# Check alert tracking file
cat ~/openclaw/memory/.emergency-alert-sent

# Check Monitor cron frequency
openclaw cron list | grep "Emergency Recovery"
```

**Solution: Alert file corrupted**
```bash
# Remove alert tracking file
rm ~/openclaw/memory/.emergency-alert-sent

# Next alert will reset tracking
```

---

## 🔧 General Issues

### Issue: Logs filling up disk

**Symptoms:**
- `~/openclaw/memory/` grows to several GB
- Old logs not deleted

**Diagnosis:**
```bash
# Check disk usage
du -sh ~/openclaw/memory/

# Count log files
ls ~/openclaw/memory/*.log | wc -l
```

**Solution: Manual cleanup**
```bash
# Delete logs older than 14 days
find ~/openclaw/memory -name "healthcheck-*.log" -mtime +14 -delete
find ~/openclaw/memory -name "emergency-recovery-*.log" -mtime +14 -delete
find ~/openclaw/memory -name "claude-session-*.log" -mtime +14 -delete
```

**Prevention: Add cleanup cron**
```bash
# The \( ... \) group is required: without it, -mtime and -delete
# would bind only to the last -name test.
openclaw cron add \
  --name "Log Rotation (Self-Healing)" \
  --schedule '0 3 * * *' \
  --command 'find ~/openclaw/memory \( -name "*healthcheck*.log" -o -name "*emergency-recovery*.log" -o -name "*claude-session*.log" \) -mtime +14 -delete' \
  --session isolated
```
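
Before scheduling a job that deletes files, it can help to preview what the expression matches. The following is a minimal dry-run sketch, assuming the three log name patterns used in the manual cleanup above; `list_old_logs` is a hypothetical helper name, and it prints paths instead of deleting them.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper; prints what a cleanup would delete.

# List self-healing logs in a directory that are older than N days.
# The \( ... \) group makes -mtime apply to all three name patterns.
list_old_logs() {
  local dir="$1" days="${2:-14}"
  find "$dir" -type f \
    \( -name "healthcheck-*.log" \
       -o -name "emergency-recovery-*.log" \
       -o -name "claude-session-*.log" \) \
    -mtime +"$days" -print
}
```

Running `list_old_logs ~/openclaw/memory 14` and reviewing the output is a cheap check before swapping `-print` for `-delete`.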

---

### Issue: Scripts fail with "Permission denied"

**Symptoms:**
- LaunchAgent logs show: `Permission denied: gateway-healthcheck.sh`

**Solution:**
```bash
chmod +x ~/openclaw/scripts/*.sh
```

---

### Issue: Environment variables not loading

**Symptoms:**
- Scripts use default values instead of custom `.env` settings

**Diagnosis:**
```bash
# Check .env exists
ls -lh ~/.openclaw/.env

# Check .env syntax
grep -v '^#' ~/.openclaw/.env | grep '='
```

**Solution: Fix .env syntax**
```bash
# Edit .env
nano ~/.openclaw/.env

# Correct format:
#   KEY="value"    ✅
#   KEY='value'    ✅
#   KEY=value      ✅
#
#   KEY = "value"  ❌ (spaces around =)
#   KEY="value     ❌ (missing closing quote)

# Reload LaunchAgent after fixing
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
```
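
The format rules above can also be checked mechanically. This is a rough sketch rather than a full shell parser: `lint_env_file` is a hypothetical helper that accepts blank lines, comments, and `KEY=value` assignments (optionally quoted, optionally followed by a `# comment`), and reports everything else with its line number. It will also flag lines it does not model, such as `export KEY=value`.

```bash
#!/usr/bin/env bash
# Hypothetical lint helper; not part of the OpenClaw scripts.

# Print "LINE_NUMBER: text" for each line that is neither blank, a comment,
# nor a KEY=value assignment without spaces around '='.
# Returns 0 when the file is clean and 1 when any line was reported.
lint_env_file() {
  local file="$1" line n=0 bad=0
  local skip_re='^[[:space:]]*(#.*)?$'
  local assign_re="^[A-Za-z_][A-Za-z0-9_]*=(\"[^\"]*\"|'[^']*'|[^[:space:]\"']*)([[:space:]]+#.*)?\$"
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    if [[ "$line" =~ $skip_re || "$line" =~ $assign_re ]]; then
      continue
    fi
    printf '%s: %s\n' "$n" "$line"
    bad=1
  done < "$file"
  return "$bad"
}
```

For example, `lint_env_file ~/.openclaw/.env` prints nothing for a clean file and lists the offending lines otherwise.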

---

## 🧪 Testing & Validation

### Force trigger each level

#### Test Level 1: Watchdog
```bash
kill -9 $(pgrep -f openclaw-gateway)
sleep 180
curl http://localhost:18789/
```

#### Test Level 2: Health Check
```bash
# Stop Gateway
openclaw gateway stop

# Wait for Health Check (5 min max)
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log

# Should see restart attempts
```

#### Test Level 3: Claude Recovery
```bash
# Inject config error (backup first!)
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak

# Break config (e.g., change port to invalid value)
# Then restart Gateway and wait ~8 min

# Watch Level 3 trigger
tail -f ~/openclaw/memory/emergency-recovery-*.log
```

#### Test Level 4: Discord Alert
```bash
# Simulate Level 3 failure
cat > ~/openclaw/memory/emergency-recovery-test-$(date +%Y-%m-%d-%H%M).log << 'EOF'
[2026-02-06 20:00:00] === Emergency Recovery Started ===
[2026-02-06 20:30:00] Gateway still unhealthy (HTTP 500)

=== MANUAL INTERVENTION REQUIRED ===
Level 1 (Watchdog) ❌
Level 2 (Health Check) ❌
Level 3 (Claude Recovery) ❌
EOF

# Run monitor
bash ~/openclaw/scripts/emergency-recovery-monitor.sh

# Check Discord for alert
```

---

## 📚 Advanced Troubleshooting

### Enable debug logging

Add to scripts (temporary):

```bash
# In gateway-healthcheck.sh
set -x # Enable bash debug mode

# View verbose output
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log
```

### Check macOS system logs

```bash
# Filter for openclaw-related errors
log show --predicate 'process == "launchd" AND eventMessage CONTAINS "openclaw"' --last 1h

# Check LaunchAgent errors
log show --predicate 'subsystem == "com.apple.launchd"' --last 1h
```

### Verify Gateway port

```bash
# Check what's listening on 18789
lsof -i :18789

# Expected: openclaw-gateway process
```

### Check for port conflicts

```bash
# Find processes using common ports
lsof -i :18789
lsof -i :8080
lsof -i :3000

# If conflict, change Gateway port in ~/.openclaw/openclaw.json
```

---

## 🆘 Still Stuck?

### Get help from the community

1. **GitHub Issues:** [github.com/ramsbaby/openclaw-self-healing/issues](https://github.com/ramsbaby/openclaw-self-healing/issues)
2. **OpenClaw Discord:** [discord.com/invite/clawd](https://discord.com/invite/clawd)
3. **Include in your report:**
   - macOS version: `sw_vers`
   - OpenClaw version: `openclaw version`
   - Self-Healing logs: Last 50 lines of `healthcheck-*.log` and `emergency-recovery-*.log`
   - Script versions: `head -5 ~/openclaw/scripts/*.sh`

---

<p align="center">
<strong>Most issues are config or permissions related.</strong><br>
When in doubt, check <code>.env</code> and re-run <code>chmod +x</code>.
</p>
414
skills/openclaw-self-healing/references/self-healing-system.md
Normal file
@@ -0,0 +1,414 @@
# OpenClaw Self-Healing System

> "When the system cannot heal itself, it calls in an outside doctor" - meta-level self-healing

## Overview

The OpenClaw Gateway uses a four-level self-healing system that attempts automatic recovery when a failure occurs.

**Design philosophy:**
- Level 1-2: fast automatic recovery (seconds)
- Level 3: intelligent diagnosis and recovery (minutes)
- Level 4: request for human intervention (alert)

---

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│ Level 1: Watchdog (every 180 s)                         │
│ ├─ LaunchAgent: ai.openclaw.watchdog                    │
│ └─ Check that the process exists → restart              │
└─────────────────────────────────────────────────────────┘
                 ↓ (process is alive but unresponsive)
┌─────────────────────────────────────────────────────────┐
│ Level 2: Health Check (every 300 s)                     │
│ ├─ Script: gateway-healthcheck.sh                       │
│ ├─ LaunchAgent: com.openclaw.healthcheck                │
│ ├─ Verify HTTP 200 response                             │
│ ├─ On failure, retry 3 times (30 s apart)               │
│ └─ Still failing → Level 3 escalation                   │
└─────────────────────────────────────────────────────────┘
                 ↓ (recovery fails for 5 minutes)
┌─────────────────────────────────────────────────────────┐
│ Level 3: Claude Emergency Recovery (30 min timeout)     │
│ ├─ Script: emergency-recovery.sh                        │
│ ├─ Start a Claude Code PTY session in tmux              │
│ ├─ Automatic diagnosis:                                 │
│ │   - openclaw status                                   │
│ │   - Log analysis (~/.openclaw/logs/*.log)             │
│ │   - Config validation (openclaw.json)                 │
│ │   - Port conflict check (lsof -i :18789)              │
│ │   - Dependency check (npm list, node --version)       │
│ ├─ Recovery attempt (fix config, restart process)       │
│ ├─ Generate recovery report:                            │
│ │   - memory/emergency-recovery-report-*.md             │
│ │   - memory/claude-session-*.log                       │
│ └─ Success/failure verdict (HTTP 200 check)             │
└─────────────────────────────────────────────────────────┘
                 ↓ (Claude recovery also fails)
┌─────────────────────────────────────────────────────────┐
│ Level 4: Discord Notification (monitored every 300 s)   │
│ ├─ Script: emergency-recovery-monitor.sh                │
│ ├─ Cron: eddd4e18-b995-4420-8465-7c6927280228           │
│ ├─ Watch emergency-recovery logs from the last 30 min   │
│ ├─ Search for "MANUAL INTERVENTION REQUIRED"            │
│ └─ Send alert to the #jarvis-health channel             │
└─────────────────────────────────────────────────────────┘
```

---

## Components

### Level 1: Watchdog

**Files:**
- `~/Library/LaunchAgents/ai.openclaw.watchdog.plist`

**Behavior:**
- Checks every 180 seconds that the OpenClaw process exists
- Restarts it automatically if the process is missing

**Limitation:**
- Cannot detect a process that is alive but not answering HTTP requests

---

### Level 2: Health Check

**Files:**
- `~/openclaw/scripts/gateway-healthcheck.sh`
- `~/Library/LaunchAgents/com.openclaw.healthcheck.plist`

**Behavior:**
1. HTTP GET `http://localhost:18789/` → check for 200
2. On failure, restart (wait 30 seconds)
3. Retry 3 times
4. Still failing → wait 5 minutes
5. Still failing after 5 minutes → trigger Level 3
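
The check-restart-retry part of this flow (steps 1 to 3) could look roughly like the sketch below. It is an illustration under assumptions, not the shipped `gateway-healthcheck.sh`: the function names are hypothetical, the environment variable names and defaults are taken from the troubleshooting guide, and the 5-minute escalation wait and the Level 3 trigger are left out.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the real gateway-healthcheck.sh may differ.

GATEWAY_URL="${OPENCLAW_GATEWAY_URL:-http://localhost:18789/}"
MAX_RETRIES="${HEALTH_CHECK_MAX_RETRIES:-3}"
RETRY_DELAY="${HEALTH_CHECK_RETRY_DELAY:-30}"

# Print the HTTP status code for a URL ("000" when the connection fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# Hypothetical hook; here it wraps the documented CLI command.
restart_gateway() {
  openclaw gateway restart
}

# Return 0 as soon as the gateway answers 200; otherwise restart, wait,
# and try again, up to MAX_RETRIES restarts.
check_with_retries() {
  local attempt
  for ((attempt = 1; attempt <= MAX_RETRIES; attempt++)); do
    if [ "$(http_status "$GATEWAY_URL")" = "200" ]; then
      return 0
    fi
    restart_gateway || true
    sleep "$RETRY_DELAY"
  done
  # Final verdict after the last restart.
  [ "$(http_status "$GATEWAY_URL")" = "200" ]
}
```

A caller would treat a non-zero return from `check_with_retries` as the signal to start the 5-minute wait that precedes Level 3.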

**Logs:**
- `~/openclaw/memory/healthcheck-YYYY-MM-DD.log`

**Install:**
```bash
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
```

**Uninstall:**
```bash
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
```

---

### Level 3: Claude Emergency Recovery

**Files:**
- `~/openclaw/scripts/emergency-recovery.sh`

**Behavior:**
1. Create a tmux session: `emergency_recovery_TIMESTAMP`
2. Launch Claude Code (`claude`)
3. Trust the workspace (automatic Enter)
4. Send the emergency recovery prompt (shown here in English translation):
   ```
   The OpenClaw gateway was restarted for 5 minutes but did not recover.
   Begin emergency diagnosis and recovery.

   Steps:
   1. Check openclaw status
   2. Analyze logs (~/.openclaw/logs/*.log)
   3. Validate config (~/.openclaw/openclaw.json)
   4. Check for port conflicts (lsof -i :18789)
   5. Check dependencies (npm list, node --version)
   6. Attempt recovery (fix config, restart process)
   7. Record the result in memory/emergency-recovery-report-*.md
   ```
5. Wait 30 minutes
6. Verify the recovery result (HTTP 200 check)
7. Capture and close the tmux session

**Output files:**
- `~/openclaw/memory/emergency-recovery-TIMESTAMP.log` (execution log)
- `~/openclaw/memory/claude-session-TIMESTAMP.log` (Claude session capture)
- `~/openclaw/memory/emergency-recovery-report-TIMESTAMP.md` (generated by Claude, optional)

**Manual run:**
```bash
~/openclaw/scripts/emergency-recovery.sh
```

---

### Level 4: Discord Notification

**Files:**
- `~/openclaw/scripts/emergency-recovery-monitor.sh`

**Behavior:**
1. Find `emergency-recovery-*.log` files from the last 30 minutes
2. Search them for the pattern "MANUAL INTERVENTION REQUIRED"
3. If found, send an alert to the #jarvis-health channel
4. Suppress duplicate alerts (`.emergency-alert-sent` file)
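
The scan-and-deduplicate logic in steps 1, 2, and 4 can be sketched as follows. This is an illustration, not the shipped `emergency-recovery-monitor.sh`: `find_new_failures` is a hypothetical name, and the format of `.emergency-alert-sent` (one already-reported log path per line) is an assumption.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the real emergency-recovery-monitor.sh may differ.
# Assumption: the tracking file holds one already-reported log path per line.

MARKER="MANUAL INTERVENTION REQUIRED"

# Print recent emergency-recovery logs that contain the marker and have not
# been reported yet, and record each one in the tracking file.
find_new_failures() {
  local dir="$1" minutes="${2:-30}"
  local sent="$dir/.emergency-alert-sent" log
  touch "$sent"
  find "$dir" -type f -name "emergency-recovery-*.log" -mmin -"$minutes" -print |
    sort |
    while IFS= read -r log; do
      grep -q "$MARKER" "$log" || continue   # no failure marker in this log
      grep -qxF "$log" "$sent" && continue   # already alerted for this log
      printf '%s\n' "$log" >> "$sent"
      printf '%s\n' "$log"
    done
}
```

Each path printed by `find_new_failures ~/openclaw/memory` would then be turned into one Discord alert, so a second run over the same logs stays silent.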

**Cron settings:**
- **ID:** `eddd4e18-b995-4420-8465-7c6927280228`
- **Interval:** 5 minutes (`everyMs: 300000`)
- **Session:** isolated
- **Model:** claude-haiku-4-5
- **Channel:** Discord #jarvis-health (1468429321738911947)

**Alert format** (shown here in English translation):
```
🚨 Urgent: OpenClaw self-healing failed

Time: YYYY-MM-DD-HHMM
Status:
- Level 1 (Watchdog) ❌
- Level 2 (Health Check) ❌
- Level 3 (Claude Recovery) ❌

Manual intervention is required.

Logs:
- ~/openclaw/memory/emergency-recovery-*.log
- ~/openclaw/memory/claude-session-*.log
- ~/openclaw/memory/emergency-recovery-report-*.md (generated by Claude)
```

---

## Test Scenarios

### 1. Level 1 Test (Watchdog)

**Scenario:** Force-kill the process

```bash
# Find the Gateway PID
ps aux | grep openclaw-gateway | grep -v grep

# Force-kill it
kill -9 <PID>

# Confirm automatic restart within 3 minutes
sleep 180
curl http://localhost:18789/
```

**Expected result:**
- Watchdog restarts the process within 180 seconds
- HTTP 200 responses resume

---

### 2. Level 2 Test (Health Check)

**Scenario:** HTTP response failure (blocked port)

```bash
# Block the port (firewall rule or proxy setting),
# or set a wrong port in openclaw.json

# Monitor the Health Check log
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log
```

**Expected result:**
- Health Check detects the HTTP failure
- Retries 3 times (30 seconds apart)
- Triggers Level 3 if still failing after 5 minutes

---

### 3. Level 3 Test (Claude Recovery)

**Scenario:** Inject a configuration error

```bash
# Back up openclaw.json
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak

# Inject a deliberate error (e.g., a wrong port)
# (manual edit required)

# Restart the Gateway
openclaw gateway restart

# Wait for Emergency Recovery to trigger (up to 8 minutes)
# - Health Check detection: ~5 min
# - Level 3 start: +30 min

# Monitor the log
tail -f ~/openclaw/memory/emergency-recovery-*.log
```

**Expected result:**
- Claude detects the configuration error
- Claude attempts to fix the configuration
- A recovery report is generated
- HTTP 200 is restored, or a failure report is produced

---

### 4. Level 4 Test (Discord Notification)

**Scenario:** Simulate a Level 3 failure

```bash
# Manually create a Level 3 failure log
cat > ~/openclaw/memory/emergency-recovery-test-$(date +%Y-%m-%d-%H%M).log << 'EOF'
[2026-02-05 20:00:00] === Emergency Recovery Started ===
[2026-02-05 20:30:00] Gateway still unhealthy after Claude recovery (HTTP 500)

=== MANUAL INTERVENTION REQUIRED ===
Level 1 (Watchdog) ❌
Level 2 (Health Check) ❌
Level 3 (Claude Recovery) ❌
EOF

# Run the monitor script (or wait for the cron)
~/openclaw/scripts/emergency-recovery-monitor.sh
```

**Expected result:**
- An alert is sent to Discord #jarvis-health
- A duplicate-alert suppression record is created

---

## Operations Guide

### Status check

```bash
# LaunchAgent status
launchctl list | grep openclaw

# Health Check log
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log

# Emergency Recovery logs
ls -lt ~/openclaw/memory/emergency-recovery-*.log | head -5

# Cron status
openclaw cron list | grep "Emergency Recovery"
```

---

### Manual recovery

Manual recovery procedure when Level 3 fails:

```bash
# 1. Check the logs
tail -100 ~/.openclaw/logs/gateway.log
tail -100 ~/.openclaw/logs/gateway.err.log

# 2. Validate the configuration
openclaw doctor --non-interactive

# 3. Check for port conflicts
lsof -i :18789

# 4. Check dependencies
node --version
npm list -g openclaw

# 5. Fully restart the Gateway
openclaw gateway stop
sleep 5
openclaw gateway start

# 6. Confirm recovery
curl -i http://localhost:18789/
```

---

### Disabling

During system maintenance or debugging:

```bash
# Disable the Health Check
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist

# Disable the Emergency Recovery Monitor cron
openclaw cron disable eddd4e18-b995-4420-8465-7c6927280228

# Re-enable
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
openclaw cron enable eddd4e18-b995-4420-8465-7c6927280228
```

---

## Monitoring Metrics

**Tracked metrics:**

| Metric | Source | Target |
|------|----------|------|
| Health Check success rate | healthcheck-*.log | > 99% |
| Level 1 recoveries | watchdog.log | < 1/day |
| Level 2 recoveries | healthcheck-*.log | < 1/week |
| Level 3 triggers | emergency-recovery-*.log | 0/month |
| Level 4 alerts | Discord #jarvis-health | 0/month |
| Mean time to recovery | healthcheck-*.log | < 5 min (Level 1-2) |

**Weekly review (audit cron, Sundays at 23:30):**
- Analyze Health Check logs
- Review Level 3 trigger history
- Identify recurring patterns
- Propose system improvements

---

## Limitations

1. **Claude Code dependency**
   - Level 3 requires the Claude CLI to be installed
   - Level 3 may fail when the Claude API quota is exhausted

2. **tmux dependency**
   - The PTY session requires tmux
   - Level 3 cannot run if tmux is not installed

3. **Network failures**
   - Level 3 fails if the Claude API is unreachable
   - Level 4 alerts fail if the Discord API is unreachable

4. **macOS only**
   - LaunchAgent is macOS-specific
   - Linux requires conversion to systemd

---

## Expansion Plan

**Phase 2 (future):**
- [ ] Automatic GitHub Issues creation (when Level 4 fails)
- [ ] Telegram alerts (redundancy)
- [ ] Prometheus metrics collection
- [ ] Grafana dashboard
- [ ] Multi-node support (cluster environments)

---

## References

- [OpenClaw Docs](https://docs.openclaw.ai)
- [Moltbook: Nightly Build Pattern](https://moltbook.com) (inspiration for Level 3)
- [Moltbook: Reliability Check](https://moltbook.com) (inspiration for Health Check)
- [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code)

---

**Created:** 2026-02-05
**Last updated:** 2026-02-05
**Author:** Jarvis (Self-Healing System Implementation)