Troubleshooting Guide

Common issues and solutions for the OpenClaw Self-Healing System


🔍 Diagnostic Commands

Before diving into specific issues, run these diagnostic commands:

# 1. Check LaunchAgent status
launchctl list | grep openclaw

# 2. Check Health Check logs
tail -50 ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log

# 3. Check Emergency Recovery logs
ls -lt ~/openclaw/memory/emergency-recovery-*.log | head -5

# 4. Check Gateway status
openclaw status

# 5. Check cron jobs
openclaw cron list | grep -i "emergency\|health"

# 6. Check script permissions
ls -lh ~/openclaw/scripts/*.sh

🚨 Level 1: Watchdog Issues

Issue: Watchdog not restarting Gateway

Symptoms:

  • Gateway crashes but doesn't restart
  • No automatic recovery after 3 minutes

Diagnosis:

# Check if Watchdog LaunchAgent is loaded
launchctl list | grep openclaw.watchdog

# Expected output:
# -    0    ai.openclaw.watchdog
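The three columns printed by `launchctl list` are the PID, the last exit status, and the job label. A `-` in the PID column means the job is loaded but not running at this moment, which is normal for an agent that runs on an interval. The sketch below turns one such line into a readable summary; `describe_agent_line` is a hypothetical helper, not part of OpenClaw.

```shell
# describe_agent_line is a hypothetical helper for illustration only.
# It interprets one line of `launchctl list` output: PID, last exit status, label.
describe_agent_line() {
  # Split the line on whitespace into its three columns.
  set -- $1
  pid=$1 last_exit=$2 label=$3
  if [ -z "$label" ]; then
    # An empty grep result means launchd does not know the job.
    echo "not loaded"
  elif [ "$pid" != "-" ]; then
    echo "$label: running (pid $pid)"
  elif [ "$last_exit" = "0" ]; then
    echo "$label: loaded, idle, last exit clean"
  else
    echo "$label: loaded, not running, last exit status $last_exit"
  fi
}
```

Usage: `describe_agent_line "$(launchctl list | grep openclaw.watchdog)"`. A non-zero last exit status usually points at a failing script rather than a missing plist.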

Solution 1: Watchdog not loaded

# Check if plist exists
ls ~/Library/LaunchAgents/ai.openclaw.watchdog.plist

# If missing, reinstall OpenClaw:
npm install -g openclaw
openclaw onboard --install-daemon

Solution 2: Watchdog disabled

# Reload Watchdog
launchctl unload ~/Library/LaunchAgents/ai.openclaw.watchdog.plist
launchctl load ~/Library/LaunchAgents/ai.openclaw.watchdog.plist

🏥 Level 2: Health Check Issues

Issue: Health Check not running

Symptoms:

  • No healthcheck-*.log files in ~/openclaw/memory/
  • LaunchAgent listed but no activity

Diagnosis:

# Check LaunchAgent status
launchctl list | grep openclaw.healthcheck

# Check LaunchAgent logs
tail -f ~/Library/Logs/com.openclaw.healthcheck.log

Solution 1: LaunchAgent not loaded

launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist

Solution 2: Script path wrong

# Check plist file
grep -A 2 ProgramArguments ~/Library/LaunchAgents/com.openclaw.healthcheck.plist

# Should point to: ~/openclaw/scripts/gateway-healthcheck.sh
# If wrong, edit plist:
nano ~/Library/LaunchAgents/com.openclaw.healthcheck.plist

# Reload after edit:
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist

Solution 3: Script not executable

chmod +x ~/openclaw/scripts/gateway-healthcheck.sh

Solution 4: Run manually to test

bash ~/openclaw/scripts/gateway-healthcheck.sh

# Check for errors in output

Issue: Health Check false positives

Symptoms:

  • Health Check reports failure but Gateway is running fine
  • Unnecessary restarts

Diagnosis:

# Check Gateway URL
curl -I http://localhost:18789/

# Check environment variable
source ~/.openclaw/.env
echo $OPENCLAW_GATEWAY_URL

Solution: Wrong Gateway URL

# Edit .env
nano ~/.openclaw/.env

# Set correct URL:
OPENCLAW_GATEWAY_URL="http://localhost:18789/"

# Reload LaunchAgent
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
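To confirm that the configured URL answers the way a health check would expect, it can help to look at the HTTP status code alone. The sketch below is an illustration only: `is_healthy_code` and `probe_gateway` are hypothetical helpers, and treating any 2xx or 3xx response as healthy is an assumption, not necessarily the rule used by gateway-healthcheck.sh.

```shell
# is_healthy_code and probe_gateway are hypothetical helpers for illustration.
# Assumption: any 2xx or 3xx status counts as "Gateway is up".
is_healthy_code() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

probe_gateway() {
  url="${1:-http://localhost:18789/}"
  # -s: silent, -o /dev/null: discard the body, -m 5: give up after 5 seconds,
  # -w: print only the status code (curl prints 000 when it cannot connect).
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")
  if is_healthy_code "$code"; then
    echo "healthy ($code) at $url"
  else
    echo "unhealthy ($code) at $url"
    return 1
  fi
}
```

Usage: `probe_gateway "$OPENCLAW_GATEWAY_URL"`. A result of 000 while the Gateway is running suggests the URL or port in .env is wrong.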

Issue: Health Check restarts too aggressively

Symptoms:

  • Gateway restarts multiple times per hour
  • Unstable system

Diagnosis:

# Check retry settings
source ~/.openclaw/.env
echo "Max retries: ${HEALTH_CHECK_MAX_RETRIES:-3}"
echo "Retry delay: ${HEALTH_CHECK_RETRY_DELAY:-30}s"

Solution: Increase thresholds

# Edit .env
nano ~/.openclaw/.env

# Add/modify:
HEALTH_CHECK_MAX_RETRIES=5
HEALTH_CHECK_RETRY_DELAY=60
HEALTH_CHECK_ESCALATION_WAIT=600

# Reload LaunchAgent
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
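To see how a retry count and a retry delay interact, here is a minimal sketch of a retry loop. It assumes the health check retries a failing probe up to a maximum number of times with a fixed pause between attempts; `retry_check` is a hypothetical helper, and the real gateway-healthcheck.sh may be structured differently.

```shell
# retry_check is a hypothetical helper; the real health check script may differ.
# Usage: retry_check MAX DELAY COMMAND [ARGS...]
# Runs COMMAND up to MAX times, sleeping DELAY seconds between failed attempts.
retry_check() {
  max=$1 delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      echo "check passed on attempt $attempt"
      return 0
    fi
    echo "attempt $attempt/$max failed"
    # No need to sleep after the final attempt.
    [ "$attempt" -lt "$max" ] && sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}
```

Usage: `retry_check "${HEALTH_CHECK_MAX_RETRIES:-3}" "${HEALTH_CHECK_RETRY_DELAY:-30}" curl -sf -o /dev/null http://localhost:18789/`. In this sketch, 5 retries with a 60 second delay mean a brief outage has about 4 minutes to clear before the loop gives up.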

🧠 Level 3: Claude Recovery Issues

Issue: Claude CLI not found

Symptoms:

  • Emergency Recovery logs show: ❌ Missing dependencies: claude
  • Level 3 skips to Level 4

Diagnosis:

# Check Claude installation
which claude
claude --version

Solution: Install Claude CLI

npm install -g @anthropic-ai/claude-code

# Verify
claude --version

Issue: Claude session fails to start

Symptoms:

  • Emergency Recovery logs show: Starting Claude Code session...
  • Then: ⚠️ Claude workspace trust prompt not detected

Diagnosis:

# Check tmux
which tmux
tmux -V

# Test Claude manually
claude
# Does it prompt "trust this workspace"?

Solution 1: tmux not installed

brew install tmux

Solution 2: Claude workspace already trusted

# This is actually OK: the script proceeds anyway
# Check recovery logs for actual failure reason
tail -50 ~/openclaw/memory/emergency-recovery-*.log

Issue: Claude API quota exceeded

Symptoms:

  • Emergency Recovery logs show: ⚠️ Claude API rate limited or quota exceeded
  • Level 3 fails immediately

Diagnosis:

# Check Claude usage
claude
# Type: /usage
# Check remaining quota

Solution: Wait for quota reset

# Claude API resets every 5 hours
# Check exact reset time in Claude CLI: /usage

# Meanwhile, system escalates to Level 4 (human alert)

Workaround: Increase timeout for next attempt

# Edit .env
nano ~/.openclaw/.env

# Increase timeout:
EMERGENCY_RECOVERY_TIMEOUT=3600  # 1 hour instead of 30 min

Issue: Claude recovery times out

Symptoms:

  • Emergency Recovery runs for 30 minutes
  • Gateway still unhealthy
  • No clear failure reason in logs

Diagnosis:

# Check Claude session log
tail -200 ~/openclaw/memory/claude-session-*.log

# Look for:
# - Errors executing commands
# - Stuck waiting for input
# - Network issues

Solution 1: Increase timeout

# Edit .env
nano ~/.openclaw/.env

# Increase timeout:
EMERGENCY_RECOVERY_TIMEOUT=3600  # 1 hour

Solution 2: Check manual recovery

# What would you do manually?
openclaw status
tail -100 ~/.openclaw/logs/gateway.log

# Apply the fix yourself, then analyze why Claude couldn't

🚨 Level 4: Discord Notification Issues

Issue: No Discord notifications

Symptoms:

  • Level 4 should trigger but no messages in Discord
  • Emergency Recovery Monitor cron runs but stays silent

Diagnosis:

# Check webhook URL
source ~/.openclaw/.env
echo $DISCORD_WEBHOOK_URL

# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "Test notification"}'
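The manual test above prints nothing on success, so it is easy to miss an empty or malformed URL. A minimal sketch that checks the URL shape first and then prints the HTTP status follows; `looks_like_discord_webhook` and `send_test_alert` are hypothetical helpers, and the accepted URL prefixes are an assumption based on the usual Discord webhook format.

```shell
# looks_like_discord_webhook and send_test_alert are hypothetical helpers.
# Assumption: a webhook URL has the form https://discord.com/api/webhooks/<id>/<token>
# (the older discordapp.com host is accepted as well).
looks_like_discord_webhook() {
  case "$1" in
    https://discord.com/api/webhooks/?*/?*|https://discordapp.com/api/webhooks/?*/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Posts a test message only when the URL has the expected shape,
# then prints the HTTP status code returned by the server.
send_test_alert() {
  if ! looks_like_discord_webhook "$1"; then
    echo "DISCORD_WEBHOOK_URL is missing or malformed"
    return 1
  fi
  curl -s -o /dev/null -w '%{http_code}\n' -X POST "$1" \
    -H "Content-Type: application/json" \
    -d '{"content": "Test notification"}'
}
```

Usage: `send_test_alert "$DISCORD_WEBHOOK_URL"`. A 2xx status (Discord normally answers 204) indicates the message was accepted; 401 or 404 usually means the webhook was deleted or the token is wrong.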

Solution 1: Webhook URL not set

# Edit .env
nano ~/.openclaw/.env

# Add:
DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/YOUR_ID/YOUR_TOKEN"

Solution 2: Webhook URL invalid

# Get new webhook from Discord:
# Server Settings > Integrations > Webhooks > New Webhook

# Copy URL and update .env
nano ~/.openclaw/.env

Solution 3: Network issues

# Test internet connectivity
ping -c 3 discord.com

# Test DNS resolution
nslookup discord.com

# If behind proxy, check proxy settings

Issue: Duplicate Discord notifications

Symptoms:

  • Same alert sent multiple times
  • Alert flood in Discord channel

Diagnosis:

# Check alert tracking file
cat ~/openclaw/memory/.emergency-alert-sent

# Check Monitor cron frequency
openclaw cron list | grep "Emergency Recovery"

Solution: Alert file corrupted

# Remove alert tracking file
rm ~/openclaw/memory/.emergency-alert-sent

# Next alert will reset tracking

🔧 General Issues

Issue: Logs filling up disk

Symptoms:

  • ~/openclaw/memory/ grows to several gigabytes
  • Old logs not deleted

Diagnosis:

# Check disk usage
du -sh ~/openclaw/memory/

# Count log files
ls ~/openclaw/memory/*.log | wc -l

Solution: Manual cleanup

# Delete logs older than 14 days
find ~/openclaw/memory -name "healthcheck-*.log" -mtime +14 -delete
find ~/openclaw/memory -name "emergency-recovery-*.log" -mtime +14 -delete
find ~/openclaw/memory -name "claude-session-*.log" -mtime +14 -delete

Prevention: Add cleanup cron

openclaw cron add \
  --name "Log Rotation (Self-Healing)" \
  --schedule '0 3 * * *' \
  --command 'find ~/openclaw/memory \( -name "*healthcheck*.log" -o -name "*emergency-recovery*.log" -o -name "*claude-session*.log" \) -mtime +14 -delete' \
  --session isolated
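Because `-delete` cannot be undone, it is worth previewing what a cleanup would remove. When several `-name` tests are joined with `-o`, they must be grouped with escaped parentheses; otherwise `-mtime` and the action bind only to the last test. The sketch below lists matching files without deleting anything; `list_old_logs` is a hypothetical helper.

```shell
# list_old_logs is a hypothetical helper: it prints, without deleting, the
# self-healing logs in DIR that were last modified more than DAYS days ago.
# Usage: list_old_logs DIR DAYS
list_old_logs() {
  dir=$1 days=$2
  # The escaped parentheses group the -name tests so that -type, -mtime,
  # and -print apply to all three patterns.
  find "$dir" -type f \( -name "healthcheck-*.log" \
    -o -name "emergency-recovery-*.log" \
    -o -name "claude-session-*.log" \) -mtime +"$days" -print
}
```

Usage: `list_old_logs ~/openclaw/memory 14`. Once the list looks right, the same expression with `-delete` in place of `-print` performs the cleanup.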

Issue: Scripts fail with "Permission denied"

Symptoms:

  • LaunchAgent logs show: Permission denied: gateway-healthcheck.sh

Solution:

chmod +x ~/openclaw/scripts/*.sh

Issue: Environment variables not loading

Symptoms:

  • Scripts use default values instead of custom .env settings

Diagnosis:

# Check .env exists
ls -lh ~/.openclaw/.env

# Check .env syntax
grep -v '^#' ~/.openclaw/.env | grep '='

Solution: Fix .env syntax

# Edit .env
nano ~/.openclaw/.env

# Correct format:
# KEY="value"  ✅
# KEY='value'  ✅
# KEY=value    ✅
#
# KEY = "value"  ❌ (spaces around =)
# KEY="value   ❌ (missing closing quote)

# Reload LaunchAgent after fixing
launchctl unload ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
launchctl load ~/Library/LaunchAgents/com.openclaw.healthcheck.plist
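The two mistakes listed above (spaces around `=` and a missing closing quote) can be caught mechanically. The following is a rough sketch, not a full shell parser: `lint_env` is a hypothetical helper that only looks at the first `=` on each line and at the quote that opens the value.

```shell
# lint_env is a hypothetical helper for illustration; it is not a full parser.
# Usage: lint_env FILE   (returns 1 if any problem was reported)
lint_env() {
  file=$1 bad=0 n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    case "$line" in
      ''|'#'*) continue ;;                      # skip blank lines and comments
      *=*) ;;                                   # looks like an assignment
      *) echo "line $n: no '=' found"; bad=1; continue ;;
    esac
    key=${line%%=*}                             # text before the first =
    key=${key#export }                          # tolerate "export KEY=value"
    value=${line#*=}                            # text after the first =
    case "$key" in
      *' '*) echo "line $n: space before ="; bad=1; continue ;;
    esac
    case "$value" in
      ' '*) echo "line $n: space after ="; bad=1 ;;
      '"'*) rest=${value#?}                     # drop the opening double quote
            case "$rest" in *'"'*) ;; *) echo "line $n: missing closing quote"; bad=1 ;; esac ;;
      "'"*) rest=${value#?}                     # drop the opening single quote
            case "$rest" in *"'"*) ;; *) echo "line $n: missing closing quote"; bad=1 ;; esac ;;
    esac
  done < "$file"
  return "$bad"
}
```

Usage: `lint_env ~/.openclaw/.env`. No output and a zero exit status mean neither mistake was found.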

🧪 Testing & Validation

Force trigger each level

Test Level 1: Watchdog

# Kill the Gateway process (pkill does nothing if no process matches)
pkill -9 -f openclaw-gateway
sleep 180
curl http://localhost:18789/

Test Level 2: Health Check

# Stop Gateway
openclaw gateway stop

# Wait for Health Check (5 min max)
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log

# Should see restart attempts

Test Level 3: Claude Recovery

# Inject config error (backup first!)
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak

# Break config (e.g., change port to invalid value)
# Then restart Gateway and wait ~8 min

# Watch Level 3 trigger
tail -f ~/openclaw/memory/emergency-recovery-*.log

Test Level 4: Discord Alert

# Simulate Level 3 failure
cat > ~/openclaw/memory/emergency-recovery-test-$(date +%Y-%m-%d-%H%M).log << 'EOF'
[2026-02-06 20:00:00] === Emergency Recovery Started ===
[2026-02-06 20:30:00] Gateway still unhealthy (HTTP 500)

=== MANUAL INTERVENTION REQUIRED ===
Level 1 (Watchdog) ❌
Level 2 (Health Check) ❌
Level 3 (Claude Recovery) ❌
EOF

# Run monitor
bash ~/openclaw/scripts/emergency-recovery-monitor.sh

# Check Discord for alert

📚 Advanced Troubleshooting

Enable debug logging

Add to scripts (temporary):

# In gateway-healthcheck.sh
set -x  # Enable bash debug mode

# View verbose output
tail -f ~/openclaw/memory/healthcheck-$(date +%Y-%m-%d).log

Check macOS system logs

# Filter for openclaw-related errors
log show --predicate 'process == "launchd" AND eventMessage CONTAINS "openclaw"' --last 1h

# Check LaunchAgent errors
log show --predicate 'subsystem == "com.apple.launchd"' --last 1h

Verify Gateway port

# Check what's listening on 18789
lsof -i :18789

# Expected: openclaw-gateway process

Check for port conflicts

# Find processes using common ports
lsof -i :18789
lsof -i :8080
lsof -i :3000

# If conflict, change Gateway port in ~/.openclaw/openclaw.json

🆘 Still Stuck?

Get help from the community

  1. GitHub Issues: github.com/ramsbaby/openclaw-self-healing/issues
  2. OpenClaw Discord: discord.com/invite/clawd
  3. Include in your report:
    • macOS version: sw_vers
    • OpenClaw version: openclaw version
    • Self-Healing logs: Last 50 lines of healthcheck-*.log and emergency-recovery-*.log
    • Script versions: head -5 ~/openclaw/scripts/*.sh
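Gathering the log excerpts for a report can be scripted. The sketch below prints the last 50 lines of the newest healthcheck and emergency-recovery logs; `collect_report` is a hypothetical helper, and the default directory is the log location used throughout this guide.

```shell
# collect_report is a hypothetical helper for illustration.
# Usage: collect_report [LOG_DIR]   (defaults to ~/openclaw/memory)
collect_report() {
  log_dir=${1:-"$HOME/openclaw/memory"}
  for prefix in healthcheck emergency-recovery; do
    # ls -t sorts newest first; head -1 keeps the most recent file.
    newest=$(ls -t "$log_dir"/"$prefix"-*.log 2>/dev/null | head -1)
    echo "=== $prefix ==="
    if [ -n "$newest" ]; then
      tail -n 50 "$newest"
    else
      echo "(no log found)"
    fi
  done
}
```

Usage: `{ sw_vers; openclaw version; collect_report; } > report.txt`, then review report.txt for tokens or webhook URLs before attaching it to an issue.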

Most issues are config or permissions related.
When in doubt, check .env and re-run chmod +x.