homelab-monitoring/docs/MONITORING-FINAL-SUMMARY.md
PVE Monitoring System 3a14fd2736 Initial backup: 18 monitoring scripts + timers + docs
- 18 comprehensive monitoring checks
- 5 systemd timers (5min, 15min, hourly, daily, weekly)
- Complete documentation
- NTFY secure notification system
- Fixed debianvm disk space (91% to 57%)
- Fixed CloudReve integration
- Date: 2026-01-07
2026-01-07 16:30:34 +08:00


# ✅ HOMELAB MONITORING - FULLY OPERATIONAL
## Status: ALL SYSTEMS ACTIVE & SECURE
Date: January 7, 2026
Implementation: Complete
Security: Secure (obscure topic names)
---
## 🔒 Your Secure NTFY Topics
CRITICAL: anthony-homelab-95ccf258e17eba20-critical
WARNING: anthony-homelab-95ccf258e17eba20-warning
INFO: anthony-homelab-95ccf258e17eba20-info
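These names act as shared secrets: the random 16-character hex string makes them impractical to guess.
Unless the ntfy server enforces access control, anyone who learns a topic name can read and publish to it, so keep this list private.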
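
A topic set of this shape could be produced as sketched below. This is an illustration only, not the procedure actually used here: `make_topics` is a hypothetical helper, and reading 8 bytes from `/dev/urandom` is an assumption about how a 16-character hex suffix might be generated.

```bash
#!/bin/sh
# Sketch: derive three ntfy topic names that share one random suffix.
# make_topics is a hypothetical helper; only the name shape
# (<prefix>-<16 hex chars>-<level>) is taken from the topics listed above.

make_topics() {
    prefix=$1
    # 8 random bytes from the kernel RNG -> 16 lowercase hex characters.
    suffix=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')
    for level in critical warning info; do
        printf '%s-%s-%s\n' "$prefix" "$suffix" "$level"
    done
}

# Prints three topic names, one per alert level (suffix differs on every run).
make_topics anthony-homelab
```

Regenerating the suffix and updating subscribers is the simplest way to rotate topics if a name leaks.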
---
## 📊 What's Being Monitored (18 Systems)
### Every 5 Minutes:
- Container status (docker, cloudreve, gitea, sftpgo)
- VM/Container unexpected shutdowns
### Every 15 Minutes:
- Service health (CloudReve, Home Assistant HTTP)
- Database and backend health (PostgreSQL, Redis, MongoDB, aria2)
- Docker container restarts
### Every Hour:
- PVE Host (disk, RAM, CPU, services)
- ALL VM disk space (debianvm, ubuntu-server-xfce, haos)
- Network storage (Fred NFS, iMacHDD CIFS)
- LVM Thin Pools (CRITICAL - can freeze VMs!)
- Ceph cluster health
- Tailscale VPN connectivity
- OOM killer detection
- Temperature monitoring
- Public IP changes
- Failed login attempts
### Daily (3 AM):
- Backup job status
- SSL certificate expiry
- System updates
### Weekly (Sunday 2 AM):
- Internet speed test
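
The five schedule tiers above map naturally onto systemd `OnCalendar=` expressions. The sketch below writes one timer unit per tier. The unit names (`homelab-monitor-<tier>`) and the `Persistent=true` setting are assumptions; only the `homelab-monitor-*` prefix and the schedule times come from this document.

```bash
#!/bin/sh
# Sketch: write one systemd timer unit per schedule tier into a directory.
# Tier names and Persistent=true are assumptions, not the deployed config.

write_timers() {
    outdir=$1
    # Each input line is: tier|OnCalendar expression
    while IFS='|' read -r tier calendar; do
        cat > "$outdir/homelab-monitor-$tier.timer" <<EOF
[Unit]
Description=Homelab monitoring ($tier checks)

[Timer]
OnCalendar=$calendar
Persistent=true

[Install]
WantedBy=timers.target
EOF
    done <<'TIERS'
5min|*:0/5
15min|*:0/15
hourly|hourly
daily|*-*-* 03:00:00
weekly|Sun *-*-* 02:00:00
TIERS
}
```

For example, `write_timers /tmp/timers` produces five `.timer` files for review. Each timer activates the service unit of the same name by default, and an expression can be checked before deployment with `systemd-analyze calendar 'Sun *-*-* 02:00:00'`.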
---
## 🎯 Alert Levels
🔴 CRITICAL (Urgent):
- Disk >90% on any system
- Services completely down
- Thin pool >90% (VMs will freeze!)
- Databases down
- VMs/containers stopped unexpectedly
🟡 WARNING (High Priority):
- Disk 80-90%
- High CPU/RAM usage
- Thin pool 80-90%
- Network storage issues
- Slow internet speed
🔵 INFO (Informational):
- System updates available
- Public IP changed
- Backup completed
- Speed test results
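
The disk and thin-pool thresholds above can be expressed as a small classifier. This is a minimal sketch, not the code in the actual check scripts: `classify_usage` and the `ok` and `unknown` levels are hypothetical names, and exactly 90% is treated as WARNING because CRITICAL is listed as ">90%".

```bash
#!/bin/sh
# Sketch: map an integer usage percentage to an alert level using the
# thresholds listed above (>90 critical, 80-90 warning).
# classify_usage, "ok" and "unknown" are hypothetical names.

classify_usage() {
    pct=${1%\%}                      # accept "91%" as well as "91"
    case "$pct" in
        ''|*[!0-9]*) echo unknown; return 0 ;;   # not a plain integer
    esac
    if [ "$pct" -gt 90 ]; then
        echo critical
    elif [ "$pct" -ge 80 ]; then
        echo warning
    else
        echo ok
    fi
}

# Example: classify every mounted filesystem reported by df.
df -P | awk 'NR > 1 { print $5, $6 }' | while read -r pct mount; do
    echo "$mount: $(classify_usage "$pct")"
done
```

With these thresholds, the 91% reading on debianvm before cleanup would classify as critical and the 57% reading afterwards as ok. Fractional values, such as those LVM reports for thin pools, would need rounding before being passed in.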
---
## ✅ What We Fixed Today
1. Freed 46GB on debianvm (91% → 57%)
2. Fixed CloudReve/aria2 integration
3. Expanded VM 280 disk by 7GB (97% → 87%)
4. Implemented 18 comprehensive monitors
5. Secured notifications (obscure topics)
6. Centralized everything on PVE host
---
## 📱 Management Commands
View active timers:
systemctl list-timers 'homelab-monitor-*'
View recent logs:
journalctl -t homelab-monitor -n 50
Run checks manually:
/usr/local/bin/check-pve-host.sh
/usr/local/bin/check-all-vm-disks.sh
/usr/local/bin/check-thin-pools.sh
/usr/local/bin/check-databases.sh
Test notifications:
/usr/local/bin/send-ntfy.sh critical Test Message test
/usr/local/bin/send-ntfy.sh warning Test Message test
/usr/local/bin/send-ntfy.sh info Test Message test
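
For orientation, a sender with a similar call shape might look like the sketch below. It is not the contents of `/usr/local/bin/send-ntfy.sh`. The argument order (`<level> <title> <message> [tags]`), the `https://ntfy.sh` server, and the `NTFY_TOPIC_PREFIX`, `NTFY_SERVER` and `DRY_RUN` variables are all assumptions made for illustration. The priority mapping follows the alert levels in this document (CRITICAL is urgent, WARNING is high).

```bash
#!/bin/sh
# Sketch of an ntfy sender. Assumed, not confirmed by this document:
# argument order <level> <title> <message> [tags], the server URL, and the
# NTFY_TOPIC_PREFIX / NTFY_SERVER / DRY_RUN variables.

NTFY_SERVER=${NTFY_SERVER:-https://ntfy.sh}

send_ntfy() {
    level=$1 title=$2 message=$3 tags=${4:-}
    case "$level" in
        critical) priority=urgent ;;
        warning)  priority=high ;;
        info)     priority=default ;;
        *) echo "unknown level: $level" >&2; return 2 ;;
    esac
    url="$NTFY_SERVER/${NTFY_TOPIC_PREFIX:?set NTFY_TOPIC_PREFIX}-$level"
    if [ -n "${DRY_RUN:-}" ]; then
        # Show what would be sent without touching the network.
        echo "POST $url priority=$priority title=$title"
        return 0
    fi
    # Build the header list, adding Tags only when tags were given.
    set -- -H "Title: $title" -H "Priority: $priority"
    if [ -n "$tags" ]; then
        set -- "$@" -H "Tags: $tags"
    fi
    curl -fsS "$@" -d "$message" "$url" >/dev/null
}
```

Setting `DRY_RUN=1` before calling `send_ntfy warning "Disk" "debianvm at 85%"` shows the target topic and priority without publishing anything.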
---
## 📍 Important Files
Scripts: /usr/local/bin/check-*.sh
Main sender: /usr/local/bin/send-ntfy.sh
Topic names: /root/.ntfy-topics
Timers: /etc/systemd/system/homelab-monitor-*.timer
This doc: /root/MONITORING-FINAL-SUMMARY.md
---
## 🔧 Old Monitoring (DEBIANVM)
Status: Still running in parallel
Will be disabled after 1 week of successful new monitoring
Location: /usr/local/bin/ on DEBIANVM
To disable old monitoring later:
ssh root@DEBIANVM
systemctl stop homelab-hourly.timer homelab-daily.timer homelab-weekly.timer disk-monitor.timer
systemctl disable homelab-hourly.timer homelab-daily.timer homelab-weekly.timer disk-monitor.timer
---
## 🎉 You're All Set!
Your entire homelab is now comprehensively monitored with:
- 18 different health checks
- Clear, contextual alerts
- Secure, private notifications
- Centralized management
- Proactive issue detection
You'll be notified within one check interval (5 minutes to 1 week, depending on the check) if anything goes wrong!