- 18 comprehensive monitoring checks - 5 systemd timers (5min, 15min, hourly, daily, weekly) - Complete documentation - NTFY secure notification system - Fixed debianvm disk space (91% to 57%) - Fixed CloudReve integration - Date: 2026-01-07
# ✅ HOMELAB MONITORING - FULLY OPERATIONAL

## Status: ALL SYSTEMS ACTIVE & SECURE

- Date: January 7, 2026
- Implementation: Complete
- Security: Secure (obscure topic names)

---
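
## 🔒 Your Secure NTFY Topics

- CRITICAL: `anthony-homelab-95ccf258e17eba20-critical`
- WARNING: `anthony-homelab-95ccf258e17eba20-warning`
- INFO: `anthony-homelab-95ccf258e17eba20-info`

The 16-character random hex string (64 bits) makes these names impractical to guess. On an NTFY server without access control, though, the topic name is the only secret: anyone who learns it can read and publish notifications. Treat the names like passwords and keep them out of public repositories and screenshots.
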
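
A minimal sketch of how topic names of this shape could be generated, assuming the random part comes from `/dev/urandom`. The helper names `random_hex` and `topic_name` are hypothetical, and this is not necessarily how the topics above were created.

```shell
#!/bin/sh
# Hypothetical helpers: build ntfy topic names shaped like
# <prefix>-<16 hex chars>-<level>, as used in this document.

# Print N random bytes from the kernel CSPRNG as lowercase hex (2 chars per byte).
random_hex() {
    od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# Join prefix, shared token, and level into one topic name.
topic_name() {
    printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Generate one token and reuse it for all three levels.
token=$(random_hex 8)
for level in critical warning info; do
    topic_name anthony-homelab "$token" "$level"
done
```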
---

## 📊 What's Being Monitored (18 Systems)

### Every 5 Minutes:

- Container status (docker, cloudreve, gitea, sftpgo)
- Unexpected VM/container shutdowns

### Every 15 Minutes:

- Service health (CloudReve, Home Assistant HTTP)
- Database health (PostgreSQL, Redis, MongoDB) and the aria2 download daemon
- Docker container restarts

### Every Hour:

- PVE host (disk, RAM, CPU, services)
- ALL VM disk space (debianvm, ubuntu-server-xfce, haos)
- Network storage (Fred NFS, iMacHDD CIFS)
- LVM thin pools (CRITICAL: a full pool can freeze VMs!)
- Ceph cluster health
- Tailscale VPN connectivity
- OOM killer detection
- Temperature monitoring
- Public IP changes
- Failed login attempts

### Daily (3 AM):

- Backup job status
- SSL certificate expiry
- System updates

### Weekly (Sunday 2 AM):

- Internet speed test

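
The five schedules above map naturally onto systemd `OnCalendar=` expressions. The sketch below shows one possible mapping and prints a matching timer unit. The function names and the unit description are assumptions for illustration, and the actual `homelab-monitor-*.timer` files may be written differently (for example, the hourly timer may not fire exactly on the hour).

```shell
#!/bin/sh
# Hypothetical mapping from the schedule names in this document
# to systemd OnCalendar expressions.
oncalendar_for() {
    case "$1" in
        5min)   echo '*:0/5' ;;               # every 5 minutes
        15min)  echo '*:0/15' ;;              # every 15 minutes
        hourly) echo 'hourly' ;;              # top of every hour
        daily)  echo '*-*-* 03:00:00' ;;      # daily at 3 AM
        weekly) echo 'Sun *-*-* 02:00:00' ;;  # Sundays at 2 AM
        *)      echo "unknown schedule: $1" >&2; return 1 ;;
    esac
}

# Print a timer unit for one schedule. A timer activates the service
# with the same base name unless Unit= says otherwise.
render_timer() {
    cal=$(oncalendar_for "$1") || return 1
    cat <<EOF
[Unit]
Description=Homelab monitor ($1)

[Timer]
OnCalendar=$cal
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

render_timer weekly
```

An expression can be checked before installing a unit with `systemd-analyze calendar 'Sun *-*-* 02:00:00'`, which prints the normalized form and the next trigger time.
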
---

## 🎯 Alert Levels

🔴 CRITICAL (Urgent):

- Disk >90% on any system
- Services completely down
- Thin pool >90% (VMs will freeze!)
- Databases down
- VMs/containers stopped unexpectedly

🟡 WARNING (High Priority):

- Disk 80-90%
- High CPU/RAM usage
- Thin pool 80-90%
- Network storage issues
- Slow internet speed

🔵 INFO (Informational):

- System updates available
- Public IP changed
- Backup completed
- Speed test results

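
The disk and thin-pool thresholds above can be expressed as a small classifier. This is a sketch under the stated thresholds only (above 90% is critical, 80% through 90% is a warning). The function names are hypothetical, and the real `check-*.sh` scripts may apply the limits differently.

```shell
#!/bin/sh
# Classify a usage percentage using the thresholds in this document.
# Accepts values such as "91", "91%", or "91.27".
classify_pct() {
    awk -v p="${1%\%}" 'BEGIN {
        if (p > 90)       print "critical"
        else if (p >= 80) print "warning"
        else              print "ok"
    }'
}

# Print the usage percentage of the filesystem holding a path (POSIX df format).
disk_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Read "name percent" lines on stdin and append a level to each.
# Lines without a percentage (for example non-thin volumes) are skipped.
classify_lines() {
    while read -r name pct; do
        [ -n "$pct" ] || continue
        printf '%s %s %s\n' "$name" "$pct" "$(classify_pct "$pct")"
    done
}

# Example use on a host with LVM (not run here):
#   lvs --noheadings -o lv_name,data_percent | classify_lines
#   classify_pct "$(disk_pct /)"
```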
---

## ✅ What We Fixed Today

1. Freed 46GB on debianvm (91% → 57%)
2. Fixed CloudReve/aria2 integration
3. Expanded VM 280 disk by 7GB (97% → 87%)
4. Implemented 18 comprehensive monitors
5. Secured notifications (obscure topics)
6. Centralized everything on PVE host

---

## 📱 Management Commands

View active timers (the pattern is quoted so the shell passes it to `systemctl` unexpanded):

    systemctl list-timers 'homelab-monitor-*'

View recent logs:

    journalctl -t homelab-monitor -n 50

Run checks manually:

    /usr/local/bin/check-pve-host.sh
    /usr/local/bin/check-all-vm-disks.sh
    /usr/local/bin/check-thin-pools.sh
    /usr/local/bin/check-databases.sh

Test notifications:

    /usr/local/bin/send-ntfy.sh critical Test Message test
    /usr/local/bin/send-ntfy.sh warning Test Message test
    /usr/local/bin/send-ntfy.sh info Test Message test

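
For orientation, here is a minimal sketch of what a sender like `send-ntfy.sh` might do, assuming it publishes over HTTP with `curl` and maps the three alert levels to ntfy priorities. The server URL, the environment variable names, and the `level title message` argument order are assumptions; the real script and the format of `/root/.ntfy-topics` are not shown in this document.

```shell
#!/bin/sh
# Assumed configuration: server URL and the shared topic prefix
# (everything before the trailing -critical/-warning/-info).
NTFY_SERVER=${NTFY_SERVER:-https://ntfy.sh}
NTFY_TOPIC_PREFIX=${NTFY_TOPIC_PREFIX:-example-homelab-0000000000000000}

# Map this document's alert levels to ntfy priority names.
ntfy_priority() {
    case "$1" in
        critical) echo urgent ;;
        warning)  echo high ;;
        info)     echo default ;;
        *)        return 1 ;;
    esac
}

# Build the publish URL for one level.
ntfy_url() {
    printf '%s/%s-%s\n' "$NTFY_SERVER" "$NTFY_TOPIC_PREFIX" "$1"
}

# Publish one notification: send_ntfy LEVEL TITLE MESSAGE
send_ntfy() {
    level=$1
    title=$2
    message=$3
    priority=$(ntfy_priority "$level") || {
        echo "unknown level: $level" >&2
        return 2
    }
    curl -fsS \
        -H "Title: $title" \
        -H "Priority: $priority" \
        -d "$message" \
        "$(ntfy_url "$level")"
}
```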
---

## 📍 Important Files

- Scripts: `/usr/local/bin/check-*.sh`
- Main sender: `/usr/local/bin/send-ntfy.sh`
- Topic names: `/root/.ntfy-topics`
- Timers: `/etc/systemd/system/homelab-monitor-*.timer`
- This doc: `/root/MONITORING-FINAL-SUMMARY.md`

---

## 🔧 Old Monitoring (DEBIANVM)

- Status: Still running in parallel
- Will be disabled after 1 week of successful new monitoring
- Location: `/usr/local/bin/` on DEBIANVM

To disable old monitoring later:

    ssh root@DEBIANVM
    systemctl stop homelab-hourly.timer homelab-daily.timer homelab-weekly.timer disk-monitor.timer
    systemctl disable homelab-hourly.timer homelab-daily.timer homelab-weekly.timer disk-monitor.timer

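
The stop and disable steps can also be combined with `systemctl disable --now`. The sketch below wraps the four old timers in a hypothetical helper that defaults to a dry run, so the commands can be reviewed before anything changes on DEBIANVM.

```shell
#!/bin/sh
# Old timer units named in this document.
OLD_TIMERS="homelab-hourly.timer homelab-daily.timer homelab-weekly.timer disk-monitor.timer"

# Hypothetical helper: print the commands by default,
# run them only when DRY_RUN=0 is set.
disable_old_timers() {
    for t in $OLD_TIMERS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "systemctl disable --now $t"
        else
            systemctl disable --now "$t"
        fi
    done
}

disable_old_timers
```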
---

## 🎉 You're All Set!

Your entire homelab is now comprehensively monitored with:

- 18 different health checks
- Clear, contextual alerts
- Secure, private notifications
- Centralized management
- Proactive issue detection

If anything goes wrong, you'll be notified at the next scheduled check: within 5 minutes for container and VM status, and within 15 minutes to an hour for most other checks.