Initial backup: 18 monitoring scripts + timers + docs
- 18 comprehensive monitoring checks
- 5 systemd timers (5min, 15min, hourly, daily, weekly)
- Complete documentation
- NTFY secure notification system
- Fixed debianvm disk space (91% to 57%)
- Fixed CloudReve integration
- Date: 2026-01-07
143
docs/MONITORING-FINAL-SUMMARY.md
Normal file
@@ -0,0 +1,143 @@
# ✅ HOMELAB MONITORING - FULLY OPERATIONAL

## Status: ALL SYSTEMS ACTIVE & SECURE

Date: January 7, 2026
Implementation: Complete
Security: Secure (obscure topic names)

---

## 🔒 Your Secure NTFY Topics

CRITICAL: anthony-homelab-95ccf258e17eba20-critical
WARNING: anthony-homelab-95ccf258e17eba20-warning
INFO: anthony-homelab-95ccf258e17eba20-info

The random 16-character hex string makes these topic names impractical to guess.
On an ntfy server without access control, anyone who knows a topic name can subscribe to it, so keep the names private.
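
As a rough illustration of how topic names of this shape could be produced, here is a minimal sketch. The function names and the `example-homelab` prefix are hypothetical; the actual script that generated the topics is not part of this commit.

```shell
#!/usr/bin/env bash
# Sketch only: build three ntfy topic names that share one random hex token.

# Print 16 lowercase hex digits (8 random bytes from /dev/urandom).
new_token() {
    od -An -N8 -tx1 /dev/urandom | tr -d ' \n'
}

# Print TOPIC_<LEVEL>=<prefix>-<token>-<level> for each alert level.
print_topics() {
    local prefix="$1" token level upper
    token="$(new_token)"
    for level in critical warning info; do
        upper="$(printf '%s' "$level" | tr '[:lower:]' '[:upper:]')"
        printf 'TOPIC_%s=%s-%s-%s\n' "$upper" "$prefix" "$token" "$level"
    done
}

print_topics example-homelab
```

The output has the same `TOPIC_<LEVEL>=...` layout as `docs/ntfy-topics.txt`, so it could be redirected into a topics file.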

---

## 📊 What's Being Monitored (18 Systems)

### Every 5 Minutes:
- Container status (docker, cloudreve, gitea, sftpgo)
- VM/Container unexpected shutdowns

### Every 15 Minutes:
- Service health (CloudReve, Home Assistant HTTP)
- Database health (PostgreSQL, Redis, MongoDB, aria2)
- Docker container restarts

### Every Hour:
- PVE Host (disk, RAM, CPU, services)
- ALL VM disk space (debianvm, ubuntu-server-xfce, haos)
- Network storage (Fred NFS, iMacHDD CIFS)
- LVM Thin Pools (CRITICAL - can freeze VMs!)
- Ceph cluster health
- Tailscale VPN connectivity
- OOM killer detection
- Temperature monitoring
- Public IP changes
- Failed login attempts

### Daily (3 AM):
- Backup job status
- SSL certificate expiry
- System updates

### Weekly (Sunday 2 AM):
- Internet speed test
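
The schedule above maps onto systemd `OnCalendar=` expressions. The sketch below writes five timer units into a scratch directory for review. The unit names after `homelab-monitor-` and the `Persistent=true` setting are assumptions, not taken from the deployed timers.

```shell
#!/usr/bin/env bash
# Sketch only: generate timer units matching the five schedules above.
# Each timer starts the service of the same name (systemd's default pairing).
write_timer() {
    local name="$1" schedule="$2" dir="$3"
    cat > "$dir/homelab-monitor-$name.timer" <<EOF
[Unit]
Description=Homelab monitoring checks ($name)

[Timer]
OnCalendar=$schedule
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Write to a scratch directory unless OUT_DIR is set.
out_dir="${OUT_DIR:-$(mktemp -d)}"
write_timer 5min   '*:0/5'              "$out_dir"
write_timer 15min  '*:0/15'             "$out_dir"
write_timer hourly 'hourly'             "$out_dir"
write_timer daily  '*-*-* 03:00:00'     "$out_dir"
write_timer weekly 'Sun *-*-* 02:00:00' "$out_dir"
echo "Timer units written to $out_dir"
```

Running `systemd-analyze calendar '<expression>'` on the host is a quick way to confirm when an expression will next fire before installing a unit.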

---

## 🎯 Alert Levels

🔴 CRITICAL (Urgent):
- Disk >90% on any system
- Services completely down
- Thin pool >90% (VMs will freeze!)
- Databases down
- VMs/containers stopped unexpectedly

🟡 WARNING (High Priority):
- Disk 80-90%
- High CPU/RAM usage
- Thin pool 80-90%
- Network storage issues
- Slow internet speed

🔵 INFO (Informational):
- System updates available
- Public IP changed
- Backup completed
- Speed test results
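
The disk and thin-pool thresholds above can be expressed as a small helper. This is a sketch under the stated thresholds only; `classify_usage` is a hypothetical name and is not taken from the check scripts.

```shell
#!/usr/bin/env bash
# Sketch only: map a usage percentage to an alert level.
# More than 90 is critical, 80 to 90 is warning, anything lower is ok.
classify_usage() {
    local pct="$1"
    if [ "$pct" -gt 90 ]; then
        echo critical
    elif [ "$pct" -ge 80 ]; then
        echo warning
    else
        echo ok
    fi
}

# The three disk percentages mentioned in this summary.
for pct in 57 87 91; do
    echo "${pct}% -> $(classify_usage "$pct")"
done
```

In a real check, the percentage could come from something like `df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }'`.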

---

## ✅ What We Fixed Today

1. Freed 46GB on debianvm (91% → 57%)
2. Fixed CloudReve/aria2 integration
3. Expanded VM 280 disk by 7GB (97% → 87%)
4. Implemented 18 comprehensive monitors
5. Secured notifications (obscure topics)
6. Centralized everything on PVE host

---

## 📱 Management Commands

View active timers:
systemctl list-timers 'homelab-monitor-*'

View recent logs:
journalctl -t homelab-monitor -n 50

Run checks manually:
/usr/local/bin/check-pve-host.sh
/usr/local/bin/check-all-vm-disks.sh
/usr/local/bin/check-thin-pools.sh
/usr/local/bin/check-databases.sh

Test notifications:
/usr/local/bin/send-ntfy.sh critical "Test" "Message" "test"
/usr/local/bin/send-ntfy.sh warning "Test" "Message" "test"
/usr/local/bin/send-ntfy.sh info "Test" "Message" "test"
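
For orientation, here is a minimal sketch of what a sender with this four-argument interface might look like. It is not the contents of `send-ntfy.sh`. The argument order (level, title, message, tags), the `https://ntfy.sh` default server, and the `DRY_RUN` switch are all assumptions. The `Title`, `Priority`, and `Tags` headers are standard ntfy publish headers.

```shell
#!/usr/bin/env bash
# Sketch only: publish a message to the ntfy topic for a given alert level.
NTFY_SERVER="${NTFY_SERVER:-https://ntfy.sh}"      # assumed server URL
TOPICS_FILE="${TOPICS_FILE:-/root/.ntfy-topics}"   # lines like TOPIC_INFO=...

send_ntfy() {
    local level="$1" title="$2" message="$3" tags="$4"
    local priority key topic
    # Map the alert level to an ntfy priority name.
    case "$level" in
        critical) priority=urgent ;;
        warning)  priority=high ;;
        info)     priority=default ;;
        *) echo "unknown level: $level" >&2; return 2 ;;
    esac
    # Look up TOPIC_<LEVEL>=<name> in the topics file.
    key="TOPIC_$(printf '%s' "$level" | tr '[:lower:]' '[:upper:]')"
    topic="$(sed -n "s/^${key}=//p" "$TOPICS_FILE")"
    if [ -z "$topic" ]; then
        echo "no topic for $level in $TOPICS_FILE" >&2
        return 1
    fi
    # With DRY_RUN set, print the request instead of sending it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "POST $NTFY_SERVER/$topic priority=$priority title=$title tags=$tags"
        return 0
    fi
    curl -fsS -H "Title: $title" -H "Priority: $priority" -H "Tags: $tags" \
        -d "$message" "$NTFY_SERVER/$topic" >/dev/null
}

# Usage: send_ntfy info "Test" "Message" "test"
```

Keeping the topic names in a separate file means the script itself never contains them, which matches the `/root/.ntfy-topics` layout described in this document.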

---

## 📍 Important Files

Scripts: /usr/local/bin/check-*.sh
Main sender: /usr/local/bin/send-ntfy.sh
Topic names: /root/.ntfy-topics
Timers: /etc/systemd/system/homelab-monitor-*.timer
This doc: /root/MONITORING-FINAL-SUMMARY.md

---

## 🔧 Old Monitoring (DEBIANVM)

Status: Still running in parallel
Will be disabled after 1 week of successful new monitoring
Location: /usr/local/bin/ on DEBIANVM

To disable old monitoring later:
ssh root@DEBIANVM
systemctl stop homelab-hourly.timer homelab-daily.timer homelab-weekly.timer disk-monitor.timer
systemctl disable homelab-hourly.timer homelab-daily.timer homelab-weekly.timer disk-monitor.timer

---

## 🎉 You're All Set!

Your entire homelab is now comprehensively monitored with:
- 18 different health checks
- Clear, contextual alerts
- Secure, private notifications
- Centralized management
- Proactive issue detection

You'll be alerted within minutes if a container or VM stops, and within the hour for most other problems!
44
docs/QUICK-REFERENCE.txt
Normal file
@@ -0,0 +1,44 @@
═══════════════════════════════════════════════════════════
HOMELAB MONITORING - QUICK REFERENCE
═══════════════════════════════════════════════════════════

📱 YOUR NTFY TOPICS (subscribed on phone):
anthony-homelab-95ccf258e17eba20-critical
anthony-homelab-95ccf258e17eba20-warning
anthony-homelab-95ccf258e17eba20-info

🔒 SECURITY: Topics use a random hex token (hard to guess; keep them private)

📊 MONITORING SCHEDULE:
Every 5 min → Containers, VM shutdowns
Every 15 min → Services, databases
Every hour → Disk space, health checks
Daily 3 AM → Backups, SSL, updates
Weekly (Sunday 2 AM) → Speed tests

⚙️ USEFUL COMMANDS:

Check timer status:
systemctl list-timers 'homelab-monitor-*'

View recent alerts:
journalctl -t homelab-monitor -n 50

Test notification:
/usr/local/bin/send-ntfy.sh info "Test" "Message" "test"

Run checks manually:
/usr/local/bin/check-pve-host.sh
/usr/local/bin/check-all-vm-disks.sh

📁 IMPORTANT FILES:
/root/MONITORING-FINAL-SUMMARY.md (full docs)
/root/.ntfy-topics (topic names)
/usr/local/bin/check-*.sh (18 monitoring scripts)

🎯 WHAT GETS ALERTED:
🔴 CRITICAL: Disk >90%, services down, thin pool >90%
🟡 WARNING: Disk 80-90%, high CPU/RAM, network issues
🔵 INFO: Updates, IP changes, backup completion

═══════════════════════════════════════════════════════════
127
docs/VERIFICATION-REPORT.txt
Normal file
@@ -0,0 +1,127 @@
═══════════════════════════════════════════════════════════
HOMELAB MONITORING - VERIFICATION REPORT
═══════════════════════════════════════════════════════════

Date: January 7, 2026
Status: ✅ ALL SYSTEMS OPERATIONAL

═══════════════════════════════════════════════════════════
VERIFICATION CHECKLIST
═══════════════════════════════════════════════════════════

✅ 18 Monitoring Scripts Created
✅ All Scripts Executable and Tested
✅ NTFY Sender Script Configured
✅ 3 Secure Topics Created
✅ 5 Systemd Timers Active
✅ Container Monitoring Fixed (no false alerts)
✅ Service Monitoring Fixed (CloudReve)
✅ OOM Detection Script Fixed
✅ Failed Login Monitoring Fixed
✅ Test Notifications Delivered Successfully

═══════════════════════════════════════════════════════════
MONITORING SCRIPTS (18 Total)
═══════════════════════════════════════════════════════════

Every 5 Minutes:
✅ check-containers.sh (docker, cloudreve, gitea, sftpgo)
✅ check-vm-shutdowns.sh (detect unexpected VM/CT stops)

Every 15 Minutes:
✅ check-services.sh (HTTP health checks)
✅ check-databases.sh (PostgreSQL, Redis, aria2)
✅ check-docker-restarts.sh (restart loops)

Every Hour:
✅ check-pve-host.sh (PVE disk, RAM, CPU, services)
✅ check-all-vm-disks.sh (ALL VMs disk space)
✅ check-network-storage.sh (Fred NFS, iMac CIFS)
✅ check-thin-pools.sh (CRITICAL - VM freeze prevention)
✅ check-ceph.sh (Ceph cluster health)
✅ check-tailscale.sh (VPN connectivity)
✅ check-oom.sh (out of memory killer)
✅ check-temperature.sh (CPU/disk temps)
✅ check-network.sh (public IP changes)
✅ check-failed-logins.sh (security monitoring)

Daily (3 AM):
✅ check-backups.sh (backup job status)
✅ check-ssl-certs.sh (certificate expiry)
✅ check-updates.sh (system updates)

Weekly (Sunday 2 AM):
✅ check-network.sh --speedtest (internet speed)

═══════════════════════════════════════════════════════════
NTFY TOPICS (Secure)
═══════════════════════════════════════════════════════════

🔴 anthony-homelab-95ccf258e17eba20-critical
🟡 anthony-homelab-95ccf258e17eba20-warning
🔵 anthony-homelab-95ccf258e17eba20-info

Security: Topics use a random 16-character hex token (impractical to guess)
Privacy: Only someone who knows a topic name can subscribe, so keep the names private

═══════════════════════════════════════════════════════════
ISSUES FIXED
═══════════════════════════════════════════════════════════

✅ False Alert: Container 100
- Was trying to check VM 100 as container
- Fixed: Script now skips non-existent containers
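
One way such a skip could look is sketched below. It assumes the Proxmox `pct` CLI, where `pct config <id>` fails when no container with that ID exists and `pct status <id>` prints a line such as `status: running`. The function name `check_container` is hypothetical and is not taken from check-containers.sh.

```shell
#!/usr/bin/env bash
# Sketch only: report on a container, skipping IDs that are not containers.
check_container() {
    local id="$1" status
    # No container config for this ID (for example, the ID belongs to a VM):
    # skip quietly instead of raising an alert.
    if ! pct config "$id" >/dev/null 2>&1; then
        echo "skip $id"
        return 0
    fi
    status="$(pct status "$id" 2>/dev/null)"
    case "$status" in
        *running*) echo "ok $id" ;;
        *)         echo "alert $id"; return 1 ;;
    esac
}

# Usage: check_container 100
```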

✅ False Alert: CloudReve Unreachable
- Was checking wrong IP address (DHCP changed)
- Fixed: Now checks from inside container (reliable)

✅ OOM Script: Variable handling errors
- Fixed: Proper variable initialization

✅ Failed Logins Script: Unbound variables
- Fixed: Proper error handling
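
Variable-handling errors of this kind typically show up when a script runs under `set -eu`. The sketch below shows two common guards. It is an illustration, not the actual fix: `count_matches` and `FAILED_LOGIN_THRESHOLD` are hypothetical names.

```shell
#!/usr/bin/env bash
set -eu

# Count matching lines without aborting the script.
# grep -c prints 0 but exits with status 1 when nothing matches, and prints
# nothing at all when the file is missing, so guard both cases.
count_matches() {
    local pattern="$1" file="$2" count
    count="$(grep -ci -- "$pattern" "$file" || true)"
    echo "${count:-0}"
}

# Give optional settings a default so `set -u` does not abort on them.
threshold="${FAILED_LOGIN_THRESHOLD:-10}"
echo "alert threshold: $threshold"
```

Declaring `count` with `local` first and assigning on a separate line keeps the exit status of the command substitution visible, which matters once `set -e` is on.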

═══════════════════════════════════════════════════════════
WHAT YOU ACCOMPLISHED TODAY
═══════════════════════════════════════════════════════════

💾 Freed 46GB on debianvm (91% → 57%)
📀 Expanded VM 280 disk by 7GB (97% → 87%)
🔧 Fixed CloudReve/aria2 integration
📊 Implemented 18 comprehensive monitors
🔒 Secured notifications (obscure topics)
🎯 Centralized on PVE host
✅ Fixed false positive alerts
🔍 Verified all systems working

═══════════════════════════════════════════════════════════
NEXT ACTIONS
═══════════════════════════════════════════════════════════

☐ Monitor notifications for 1 week
☐ Verify no false positives
☐ After 1 week: Disable old DEBIANVM monitoring
☐ Adjust thresholds if needed

═══════════════════════════════════════════════════════════
USEFUL COMMANDS
═══════════════════════════════════════════════════════════

View timers: systemctl list-timers 'homelab-monitor-*'
View logs: journalctl -t homelab-monitor -n 50
Test alert: /usr/local/bin/send-ntfy.sh info "Test" "Msg" "test"
Run check: /usr/local/bin/check-pve-host.sh

═══════════════════════════════════════════════════════════
DOCUMENTATION FILES
═══════════════════════════════════════════════════════════

/root/MONITORING-FINAL-SUMMARY.md - Complete documentation
/root/QUICK-REFERENCE.txt - Quick reference card
/root/VERIFICATION-REPORT.txt - This file
/root/.ntfy-topics - Secure topic names

═══════════════════════════════════════════════════════════
SYSTEM STATUS: ✅ FULLY OPERATIONAL
═══════════════════════════════════════════════════════════
3
docs/ntfy-topics.txt
Normal file
@@ -0,0 +1,3 @@
TOPIC_CRITICAL=anthony-homelab-95ccf258e17eba20-critical
TOPIC_WARNING=anthony-homelab-95ccf258e17eba20-warning
TOPIC_INFO=anthony-homelab-95ccf258e17eba20-info