Data centers are complex beasts, and no amount of manual watching by operators can keep track of everything. That’s why automated monitoring is so important.
So what should you monitor? You can divide your monitoring into two strategic areas. Just as with metrics collection, there is business & application-level monitoring, and there is lower-level systems monitoring. Both are important.
Business & Application Monitoring
If a user is getting an error page or cannot connect
If an e-commerce transaction is failing
General service outages
If a business goal is met – or not
Page timeouts or slowness
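As a rough illustration of the application-level checks above, here is a minimal Python sketch that fetches a page and reports whether it is healthy, slow, or failing. The function names, the 2-second slowness threshold, and the choice to treat 4xx responses as warnings are assumptions made for this example, not part of any particular tool. The numeric states follow the Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import time
import urllib.error
import urllib.request

# Check states, numbered as in the Nagios plugin exit-code convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_response(status_code, elapsed_seconds, slow_after=2.0):
    """Map an HTTP status and response time to a check state.

    status_code is None when no response was received at all.
    slow_after is a hypothetical threshold; tune it to your own pages.
    """
    if status_code is None or status_code >= 500:
        return CRITICAL   # could not connect, or a server-side error page
    if status_code >= 400:
        return WARNING    # client-side error page, e.g. a 404
    if elapsed_seconds > slow_after:
        return WARNING    # the page loaded, but too slowly
    return OK


def check_url(url, timeout=10.0):
    """Fetch url once and return (state, status_code, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None      # connection refused, DNS failure, or timeout
    elapsed = time.monotonic() - start
    return classify_response(status, elapsed), status, elapsed
```

Calling `check_url` against your checkout or login page on a schedule gives a crude version of the "error page", "cannot connect", and "slowness" checks. Keeping the classification in its own function makes the thresholds easy to adjust and to test without touching the network.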
Systems Level Monitoring
Backups completed successfully
Error logs from the database, webserver & other major services such as email
Database replication is running
Webserver timeouts
Database timeouts
Replication failures – via error logs & checksum checks
Memory, CPU, Disk I/O, Server load average
Network latency
Network security
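A few of the systems-level items above can be sketched with nothing but the Python standard library. The thresholds, the 26-hour backup window, and all function names below are hypothetical choices for this example. `os.getloadavg` is only available on Unix-like systems. The states again follow the Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import os
import shutil
import time

# Check states, numbered as in the Nagios plugin exit-code convention.
OK, WARNING, CRITICAL = 0, 1, 2


def threshold_state(value, warn, crit):
    """Compare a measurement against warning and critical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_percent_used(path="/"):
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def load_per_cpu():
    """One-minute load average divided by CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def backup_state(path, max_age_hours=26.0, now=None):
    """CRITICAL if the backup file is missing, empty, or too old."""
    now = time.time() if now is None else now
    try:
        info = os.stat(path)
    except OSError:
        return CRITICAL  # file is missing or unreadable
    age_hours = (now - info.st_mtime) / 3600.0
    if info.st_size == 0 or age_hours > max_age_hours:
        return CRITICAL
    return OK


def run_checks(backup_path):
    """Run every check once and return a dict of check name -> state."""
    return {
        "disk": threshold_state(disk_percent_used("/"), warn=80, crit=90),
        "load": threshold_state(load_per_cpu(), warn=1.0, crit=2.0),
        "backup": backup_state(backup_path),
    }
```

Note that a fresh, non-empty backup file only shows that the backup job ran and wrote something. It does not prove the backup can be restored, so a periodic test restore is still worth doing.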
Tools that can perform this type of monitoring include Nagios,