Monitoring low level server hardware health



  • I was pondering this yesterday evening. Let's say you have a few servers in colocation. How do folks monitor RAID health, whether or not a PSU failed, component temperature, etc.?

    I suppose one way would be to connect each server's out-of-band management port to your LAN, connect to a VM acting as a jumpbox, and access the management interfaces from that VM (since you'd obviously not expose the out-of-band interfaces themselves to the Internet). One problem with this is that you would have to check each server one at a time.

    Perhaps another answer is to see if there's some kind of monitoring software provided by (or purchased from) your server vendor. You'd install an agent on each physical box, which talks to a VM that you can connect to for a dashboard and drill down into details.

    I suppose a third way is to offload such monitoring to the colocation staff; they'd install whatever monitoring software they use and do the monitoring for you.
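
    On the one-at-a-time problem with the out-of-band route: many current BMCs (recent iDRAC and iLO among them) expose the DMTF Redfish REST API, so the jumpbox VM could poll every management interface in a loop instead of someone logging in to each. Below is a minimal Python sketch of that idea, not a tested tool. It assumes HTTP Basic auth is enabled on the BMC and that the chassis exposes the older `Thermal` and `Power` resources; the function names and the host list are made up for illustration, and RAID health (which lives under the Redfish storage resources) is not covered.

```python
import base64
import json
import ssl
import urllib.request


def fetch_json(host, path, user, password, verify_tls=True):
    """GET https://<host><path> with HTTP Basic auth and decode the JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"https://{host}{path}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Many BMCs ship with self-signed certificates. Only skip
        # verification on a trusted management network.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return json.load(resp)


def unhealthy(resource, sections):
    """Return (section, name, health) for each present entry whose Health is not OK."""
    problems = []
    for section in sections:
        for item in resource.get(section) or []:
            status = item.get("Status") or {}
            if status.get("State") == "Absent":
                continue  # empty fan or PSU slot, nothing to report
            health = status.get("Health")
            if health not in (None, "OK"):
                problems.append((section, item.get("Name", "?"), health))
    return problems


def check_host(host, user, password, fetch=fetch_json):
    """Walk every chassis on one BMC and collect temperature, fan and PSU problems."""
    problems = []
    chassis = fetch(host, "/redfish/v1/Chassis", user, password)
    for member in chassis.get("Members", []):
        base = member.get("@odata.id", "").rstrip("/")
        if not base:
            continue
        # Assumption: the chassis still serves the Thermal and Power resources.
        thermal = fetch(host, base + "/Thermal", user, password)
        power = fetch(host, base + "/Power", user, password)
        problems += unhealthy(thermal, ["Temperatures", "Fans"])
        problems += unhealthy(power, ["PowerSupplies"])
    return problems


def check_all(hosts, user, password):
    """Poll each BMC in turn; an unreachable BMC is reported rather than fatal."""
    report = {}
    for host in hosts:
        try:
            report[host] = check_host(host, user, password)
        except (OSError, ValueError) as exc:
            report[host] = [("error", host, str(exc))]
    return report
```

    Calling something like `check_all(["10.0.0.11", "10.0.0.12"], "monitor", "secret")` from a cron job on the jumpbox (placeholder addresses and credentials) would give one report for all servers, which could then be mailed or fed into whatever alerting is already in place.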



  • I would just set up something like Zabbix and then install the Zabbix agent on each host to monitor relevant sensors and alert as things come up.
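
    As a rough idea of what an agent-side sensor check might look like, here is a small Python sketch that parses the pipe-separated output of `ipmitool sensor` and picks out threshold sensors that are not reporting `ok`. This is an assumption about how one could feed hardware sensors to an agent, not how Zabbix does it internally. It assumes `ipmitool` is installed and can reach the local BMC, and the function names are illustrative.

```python
import subprocess

# Statuses treated as "nothing to alert on" in this sketch: a reading inside
# its thresholds, or a sensor that currently has no reading.
QUIET_STATUSES = {"ok", "na", "ns"}


def parse_sensor_output(text):
    """Turn pipe-separated `ipmitool sensor` lines into a list of dicts."""
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # blank or unexpected line
        name, value, unit, status = fields[:4]
        try:
            reading = float(value)
        except ValueError:
            reading = None  # "na" or a discrete hex state
        sensors.append(
            {"name": name, "reading": reading, "unit": unit, "status": status}
        )
    return sensors


def failing(sensors):
    """Threshold sensors whose status is something other than ok / no reading."""
    return [
        s
        for s in sensors
        # Discrete sensors report a hex state instead of a threshold status,
        # so they are skipped here rather than guessed at.
        if s["unit"] != "discrete" and s["status"].lower() not in QUIET_STATUSES
    ]


def read_local_sensors():
    """Run ipmitool on this host (usually needs root) and parse its output."""
    out = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    ).stdout
    return parse_sensor_output(out)
```

    A wrapper that prints `len(failing(read_local_sensors()))` could then be exposed to the agent as a custom item (Zabbix calls these user parameters), with a trigger on any value above zero.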



  • Might depend on the server and what it has built in.

    Dell has the iDRAC, which you can log onto to look at hardware logs. (Not sure if you can set up e-mail alerts; it's been a while since I've used one.)



  • On Dell you can install OpenManage right on the host (if running Hyper-V, I think @JaredBusch has a guide on here somewhere) and use that to monitor it and email alerts or send SNMP traps.



  • @hobbit666 said in Monitoring low level server hardware health:

    Might depend on the server and what it has built in.

    Dell has the iDRAC, which you can log onto to look at hardware logs. (Not sure if you can set up e-mail alerts; it's been a while since I've used one.)

    On the newer ones you can, since version 7 I believe.