Add alerting for critical node or hardware status
Depends on #7 (closed)
Once alerting is in place, it would be good to add alerts for critical events that aren't currently monitored. Prometheus Operator ships with a fantastic set of alerts, but alerts on a few more metrics could be added. At least:
- temperature approaching maximum
- temperature exceeding maximum
- temperature approaching critical
- temperature exceeding critical
Although the last one is likely to kill node_exporter before the alert can fire, the first three would be very useful.
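The four conditions above could be sketched as a `PrometheusRule`, assuming node_exporter's hwmon collector is enabled and exposes `node_hwmon_temp_celsius`, `node_hwmon_temp_max_celsius` and `node_hwmon_temp_crit_celsius` for the sensors in question. The resource name, the `release` label, the 90% "approaching" threshold, the `for` durations and the severities are all placeholders, not values taken from this setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-temperature-alerts        # hypothetical name
  labels:
    release: prometheus-operator       # assumption: must match the Prometheus ruleSelector
spec:
  groups:
    - name: node-temperature
      rules:
        # "Approaching" is assumed to mean above 90% of the sensor's reported limit.
        - alert: NodeTemperatureApproachingMax
          expr: node_hwmon_temp_celsius > 0.9 * node_hwmon_temp_max_celsius
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.chip }}/{{ $labels.sensor }} on {{ $labels.instance }} is above 90% of its maximum temperature"
        - alert: NodeTemperatureExceedingMax
          expr: node_hwmon_temp_celsius >= node_hwmon_temp_max_celsius
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.chip }}/{{ $labels.sensor }} on {{ $labels.instance }} has reached its maximum temperature"
        - alert: NodeTemperatureApproachingCritical
          expr: node_hwmon_temp_celsius > 0.9 * node_hwmon_temp_crit_celsius
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.chip }}/{{ $labels.sensor }} on {{ $labels.instance }} is above 90% of its critical temperature"
        # May never be delivered if the node shuts down first.
        - alert: NodeTemperatureExceedingCritical
          expr: node_hwmon_temp_celsius >= node_hwmon_temp_crit_celsius
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.chip }}/{{ $labels.sensor }} on {{ $labels.instance }} has reached its critical temperature"
```

Each comparison matches series on their full label set, so only sensors that report both a reading and the corresponding limit produce alerts. Some sensors may report missing or implausible limits, so the expressions would probably need filtering per chip before being relied on.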
Edited by justin