Templates: Server Health Checks - Generic SNMP-enabled Device

Monitors included in the Server health checks – Generic SNMP-enabled device template

The Server health checks – Generic SNMP-enabled device application template includes a number of predefined monitors suitable for a typical SNMP-enabled network device, such as a router. More about templates.

Note that not all of the template monitors are essential in every case; see the tips section below for more information. You can also add monitors to the application created from this template to expand its capabilities.

Monitors list

Load average *100, 1 min

PING

Uptime

Traffic speed total, kbit/s

Traffic volume total

Users logged in

LinkUp/LinkDown trap

Monitors description

The monitors below are part of the Generic SNMP-enabled device template (Server health checks category):

Load average *100, 1 min (enabled by default) gives an idea of how heavily the device's resources are used. It reflects the combined demand for CPU, RAM and disk I/O rather than any one of them.
The load average does not give a precise measure of resource consumption; it is a value that has to be interpreted in light of actual resource usage (for example, every process using or waiting for a CPU core adds 1 to the average). The general rule is: the lower the load average, the better.
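As a rough illustration of how such a reading can be interpreted: the monitor name suggests that the raw value is the 1-minute load average multiplied by 100. The Python sketch below converts a raw reading back to a plain load average and relates it to the number of CPU cores. The function names and the per-core rule of thumb are assumptions made for this example, not part of the template.

```python
def load_from_raw(raw_value: int) -> float:
    """Convert a 'load average *100' reading back to a plain load average."""
    return raw_value / 100.0


def load_per_core(raw_value: int, cpu_cores: int) -> float:
    """Load average divided by core count (hypothetical rule of thumb:
    values above 1.0 suggest that work is queuing for the CPU)."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return load_from_raw(raw_value) / cpu_cores


print(load_from_raw(250))     # 2.5
print(load_per_core(250, 4))  # 0.625
```

A raw value of 250 on a four-core device therefore corresponds to a load of 2.5, or about 0.6 per core, which would usually not be a concern.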

PING (enabled by default) is a common, lightweight network availability check: it sends a special ICMP packet and waits for a response. It is widely used to check the state of network devices and to measure latency. Note, though, that ICMP can easily be blocked on the server side, which makes this check ineffective.

Uptime (enabled by default) measures how long the server has been running. Security updates are now released very frequently and in most cases require a reboot, so a long uptime may indicate that the system is missing an update.
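The idea of flagging an unusually long uptime can be sketched as follows. The example assumes that the raw value is expressed in SNMP TimeTicks (hundredths of a second), which is how SNMP uptime objects such as sysUpTime are encoded; the 90-day limit and the function names are hypothetical and should be adapted to your own update policy.

```python
from datetime import timedelta


def ticks_to_uptime(time_ticks: int) -> timedelta:
    """Convert SNMP TimeTicks (1/100 of a second) to a timedelta."""
    return timedelta(seconds=time_ticks / 100)


def uptime_suspiciously_long(time_ticks: int, max_days: int = 90) -> bool:
    """True if the device has been up longer than max_days (assumed limit)."""
    return ticks_to_uptime(time_ticks) > timedelta(days=max_days)


print(ticks_to_uptime(8_640_000))  # 1 day, 0:00:00
```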

CPU usage shows the CPU load (Total Active Time, in %). The higher the load, the more of the time the system is busy. Expected normal values depend on what the target server does, so the actual performance threshold should be chosen on a case-by-case basis.

Metrics that can be used for this monitor include: Total Active Time, User Time, System Time, Wait IO, Run queue length, Interrupts per second, Context switches per second.

Free disk space on / reports the disk space available on the root filesystem, as a percentage. This monitor warns you if less than 10% of the filesystem is free.

Depending on the software in use, a certain amount of free space must also be available, otherwise software components may start to fail. Other typical mount points (such as /tmp, /home, etc.) can be checked in the same way if they reside on different filesystems. Typically, 5-10% of total filesystem space should always be free.
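The 10% rule described above amounts to a simple percentage comparison. The following Python sketch shows the arithmetic only; the function names and the "OK"/"Warning" labels are illustrative assumptions and do not describe how the monitor is implemented.

```python
def free_percent(free_bytes: int, total_bytes: int) -> float:
    """Free space as a percentage of the total filesystem size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def disk_state(free_bytes: int, total_bytes: int,
               warn_below_percent: float = 10.0) -> str:
    """Return 'Warning' when free space drops below the threshold, else 'OK'."""
    if free_percent(free_bytes, total_bytes) < warn_below_percent:
        return "Warning"
    return "OK"


print(disk_state(5 * 1024**3, 100 * 1024**3))   # Warning (5% free)
print(disk_state(25 * 1024**3, 100 * 1024**3))  # OK (25% free)
```

The same check can be repeated for each additional mount point, with a different threshold where the software on that filesystem needs more headroom.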

Free physical memory shows the amount of free memory available (as a percentage). The recommended amount of free RAM depends on the software running on the target device. However, keeping at least 5-10% free is recommended; if free RAM is often below that threshold, an alarm should be raised.

Network in errors shows the number of inbound packets that contained errors preventing them from being delivered to a higher-layer protocol. This may indicate either degraded connectivity or a hardware defect. You may need to correct the ‘OID to monitor’ parameter so that it refers to an actual network interface.

Network out errors shows the number of outbound packets that could not be transmitted because of errors. This may indicate either degraded connectivity or a hardware defect. You may need to correct the ‘OID to monitor’ parameter so that it refers to an actual network interface.
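Error counts of this kind are usually cumulative counters, so what matters in practice is how fast they grow between two polls. The sketch below assumes 32-bit counters that wrap around to zero (as the IF-MIB objects ifInErrors and ifOutErrors do); the function names are hypothetical. Note that a counter reset, for example after a device reboot, cannot be told apart from a wraparound by this calculation.

```python
COUNTER32_MODULUS = 2 ** 32  # assumption: 32-bit SNMP counters


def counter_delta(previous: int, current: int,
                  modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter samples, allowing for one wraparound."""
    return (current - previous) % modulus


def errors_per_second(previous: int, current: int,
                      interval_seconds: float) -> float:
    """Average error rate between two polls taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_delta(previous, current) / interval_seconds


print(counter_delta(100, 160))             # 60
print(counter_delta(2**32 - 10, 5))        # 15 (counter wrapped)
print(errors_per_second(100, 160, 60.0))   # 1.0
```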

Process count reports how many processes are running (total count). Certain network devices may have specific programs (processes) that should always be running; for those, the count of the corresponding process (name and arguments can be specified) should be above zero.

Note that you can configure an alert for the Down state that attempts to re-launch the stopped process via an SNMP Set command (i.e., emulating the task of programs such as Supervisor).

Swap memory in use measures the amount of memory allocated on swap devices that is currently in use. Swapping means moving data from RAM to a storage device in order to free the RAM for use by another process. The more swap is in use, the less memory is available for applications and the more applications are competing for physical memory.

CPU temperature reports the overall CPU temperature in degrees C, obtained from sensor #1. High values may indicate a hardware malfunction.

Traffic speed total, kbit/s is measured on a per-interface basis. Traffic speed should be monitored to detect surges in traffic, which can indicate a misbehaving application that uses too much bandwidth. If several active network interfaces exist, additional Traffic speed monitors can be added to watch those connections as well.

Traffic volume total is also measured on a per-interface basis and counts the inbound, outbound or combined traffic registered on the selected network interface.

Monitoring traffic volume helps detect resource-consuming, malicious or runaway applications. If the device has several interfaces, separate Traffic volume monitors can be added.
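Both traffic figures can be derived from per-interface octet counters sampled at two points in time. The sketch below is one way to do the arithmetic, under stated assumptions: 32-bit inbound and outbound octet counters that may wrap around once between polls, and 1 kbit taken as 1000 bits. The function names are hypothetical and do not reflect the product's internals.

```python
OCTET_COUNTER_MODULUS = 2 ** 32  # assumption: 32-bit octet counters


def octets_between(previous: int, current: int,
                   modulus: int = OCTET_COUNTER_MODULUS) -> int:
    """Octets counted between two samples, allowing for one wraparound."""
    return (current - previous) % modulus


def total_traffic(prev_in: int, cur_in: int, prev_out: int, cur_out: int,
                  interval_seconds: float):
    """Return (speed in kbit/s, volume in bytes) for inbound plus outbound."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    volume_bytes = (octets_between(prev_in, cur_in)
                    + octets_between(prev_out, cur_out))
    speed_kbit_s = volume_bytes * 8 / 1000 / interval_seconds
    return speed_kbit_s, volume_bytes


speed, volume = total_traffic(0, 750_000, 0, 500_000, interval_seconds=10)
print(speed, volume)  # 1000.0 1250000
```

In this example 1,250,000 bytes transferred in 10 seconds correspond to a total speed of 1000 kbit/s.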

The Users logged in monitor returns the number of users with active sessions on the device (via an ssh connection, by running terminal processes, or at the console).

LinkUp/LinkDown trap – this monitor changes its state to OK if a LinkUp SNMP trap is received from the selected network interface, and to Down if a LinkDown SNMP trap is received from the selected network interface. The monitor remains in the Unknown state until the first trap is received. Note that you need to specify the network interface for this monitor manually.
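The state transitions described above form a small state machine, sketched here in Python. The class and method names, and the use of an interface index to identify the selected interface, are assumptions made for illustration only.

```python
from enum import Enum


class State(Enum):
    UNKNOWN = "Unknown"
    OK = "OK"
    DOWN = "Down"


class LinkTrapMonitor:
    """Tracks link state for one manually selected interface."""

    def __init__(self, interface_index: int):
        self.interface_index = interface_index
        self.state = State.UNKNOWN  # no trap received yet

    def on_trap(self, trap_name: str, interface_index: int) -> State:
        """Update the state from a received trap and return the new state."""
        if interface_index != self.interface_index:
            return self.state  # trap is for another interface: ignore it
        if trap_name == "linkUp":
            self.state = State.OK
        elif trap_name == "linkDown":
            self.state = State.DOWN
        return self.state


monitor = LinkTrapMonitor(interface_index=2)
print(monitor.state.value)                  # Unknown
print(monitor.on_trap("linkDown", 2).value)  # Down
print(monitor.on_trap("linkUp", 3).value)    # Down (other interface)
print(monitor.on_trap("linkUp", 2).value)    # OK
```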

Server health checks – Generic SNMP-enabled device use cases

process count should be used for every server running specific services (processes); create a Process count monitor for every service that must run constantly (you may need to clone the monitor several times)
if high availability is expected, Load average should be monitored, along with free disk space, free physical memory and CPU usage
file servers need enough free disk space, a good amount of free physical memory and a low load average; they also depend on sufficiently high traffic speed
backup devices should have stable traffic speed and plenty of free disk space
game and media streaming devices benefit from free physical memory and high traffic speed, and should have CPU capacity to spare
print servers use the same monitors as file servers and additionally monitor LPD Queue and LPD Status

Server health checks – Generic SNMP-enabled device tips

the PING monitor should be turned on in any case; it serves both as a general connectivity check and as the monitor that other monitors depend on (if PING goes down, the dependent monitors are stopped by dependency)
several file systems may be mounted; add more free space monitors if required (when you need to ensure that all vital file systems are checked for low free space)
do not enable all the monitors by default; only watch the metrics that are essential
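The dependency behaviour mentioned in the first tip can be pictured with a short sketch: a monitor whose parent (here PING) is Down is reported as stopped rather than producing its own alert. The data layout, the function name and the state strings below are hypothetical and only illustrate the idea.

```python
from typing import Dict


def effective_state(name: str, raw_states: Dict[str, str],
                    parents: Dict[str, str]) -> str:
    """Return the monitor's state, or 'Stopped by dependency' if any
    monitor it depends on (directly or indirectly) is Down."""
    seen = {name}
    parent = parents.get(name)
    while parent is not None and parent not in seen:
        if raw_states[parent] == "Down":
            return "Stopped by dependency"
        seen.add(parent)  # guards against accidental dependency cycles
        parent = parents.get(parent)
    return raw_states[name]


raw_states = {"PING": "Down", "Uptime": "OK", "Users logged in": "OK"}
parents = {"Uptime": "PING", "Users logged in": "PING"}

print(effective_state("PING", raw_states, parents))    # Down
print(effective_state("Uptime", raw_states, parents))  # Stopped by dependency
```

With this arrangement a single unreachable device results in one Down state for PING instead of a separate alert from every monitor that depends on it.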

Templates overview

IPHost Network Monitor provides application templates (referred to simply as “templates” in the rest of this document) for creating multiple relevant monitors in only a few clicks. Templates make it easy to add typical sets of monitors; this is particularly useful in big networks, where creating the same type of monitors for many devices of the same type is a common task. An application template is a set of monitors that can be added to a given host at once, using specific predefined parameters. The set added for a given host is displayed as a separate node in the tree view pane and is called an application.

There are predefined templates; users can also create templates of their own, either from existing monitors or by cloning a predefined template. User-added template definitions are saved in XML files and can thus be conveniently extended or adapted to specific needs.