Templates: Server Health Checks – Generic UNIX-like Server

Monitors included in the Server health checks – Generic UNIX-like server template

The Server health checks – Generic UNIX-like server application template includes a number of predefined monitors, most of which can be used with any UNIX-like server configuration. Some monitors are only useful when the corresponding service is actually in use; for example, LPD Queue and LPD Status only make sense if a printer device is installed. More about templates.

Note that not every monitor in the template is useful in every case; see the tips section below for more information. You can also add monitors to the application created from this template, thus expanding its capabilities.

Monitors description

The monitors below are part of the Generic UNIX-like server template (Server health checks category):

Load average *100, 1 min (enabled by default) is a popular metric for UNIX systems that gives an idea of how heavily the system is loaded. It reflects demand for the CPU and, on some systems such as Linux, also counts processes waiting for disk I/O; a memory shortage affects it only indirectly (for example, through swapping). As the name suggests, the monitor reports the 1-minute load average multiplied by 100.
The load average does not give a precise measure of resource consumption; it is a value to be interpreted in light of actual resource usage (for example, every process using or waiting for a CPU core adds 1 to the value). The general rule is: the lower the load average, the better.
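
To illustrate the kind of value this monitor reports, here is a minimal Python sketch that reads the 1-minute load average and scales it by 100. It assumes a system where `os.getloadavg()` is available. The helper names `scaled_load` and `load_per_core` are hypothetical, and the per-core normalization is a common rule of thumb rather than anything the template itself defines.

```python
import os


def scaled_load(load_1min: float) -> int:
    """Return the load average multiplied by 100, rounded to an integer."""
    return round(load_1min * 100)


def load_per_core(load_1min: float, cores: int) -> float:
    """Divide the load by the number of CPU cores to make hosts comparable."""
    return load_1min / max(cores, 1)


if __name__ == "__main__":
    one_min, _, _ = os.getloadavg()  # raises OSError where unsupported
    cores = os.cpu_count() or 1
    print("load average *100, 1 min:", scaled_load(one_min))
    print("load per core:", round(load_per_core(one_min, cores), 2))
```

A per-core value that stays above 1.0 usually means processes are queuing for CPU time, which is one way to interpret the raw number.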

PING (enabled by default) is a common, lightweight network availability check: it sends a special ICMP packet and waits for a response. It is widely used to check the state of network devices and to measure latency. Note, though, that ICMP can easily be blocked on the server side, which makes this check ineffective.

UPTIME (enabled by default) measures how long the server has been running. Note that security updates are now released very frequently and often require rebooting the computer. Thus, long uptime periods may indicate that the system is missing an update.
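
As a rough illustration of where an uptime value can come from, the following Python sketch parses the Linux-specific `/proc/uptime` file. This is an assumption made for the example: other UNIX-like systems expose uptime differently, and the function names are hypothetical.

```python
from pathlib import Path


def parse_uptime_seconds(text: str) -> float:
    """The first field of /proc/uptime is the number of seconds since boot."""
    return float(text.split()[0])


def format_uptime(seconds: float) -> str:
    """Render seconds as days, hours and minutes."""
    minutes = int(seconds) // 60
    days, rest = divmod(minutes, 24 * 60)
    hours, mins = divmod(rest, 60)
    return f"{days}d {hours}h {mins}m"


if __name__ == "__main__":
    path = Path("/proc/uptime")  # Linux only; assumption of this sketch
    if path.exists():
        print("uptime:", format_uptime(parse_uptime_seconds(path.read_text())))
```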

CPU usage shows the CPU load value (Total Active Time, in %). The higher the load, the more time the system is busy. Expected normal CPU load values vary depending on what the target server does, so the actual performance threshold should be chosen on a case-by-case basis.

Metrics that can be used for this monitor include: Total Active Time, User Time, System Time, Wait IO, Run queue length, Interrupts per second, Context switches per second.
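
The Total Active Time metric can be approximated from two samples of cumulative CPU counters. The sketch below assumes Linux and its `/proc/stat` format (fields user, nice, system, idle, iowait, irq, softirq, steal) and treats idle plus iowait as inactive time. This is one common way to compute such a percentage and not necessarily how the monitor does it; the function names are hypothetical.

```python
import time
from pathlib import Path
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Return the first eight counters of the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # user..steal; guest time is already included in user time
    return [int(value) for value in parts[1:9]]


def active_percent(before: List[int], after: List[int]) -> float:
    """Share of non-idle time between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    inactive = deltas[3] + deltas[4]  # idle + iowait
    return 100.0 * (total - inactive) / total


if __name__ == "__main__":
    stat = Path("/proc/stat")  # Linux only; assumption of this sketch
    if stat.exists():
        first = parse_cpu_line(stat.read_text().splitlines()[0])
        time.sleep(1.0)
        second = parse_cpu_line(stat.read_text().splitlines()[0])
        print(f"Total Active Time: {active_percent(first, second):.1f}%")
```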

Platform check is a special-purpose monitor named Match monitor. The Network Discovery Wizard uses this monitor to find out whether the template can be applied to the host. It is disabled by default and there is no need to enable it, since it does not measure any useful characteristic.

Free disk space on / reports the disk space available on the root filesystem, as a percentage. By default, this monitor will warn you if less than 10% of the filesystem is free.

Depending on the software in use, a certain amount of free space must also be available; otherwise software components may start to fail. Other typical mount points (such as /tmp, /home, etc.) can be checked similarly if they reside on separate filesystems. Typically, 5-10% of total filesystem space should always be kept free.
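
The percentage check described above can be sketched in Python with the standard `shutil.disk_usage` call. The 10% threshold mirrors the default warning level mentioned earlier; the constant and function names are hypothetical, and the list of mount points is only an example.

```python
import shutil

WARN_BELOW_PERCENT = 10.0  # mirrors the default warning level described above


def free_percent(total: int, free: int) -> float:
    """Free space as a percentage of the total filesystem size."""
    return 100.0 * free / total if total else 0.0


def is_low(percent_free: float, threshold: float = WARN_BELOW_PERCENT) -> bool:
    """True when free space has dropped below the warning threshold."""
    return percent_free < threshold


if __name__ == "__main__":
    # Add "/tmp", "/home" and so on here if they are separate filesystems.
    for mount in ("/",):
        usage = shutil.disk_usage(mount)
        pct = free_percent(usage.total, usage.free)
        state = "WARNING" if is_low(pct) else "ok"
        print(f"{mount}: {pct:.1f}% free ({state})")
```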

Free physical memory shows the amount of free memory available (as a percentage). The recommended amount of free RAM varies depending on the software running on the target system. However, it is recommended to keep at least 5-10% free; if free RAM is often below that safe threshold, an alarm should be raised.
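
A free-memory percentage of this kind can be derived on Linux from `/proc/meminfo`. The sketch below is an illustration under that assumption. It uses `MemAvailable` when present and falls back to `MemFree`, which is a choice made for this example and may differ from what the monitor actually measures; the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def free_memory_percent(info: Dict[str, int]) -> float:
    """Free (or available) memory as a percentage of total RAM."""
    total = info["MemTotal"]
    free = info["MemAvailable"] if "MemAvailable" in info else info["MemFree"]
    return 100.0 * free / total if total else 0.0


if __name__ == "__main__":
    path = Path("/proc/meminfo")  # Linux only; assumption of this sketch
    if path.exists():
        pct = free_memory_percent(parse_meminfo(path.read_text()))
        print(f"free physical memory: {pct:.1f}%")
```

The same parsed dictionary also carries `SwapTotal` and `SwapFree`, from which the amount of swap in use can be derived by subtraction.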

LPD Queue (Line Printer Daemon queue length) reports how many print jobs are waiting in the queue of the specified printer device. If the queue stays above zero for a long time, check whether the printer is in working order (no error conditions, e.g. enough toner, paper loaded, etc.). This monitor is disabled by default and should be enabled only if a printer device is configured on the target host.

LPD Status (Line Printer Daemon status code) is 1 if the printer device is in a normal state and ready to accept print jobs. If details of an error state are required, SNMP should be used to get the detailed printer status (if supported). This monitor is disabled by default and should be enabled only if a printer device is configured on the target host.

Process count reports how many processes are running (total count).

SSH connection time shows the time required for a single attempt to open an SSH session (without running any actual program). The acceptable performance value for this monitor depends on the target system’s speed and load. It is advisable to watch the performance value under different loads and system states, in order to select proper thresholds and to raise alerts only on genuinely long connection times.

Swap memory in use measures the amount of memory on swap devices that is currently in use. Swapping means moving data from RAM to a storage device in order to free that RAM for use by another process. The more swap is in use, the less physical memory is available to applications and the harder applications compete for it.

Traffic speed total, kbit/s is measured on a per-interface basis. Traffic speed should be measured to detect surges in traffic consumption, which can indicate a misbehaving application using too much traffic. If several active network interfaces exist, additional Traffic speed monitors can be used to watch those connections as well.

Traffic volume total is also measured on a per-interface basis and counts inbound traffic, outbound traffic, or both, registered on the selected network interface.

Monitoring traffic volume makes it possible to check for resource-consuming, malicious or runaway applications. If the server has several interfaces, separate Traffic volume monitors can be added, which also allows monitoring VPNs and similar services when installed.
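
Both traffic monitors above can be understood in terms of per-interface byte counters: volume is the counter difference, and speed is that difference divided by the sampling interval. The following Python sketch assumes Linux and the `/proc/net/dev` layout (receive bytes in the first column after the interface name, transmit bytes in the ninth) and takes 1 kbit to be 1000 bits. It ignores counter wrap-around and resets, and the function names are hypothetical.

```python
import time
from pathlib import Path
from typing import Dict, Tuple


def parse_net_dev(text: str) -> Dict[str, Tuple[int, int]]:
    """Map interface name to (received bytes, transmitted bytes)."""
    counters: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) >= 16:  # header lines contain no colon
            counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def kbit_per_second(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Average speed between two counter samples, in kbit/s."""
    if seconds <= 0:
        raise ValueError("the sampling interval must be positive")
    return (bytes_after - bytes_before) * 8 / 1000 / seconds


if __name__ == "__main__":
    path = Path("/proc/net/dev")  # Linux only; assumption of this sketch
    if path.exists():
        first = parse_net_dev(path.read_text())
        time.sleep(1.0)
        second = parse_net_dev(path.read_text())
        for iface, (rx, tx) in second.items():
            if iface in first:
                before = sum(first[iface])  # inbound + outbound
                print(f"{iface}: {kbit_per_second(before, rx + tx, 1.0):.1f} kbit/s total")
```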

The Users logged in monitor returns the number of users with active sessions on the server (via an SSH connection, by running terminal processes, or at the console). Basically, this is the number of entries returned by the ‘w -h’ command. The count is useful for monitoring user activity; to detect visits by a particular user, you only need to modify the script’s command line in the monitor.
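
The counting logic described above is simple enough to sketch in Python: each line printed by `w -h` is one session, and the first column is the user name. The optional `user` filter below illustrates the idea of watching for a particular user. It is a hypothetical helper written for this example, not the monitor's actual script.

```python
import subprocess
from typing import Optional


def count_sessions(w_output: str, user: Optional[str] = None) -> int:
    """Count session lines from `w -h`, optionally for a single user only."""
    count = 0
    for line in w_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Some implementations of w truncate long user names in this column.
        if user is None or fields[0] == user:
            count += 1
    return count


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["w", "-h"], capture_output=True, text=True, check=True
        )
        print("users logged in:", count_sessions(result.stdout))
    except (OSError, subprocess.CalledProcessError):
        print("the w command is not available on this system")
```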

The Zombie count monitor counts the “zombie”, or “defunct”, processes in the system. By definition, a zombie process is one that has completed its lifecycle by calling the “exit” function but has not yet been removed from the process table. Normally, the parent process checks on its child processes and, as soon as it acknowledges their termination, the defunct processes are removed (in the vast majority of cases, quickly enough to go unnoticed). In that respect, every process becomes a zombie first and is then removed from system records. If the zombie count is positive and remains positive for a considerable time, it indicates a problem, such as a software malfunction, and can lead to a resource leak.

The Zombie count monitor should be turned on for development and resource-intensive servers, to detect resource usage problems quickly. It will warn you if the zombie count exceeds 0.
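
One way to obtain such a count is to look at the process state codes reported by `ps`, where a state beginning with `Z` marks a zombie. The Python sketch below assumes a `ps` that supports the `stat` output column (as procps on Linux does). The function name is hypothetical and this is not necessarily how the monitor itself is implemented.

```python
import subprocess


def count_zombies(ps_stat_output: str) -> int:
    """Count lines whose process state code starts with 'Z' (zombie/defunct)."""
    return sum(
        1 for line in ps_stat_output.splitlines() if line.strip().startswith("Z")
    )


if __name__ == "__main__":
    try:
        # "stat=" prints only the state column, without a header line.
        result = subprocess.run(
            ["ps", "-e", "-o", "stat="], capture_output=True, text=True, check=True
        )
        zombies = count_zombies(result.stdout)
        print("zombie count:", zombies, "(warning)" if zombies > 0 else "(ok)")
    except (OSError, subprocess.CalledProcessError):
        print("ps is not available or does not support the stat column")
```

Because short-lived zombies are normal, a single positive reading is less meaningful than a count that stays positive across several consecutive checks.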

Server health checks – Generic UNIX-like server use cases

  • the Process count monitor should be used on every server running specific services (processes). A Process count monitor should be created for every service that must be running constantly (so you might need to clone the monitor several times)
  • if high availability is expected, Load average should be monitored, as well as the amount of free disk space, free physical memory and CPU usage
  • file servers rely on sufficient free disk space, a good amount of free physical memory and a low load average; they also depend on sufficiently high traffic speed
  • backup servers should have stable traffic speed and plenty of free disk space
  • game and media streaming servers benefit from free physical memory and high traffic speed, and should have spare CPU capacity
  • development servers rely on a good amount of free physical memory, plenty of free disk space and a zero zombie count
  • print servers, along with the monitors that file servers use, should monitor LPD Queue and LPD Status

Server health checks – Generic UNIX-like server tips

  • the PING monitor should be turned on in any case; it can be used both as a general connectivity check and as the monitor that other monitors depend on (if PING goes down, the dependent monitors will be stopped by dependency)
  • there may be several file systems mounted; add more free space monitors if required (in case you need to ensure all vital file systems are checked for low-space conditions)
  • do not include all the monitors by default; only watch the metrics that are essential

Templates overview

IPHost Network Monitor provides application templates (referred to simply as “templates” in the rest of this document) for creating multiple relevant monitors in only a few clicks. Templates make it easy to add typical sets of monitors; this is particularly useful in big networks, where creating monitors of the same type for many devices of the same type is a common task. An application template is a set of monitors that can be added, with specific predefined parameters, to a given host at once. Such a set, once added to a host, is displayed as a separate node in the tree view pane and is called an application.

There are predefined templates; users can also create templates of their own, either from existing monitors or by cloning a predefined template. User-added template definitions are saved in XML files and can thus be conveniently extended or adapted to specific needs.