Live resource usage

2018-12-19

We have been monitoring most farm machines with munin for some time. This allows anyone to check whether a machine is heavily loaded or not before starting to use it, but it is time-consuming to check many graphs from several machines.

To quickly see the load of a machine, we now display a usage bar directly in the list of machines! We currently show 3 metrics: CPU usage, memory usage, and disk I/O load. The values are based on the last 48 hours of munin data and should give a good overview of current usage and expected performance.

Nevertheless, these values are only indicative and reflect average usage. If you plan to use a machine for heavy tasks, you should check the munin graphs to better understand the usage pattern and make sure you don't disrupt the tasks of existing users. In addition, a few values are missing or incorrect: this happens either because the machine was not recently monitored by munin, or because the total number of CPUs is incorrectly detected.

The value displayed in the usage bars is computed as a weighted average that gives more importance to high usage values. For instance:

  • a machine that is 20% busy for 100% of the time will get a weighted average of 20%
  • a machine that is 50% busy for 40% of the time will get a weighted average of 43.6%
  • a machine that is 100% busy for 20% of the time will get a weighted average of 63.8%

In each case, the arithmetic average would be the same (20% usage). However, we consider that the machine is more loaded in the third case. Intuitively, if you would like to run a new task, using a machine that is already 100% busy (even if only 20% of the time) is generally a bad idea: the new task might significantly interfere with the existing ones. In contrast, a machine that is constantly 20% busy still has a lot of room for more tasks.
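
To make the comparison concrete, here is a minimal Python sketch of the three cases above. The exact weighting formula used for the usage bars is not given in this post, so everything here is an assumption: the function names, the `SHARPNESS` constant, and the scheme itself (samples sorted by value, then weighted exponentially by rank) are hypothetical. It only illustrates the general idea of a weighted average that gives more importance to high usage values.

```python
import math

# Assumed steepness of the rank weights; NOT taken from the actual site code.
SHARPNESS = 5.0


def arithmetic_mean(samples):
    """Plain average: every sample counts equally."""
    return sum(samples) / len(samples)


def rank_weighted_mean(samples, sharpness=SHARPNESS):
    """Weighted average in which the highest samples count the most.

    Hypothetical scheme: samples are sorted in ascending order and the
    sample at relative rank r (between 0 and 1) gets weight exp(sharpness * r).
    """
    ordered = sorted(samples)
    n = len(ordered)
    weights = [math.exp(sharpness * (i + 0.5) / n) for i in range(n)]
    return sum(w * x for w, x in zip(weights, ordered)) / sum(weights)


def usage_profile(busy_percent, busy_fraction, n=1000):
    """Build n samples: busy_percent for busy_fraction of the time, 0 otherwise."""
    busy = round(n * busy_fraction)
    return [float(busy_percent)] * busy + [0.0] * (n - busy)


if __name__ == "__main__":
    # The three cases from the list above: (busy level in %, fraction of time).
    for busy_percent, busy_fraction in [(20, 1.0), (50, 0.4), (100, 0.2)]:
        samples = usage_profile(busy_percent, busy_fraction)
        print(
            f"{busy_percent}% busy for {busy_fraction:.0%} of the time: "
            f"arithmetic {arithmetic_mean(samples):.1f}%, "
            f"weighted {rank_weighted_mean(samples):.1f}%"
        )
```

Under these assumptions, the arithmetic mean is 20% for all three profiles, while the weighted value stays at 20% for the first case and rises to roughly 43.5% and 63.6% for the second and third. This is close to, but not identical to, the figures quoted above, so the real computation likely differs in its details.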