Performance monitoring best practices

When you design a server performance monitoring system, there are several things to consider. Best practices for implementing such a system are:

  • Set up a monitoring configuration
  • Keep monitoring overhead low
  • Use a centralized place for monitoring
  • Analyze performance results and establish a performance baseline
  • Set alerts
  • Tune performance
  • Plan ahead

When setting up a monitoring system, you have to consider what kind of system is “good enough” for you. You will have to decide whether to go with an open source monitoring system or a commercial one. Since I’m not a fan of commercial closed source systems, I will focus on open source solutions:

  1. Nagios – Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes
  2. Cacti – Cacti is a complete network graphing solution designed to harness the power of RRDTool’s data storage and graphing functionality. Cacti provides a fast poller, advanced graph templating, multiple data acquisition methods, and user management features out of the box.
  3. Munin – Munin the monitoring tool surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Its emphasis is on plug and play capabilities.
  4. You may also want to take a look at http://compari.tech/bandwidthmonitoring for some other useful bandwidth monitoring tools

Nagios has the advantage that it can be set up to send SMS alerts to predefined groups of users when a problem is detected. Cacti has the advantage that you can evaluate how your systems perform over time and get a good idea of the trends. Munin can monitor certain aspects better than Cacti, but it is more invasive on the systems you install it on.

The next thing is to keep the monitoring overhead low. This can be done as follows:

  1. Don’t query the servers too often.
  2. Run the monitoring system on a standalone server that does monitoring and nothing else.
  3. Archive unneeded data.
  4. Use asynchronous requests when possible.
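
As a rough illustration of points 1 and 4, here is a minimal Python sketch that polls several servers concurrently at a deliberately long interval. The probe `check_host`, the host names and the 5-minute interval are assumptions made up for this example; a real probe would query SNMP, an agent or an HTTP endpoint.

```python
import asyncio

# Assumption: polling every 5 minutes is enough; tune this to your environment.
POLL_INTERVAL_SECONDS = 300


async def check_host(host):
    """Hypothetical probe. The sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return {"host": host, "status": "ok"}


async def poll_once(hosts, probe=check_host, timeout=5.0):
    """Query all hosts concurrently so one slow server does not delay the rest."""

    async def guarded(host):
        try:
            return await asyncio.wait_for(probe(host), timeout)
        except asyncio.TimeoutError:
            # A probe that takes too long is reported instead of blocking the round.
            return {"host": host, "status": "timeout"}

    # gather() returns results in the same order as the hosts were given.
    return await asyncio.gather(*(guarded(h) for h in hosts))


async def poll_loop(hosts, interval=POLL_INTERVAL_SECONDS, rounds=None):
    """Poll repeatedly; rounds=None means run until cancelled."""
    done = 0
    while rounds is None or done < rounds:
        for result in await poll_once(hosts):
            print(result)
        done += 1
        if rounds is None or done < rounds:
            await asyncio.sleep(interval)
```

Something like `asyncio.run(poll_loop(["web1", "db1"]))` would start the loop. The long interval keeps the load on the monitored servers low, and the timeout keeps a hung server from stalling the whole round.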

In the previous point I said that the monitoring system should run on a standalone server. That is exactly what a centralized place for monitoring means.

Ideally, all logs from different areas of monitoring should be stored in a centralized place where one UI can be used to analyze the data. Based on your user scenarios, consider identifying which teams to partner with, so log data can be viewed as a coherent whole. The reasons behind centralization are:

  1. You can easily implement strict user control, user policies and procedures (you will need this if you have to be Sarbanes-Oxley compliant).
  2. It minimizes admin time. Imagine having 20 servers, each with its own monitoring system.
  3. Giving users access to the relevant graphs and logs is easy.
  4. You get an overview of the whole system.

After you have implemented the system and data starts to pile up, you can analyze the performance results. This should be done as often as possible in order to identify trends and also to catch “exceptions”. For example, at the end of each month, servers that run accounting will have a higher load than on a normal day. If you do not pay attention, you might find yourself in a pretty delicate position when users request more capacity or more processing power even though, according to the trend, it is not necessary.
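
One simple way to separate a trend from an “exception” is to compare each new sample against the mean and standard deviation of past samples. The sketch below is only one possible approach, not the method of any particular tool. The function names, the sample values and the tolerance of 3 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples (for example, load averages) as a baseline.

    Needs at least two samples, because stdev() is undefined for fewer.
    """
    return {"mean": mean(samples), "stdev": stdev(samples)}


def is_exception(value, baseline, tolerance=3.0):
    """Flag a value that lies more than `tolerance` standard deviations from the mean."""
    return abs(value - baseline["mean"]) > tolerance * baseline["stdev"]


# Hypothetical load averages collected on normal days.
normal_days = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
baseline = build_baseline(normal_days)

print(is_exception(1.1, baseline))  # an ordinary day
print(is_exception(4.5, baseline))  # for example, the end-of-month accounting run
```

A known, recurring exception such as the end-of-month accounting load is better handled with its own baseline than treated as a reason to buy more capacity.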

After establishing a performance baseline, you can set alerts for moments when systems behave out of the ordinary or for problems with the system. For example, if a server uses 15 GB out of 16 GB of RAM, you might want to be notified so that you can schedule downtime to add more RAM or see what is going on with the applications running on that server.
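
The RAM example can be expressed as a simple threshold check. This is a minimal sketch; the function name and the 80% and 90% thresholds are assumptions picked for illustration, and the OK / WARNING / CRITICAL labels mirror the service states Nagios uses.

```python
def memory_alert(used_gb, total_gb, warn_ratio=0.80, crit_ratio=0.90):
    """Classify memory usage against warning and critical thresholds."""
    ratio = used_gb / total_gb
    if ratio >= crit_ratio:
        return "CRITICAL"
    if ratio >= warn_ratio:
        return "WARNING"
    return "OK"


# 15 GB used out of 16 GB is 93.75%, which is above the assumed 90% threshold.
print(memory_alert(15, 16))
```

In practice the thresholds should come from the baseline: a server that normally sits at 85% memory usage needs different limits than one that normally sits at 30%.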

Performance tuning is a delicate job and takes an awful lot of time, because a system can only be optimized for a given scenario. If the data doesn’t fit that scenario, you might need to adjust server parameters to adapt. Databases, Apache servers and kernel parameters can all be tuned to suit your needs.

The baseline and performance graphs also allow you to plan ahead for the evolution of your systems. For example, you can predict with good accuracy when, or if, you will need to purchase new hardware or upgrade your existing systems.
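
A very rough way to make such a prediction is to fit a straight line to past usage and see when it crosses the available capacity. The sketch below assumes one sample per day and linear growth, which real systems rarely follow exactly; the function name and the numbers are made up for illustration.

```python
def days_until_full(daily_usage, capacity):
    """Estimate days until `capacity` is reached, using a least-squares line.

    `daily_usage` holds one sample per day, oldest first (at least two samples).
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Day index where the fitted line hits capacity, counted from the last sample.
    return (capacity - intercept) / slope - (n - 1)


# Hypothetical disk usage in GB over five days, on a 200 GB volume.
print(days_until_full([100, 102, 104, 106, 108], 200))  # 46.0
```

Here usage grows by 2 GB per day, so the remaining 92 GB would last about 46 days. The longer the history you feed in, the less a single unusual day distorts the estimate.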