Linux Kernel: where is it going?

At the Linux Foundation Collaboration Summit on 14 April 2010 there was a very interesting Q&A session about the Linux kernel. Here are the main ideas discussed:

1) Documentation (or rather the lack of it) – the documentation should live inside the kernel tree itself.

2) Need for new blood. Some of the original developers are getting old.

3) Keeping independent thinkers involved is very important.

4) Fixing the scheduler is around the corner. RTT will be merged into the kernel.

5) Android is forking part of the kernel. There will be attempts to merge code between Google and the mainline Linux kernel.

6) A person (or persons) is needed to reduce kernel bloat (the kernel footprint).

DRBD is finally in the kernel

DRBD refers to block devices designed as a building block for high availability (HA) clusters. This is done by mirroring a whole block device over a dedicated network link. DRBD can be understood as network-based RAID-1.
Today I was happy to find that DRBD has been added to the upcoming Linux kernel 2.6.33. Until now TFM could support DRBD, but the support sometimes became obsolete, or we ran into compilation problems, because we keep up to date with the kernel tree and DRBD was a little slow to follow when big changes appeared in the kernel.
In the next few days we will prepare and add some ready-to-use DRBD examples to TFM.
Ralink’s RT2860 drivers and support for RT3090 PCI Wi-Fi chips were also added to the kernel. As usual, around Christmas there are a lot of developments in the Linux world.

Performance monitoring best practices

When you design a server performance monitoring system there are several things you will have to consider. Best practices when implementing such a system are:

  • Set up a monitoring configuration
  • Keep monitoring overhead low
  • Centralized place for monitoring
  • Analyze performance results and establish a performance baseline
  • Set alerts
  • Tune performance
  • Plan ahead

When setting up a monitoring system you have to consider what kind of system is “good enough” for you. You will have to decide whether to go with an open-source monitoring system or a commercial one. Since I’m not a fan of commercial closed-source systems, I will focus on open-source solutions:

  1. Nagios – Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes.
  2. Cacti – Cacti is a complete network graphing solution designed to harness the power of RRDTool’s data storage and graphing functionality. Cacti provides a fast poller, advanced graph templating, multiple data acquisition methods, and user management features out of the box.
  3. Munin – Munin, the monitoring tool, surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Its emphasis is on plug-and-play capabilities.
  4. You may also want to take a look at http://compari.tech/bandwidthmonitoring for some other useful bandwidth monitoring tools.

Nagios has the advantage that it can be set up to send SMS alerts to predefined groups of users when problems occur. Cacti has the advantage that you can evaluate how your systems perform over time and get a good idea of the trends. Munin can monitor certain aspects better than Cacti, but it is more invasive on the systems you install it on.
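
To show how Nagios actually decides whether something is OK, here is a minimal sketch of a custom check plugin in Python. It relies only on the standard Nagios plugin convention (a one-line status message plus exit codes 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN); the load thresholds are invented and would need tuning for your hardware.

```python
#!/usr/bin/env python3
"""Minimal Nagios-style check: warn on the 1-minute load average.

Nagios plugins communicate through their exit code: 0 = OK, 1 = WARNING,
2 = CRITICAL, 3 = UNKNOWN. The thresholds below are placeholders only.
"""
import sys

WARN = 4.0   # assumed warning threshold
CRIT = 8.0   # assumed critical threshold

try:
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])   # first field = 1-minute load
except OSError as err:
    print(f"LOAD UNKNOWN - cannot read /proc/loadavg: {err}")
    sys.exit(3)

if load1 >= CRIT:
    print(f"LOAD CRITICAL - load1={load1:.2f}")
    sys.exit(2)
elif load1 >= WARN:
    print(f"LOAD WARNING - load1={load1:.2f}")
    sys.exit(1)

print(f"LOAD OK - load1={load1:.2f}")
sys.exit(0)
```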

The next thing is to keep the monitoring overhead low. This can be done in several ways:

  1. Don’t query the servers too often.
  2. Run the monitoring system on a standalone server that does monitoring and nothing else.
  3. Archive unneeded data.
  4. Use asynchronous requests when possible (see the sketch after this list).
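
To illustrate the last point, here is a rough asyncio sketch that probes several hosts in parallel instead of one after another, so a slow or dead host does not stretch the whole polling cycle. The host names and the port are placeholders; a real poller would check more than a plain TCP connect.

```python
#!/usr/bin/env python3
"""Poll several servers concurrently instead of one by one."""
import asyncio

HOSTS = ["srv1.example.com", "srv2.example.com", "srv3.example.com"]  # placeholders
PORT = 22        # assumed service port to probe
TIMEOUT = 5      # seconds per host

async def check(host):
    """Return (host, True) if a TCP connection succeeds within the timeout."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, PORT), TIMEOUT)
        writer.close()
        await writer.wait_closed()
        return host, True
    except (OSError, asyncio.TimeoutError):
        return host, False

async def main():
    # All probes run concurrently; total time is roughly one timeout, not N.
    results = await asyncio.gather(*(check(h) for h in HOSTS))
    for host, up in results:
        print(f"{host}: {'up' if up else 'DOWN'}")

asyncio.run(main())
```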

One of the points above was that the monitoring system should run on a standalone server. This is exactly what a centralized place for monitoring means.

Ideally, all logs from different areas of monitoring should be stored in a centralized place where one UI can be used to analyze the data. Based on your user scenarios, consider identifying which teams to partner with, so log data can be viewed as a coherent whole. The reasons behind centralization are:

  1. You can easily implement strict user control / user policies / procedures (you will need this if you have to be Sarbanes-Oxley compliant).
  2. It minimizes admin time. Imagine having 20 servers, each with its own monitoring system.
  3. Giving certain users access to the relevant graphs / logs is easy.
  4. You can get an overview of the whole system.

After you have implemented the system and data starts to pile up, you can analyze the performance results. This should be done as often as possible in order to identify trends and also to catch “exceptions”. For example, at the end of each month the servers that run accounting will have a higher load than on a normal day. If you do not pay attention, you might find yourself in a delicate position when users request more capacity or more processing power that, according to the trend, wasn’t actually necessary.
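
A simple way to formalize the baseline idea is to compare the latest measurement against the mean and spread of a historical window. The sketch below uses invented numbers and Python's statistics module just to show the shape of the calculation; real samples would come from your RRD files or database.

```python
#!/usr/bin/env python3
"""Build a simple baseline from historical samples and flag outliers."""
from statistics import mean, stdev

samples = [1.1, 1.3, 1.0, 1.2, 1.4, 1.1, 1.2, 3.8]   # invented daily load averages

baseline = mean(samples[:-1])   # baseline from the historical window
spread = stdev(samples[:-1])
latest = samples[-1]

# Flag anything more than two standard deviations away from the baseline.
if abs(latest - baseline) > 2 * spread:
    print(f"exception: latest={latest:.2f}, baseline={baseline:.2f} +/- {spread:.2f}")
else:
    print(f"within baseline: latest={latest:.2f}, baseline={baseline:.2f}")
```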

After establishing a performance baseline, you can set alerts for moments when systems behave out of the ordinary or for problems with the system. For example, if a server uses 15 GB of RAM out of 16 GB, you might want to be notified so you can schedule downtime to add more RAM, or investigate what is going on with the applications running on that server.
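
A very rough sketch of such an alert is shown below: it reads /proc/meminfo and sends a mail through a local MTA when memory usage crosses a threshold. The threshold, the addresses, and the assumption that a mail server is listening on localhost are all placeholders; in practice you would normally let Nagios handle the notification.

```python
#!/usr/bin/env python3
"""Send a notification when memory usage crosses a threshold."""
import smtplib
from email.message import EmailMessage

THRESHOLD = 0.90   # alert when more than 90% of RAM is used (assumed value)

# Parse /proc/meminfo; values are reported in kB.
meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":", 1)
        meminfo[key] = int(value.split()[0])

# Note: MemFree ignores reclaimable caches, so this is a pessimistic figure.
used_ratio = 1 - meminfo["MemFree"] / meminfo["MemTotal"]

if used_ratio > THRESHOLD:
    msg = EmailMessage()
    msg["Subject"] = f"memory alert: {used_ratio:.0%} used"
    msg["From"] = "monitor@example.com"   # placeholder addresses
    msg["To"] = "admins@example.com"
    msg.set_content("Consider adding RAM or checking the applications.")
    with smtplib.SMTP("localhost") as s:  # assumes a local MTA on port 25
        s.send_message(msg)
```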

Performance tuning is a delicate job and takes an awful lot of time, because a system can only be optimized for a particular scenario. If the workload doesn’t fit that scenario, you might need to adjust server parameters to adapt to it. Databases, Apache servers, and kernel parameters can all be tuned to suit your needs.
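
As a tiny example of kernel-parameter tuning, the sketch below reads and changes vm.swappiness through /proc/sys (it has to run as root). The new value is only an assumption; to make such a change survive a reboot you would normally put it in /etc/sysctl.conf instead.

```python
#!/usr/bin/env python3
"""Read and adjust a kernel parameter through /proc/sys (requires root)."""

PARAM = "/proc/sys/vm/swappiness"
NEW_VALUE = "10"   # assumed value; whether it helps depends on the workload

# Show the current setting.
with open(PARAM) as f:
    print("current:", f.read().strip())

# Equivalent to: sysctl vm.swappiness=10 (not persistent across reboots).
with open(PARAM, "w") as f:
    f.write(NEW_VALUE)
```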

The performance baseline and graphs also allow you to plan ahead for the evolution of your systems. For example, you can predict with good accuracy when or whether you will need to purchase new hardware or upgrade your existing systems.
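
For the planning part, a straight-line extrapolation over the stored history already answers most “when do we run out?” questions. The sketch below (Python 3.10+ for statistics.linear_regression) uses invented disk-usage figures purely to show the calculation; real numbers would come from your graphs.

```python
#!/usr/bin/env python3
"""Project when disk usage will reach capacity from a monthly trend."""
from statistics import linear_regression   # Python 3.10+

months = [1, 2, 3, 4, 5, 6]
used_gb = [120, 135, 149, 166, 178, 195]   # invented monthly samples
capacity_gb = 300

# Fit used_gb ~= slope * month + intercept and solve for the full-capacity month.
slope, intercept = linear_regression(months, used_gb)
months_to_full = (capacity_gb - intercept) / slope

print(f"growth: {slope:.1f} GB/month, capacity reached around month {months_to_full:.1f}")
```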