For those of you who aren't familiar with collectl, it's a lightweight performance monitoring tool that's been around for a number of years and is used on some of our largest clusters. One of its features is the ability to write data in a plottable format, and another is the ability to communicate with external programs. Over the years I've developed several utilities that take advantage of these capabilities and packaged them up as a set that is now available at http://collectl-utils.sourceforge.net. The thing that I think is pretty cool about the utility called colplot is that it's web-based, so there are no complicated switches to learn, and it can line up all the plots it generates so you can easily see what's happening on the same or multiple systems at the same points in time.
Another utility is called colmux. It can start collectl running on multiple systems and display the results in real time on the same line at any frequency you want, though once a second seems to be the most useful.
The webpage referred to above has more details on these and even some screenshots. In fact, the webpage for colmux has a picture I took of a portion of a 2300-node cluster (only 192 nodes) showing their CPU loads once a second. The display is so wide it takes 3 oversized monitors to show it all, and I just couldn't resist taking a picture of it. Even though you can't read the displays, you can still see patterns and tell which machines are busy and which are idle.
Check them out (and if you haven't tried collectl yet, try it out too), and be sure to let me and the rest of the community know what you think.
-mark
Posted 10-09-2009 3:42 PM by mark.seger@hp.com