I received a new HPC Multi-core server – Learning from standard benchmarks - Reality Check: Server Insights

 

Now that we have configured the hardware components and firmware settings in a known and hopefully optimal way, it is time to run the first performance test.  Personally, I like to run the memory latency test from lmbench first, since there is a lot to learn from it.  This benchmark measures the time a core needs to load data from the caches and from memory, across a range of working-set sizes.

A modern server has multiple levels of cache, usually two or three.  The highest level (the one farthest from the core) may be shared among several cores.  The output of lmbench shows a plateau for each level of cache: the height of the plateau gives the latency of that level, and the working-set size at which the plateau ends gives its size.  If the system has a NUMA memory organization, then lmbench also shows the latency for each of the NUMA "hops" as the data traverses the system topology.
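To make the plateau idea concrete, here is a minimal sketch of how one might pull plateaus out of a latency curve. The function name `find_plateaus`, the 25% jump threshold, and the sample numbers are all my own illustrative assumptions; they are not lmbench output, and real curves have transition points between plateaus that would need more careful handling.

```python
from typing import List, Tuple

# Made-up (working_set_kb, latency_ns) samples, for illustration only.
SAMPLES: List[Tuple[int, float]] = [
    (8, 1.2), (16, 1.2), (32, 1.2),
    (64, 4.0), (128, 4.0), (256, 4.0),
    (512, 15.0), (1024, 15.0), (4096, 15.0),
    (16384, 90.0), (65536, 90.0),
]


def find_plateaus(samples, tolerance: float = 0.25):
    """Group samples into plateaus of roughly constant latency.

    A new plateau starts when a sample's latency exceeds the mean of the
    current plateau by more than `tolerance` (a fraction, assumed value).
    Returns a list of (largest_size_kb_on_plateau, mean_latency_ns).
    """
    plateaus = []
    current = []
    for size_kb, latency_ns in sorted(samples):
        if current:
            mean = sum(lat for _, lat in current) / len(current)
            if latency_ns > mean * (1 + tolerance):
                # Latency jumped: close the current plateau.
                plateaus.append((current[-1][0], mean))
                current = []
        current.append((size_kb, latency_ns))
    if current:
        mean = sum(lat for _, lat in current) / len(current)
        plateaus.append((current[-1][0], mean))
    return plateaus


if __name__ == "__main__":
    for size_kb, latency_ns in find_plateaus(SAMPLES):
        print(f"plateau up to {size_kb} KB at about {latency_ns:.1f} ns")
```

For every plateau except the last, the reported size approximates the capacity of a cache level. The last plateau is main memory, so its "size" is simply the largest working set that was tested.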

Usually, the latency of each cache level is a fixed number of processor clock cycles.  It is both interesting and important to know this number.  It is interesting because it lets you compare different cache architectures, even when the systems you are comparing run at different clock speeds.  For example, it is interesting to me that the first-level cache latency of several modern servers, with very different architectures, is 3 cycles.
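The conversion behind that comparison is just latency in nanoseconds multiplied by the clock in GHz. A small sketch, where the function name and both example systems are hypothetical:

```python
def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a measured latency to clock cycles.

    One cycle lasts 1 / clock_ghz nanoseconds, so
    cycles = latency_ns * clock_ghz.
    """
    return latency_ns * clock_ghz


if __name__ == "__main__":
    # Hypothetical systems: different clocks, different latencies in ns.
    print(latency_in_cycles(1.0, 3.0))  # system A: 1.0 ns at 3.0 GHz
    print(latency_in_cycles(1.5, 2.0))  # system B: 1.5 ns at 2.0 GHz
```

In this made-up example both systems come out at 3 cycles, even though system B looks 50% slower when measured in nanoseconds.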

And it is important because you are not always sure what clock speed your system is running at, and you can compute it from the lmbench results.  Or you might encounter the problem we hit yesterday: on a 2-processor pre-production server, the two processors were running at two different clock speeds!  This is of course not good.  A user had observed strange performance, but lmbench took the mystery out of the problem and told us exactly what was happening.
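Going the other way, if you know the first-level cache latency in cycles, the measured latency gives you the clock. The sketch below assumes a 3-cycle first-level cache (true of some processors, not all) and uses made-up per-socket measurements; the function names and the 5% tolerance are my own choices.

```python
from typing import Dict, Tuple


def clock_ghz_from_latency(l1_latency_ns: float, l1_cycles: int = 3) -> float:
    """Infer the clock from a measured L1 latency.

    Assumes the L1 latency is a known, fixed number of cycles
    (3 here, which is an assumption about the processor).
    """
    return l1_cycles / l1_latency_ns


def clocks_differ(
    l1_latency_ns_by_socket: Dict[int, float],
    l1_cycles: int = 3,
    rel_tol: float = 0.05,
) -> Tuple[bool, Dict[int, float]]:
    """Report whether sockets appear to run at different clock speeds."""
    clocks = {
        socket: clock_ghz_from_latency(latency, l1_cycles)
        for socket, latency in l1_latency_ns_by_socket.items()
    }
    fastest = max(clocks.values())
    slowest = min(clocks.values())
    return (fastest - slowest) / fastest > rel_tol, clocks


if __name__ == "__main__":
    # Hypothetical measurements, one L1 latency per socket.
    mismatch, clocks = clocks_differ({0: 1.0, 1: 1.5})
    print(mismatch, clocks)
```

To get one measurement per socket, the latency test has to be run once on a core of each processor.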

You can run lmbench in different ways, and each provides additional understanding. 

The "stride" method steps sequentially through memory addresses at a fixed stride and provides the best-case latency.

The "random" method accesses memory addresses randomly, showing the worst-case latency.  It is very useful to know the best-case and worst-case latencies - if their ratio is small, then memory performance is somewhat predictable.  If the ratio is large, prediction becomes difficult.
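The difference between the two methods comes down to the order in which a chain of dependent loads visits memory. Here is a rough sketch of how such chains could be built. This is not lmbench's code; `build_chain` and `walk` are hypothetical names, and timing this loop in Python would mostly measure the interpreter, so treat it only as a picture of the access order. A real measurement would do the same walk in C.

```python
import random
from typing import List


def build_chain(n_slots: int, pattern: str, seed: int = 0) -> List[int]:
    """Return chain where chain[i] is the slot to visit after slot i.

    "stride": slots are visited in address order (best case).
    "random": slots are visited in a shuffled order (worst case).
    Either way the chain is one cycle that covers every slot once.
    """
    order = list(range(n_slots))
    if pattern == "random":
        random.Random(seed).shuffle(order)
    elif pattern != "stride":
        raise ValueError(f"unknown pattern: {pattern!r}")
    chain = [0] * n_slots
    for i, slot in enumerate(order):
        chain[slot] = order[(i + 1) % n_slots]
    return chain


def walk(chain: List[int], steps: int, start: int = 0) -> List[int]:
    """Follow the chain; each step depends on the result of the previous one."""
    visited = []
    position = start
    for _ in range(steps):
        visited.append(position)
        position = chain[position]
    return visited


if __name__ == "__main__":
    print(walk(build_chain(8, "stride"), 8))
    print(walk(build_chain(8, "random"), 8))
```

Because each load's address comes from the previous load, the hardware cannot overlap the accesses, which is what makes the per-step time a latency rather than a bandwidth figure.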

If you set lmbench to access memory in units of the VM page size, then you can observe the impact of TLB misses on latency.
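A quick way to see why a page-sized stride stresses the TLB is to count how many distinct pages a walk touches. The sketch below assumes 4 KB pages and a 64-entry TLB; both numbers are placeholders, so check them for your own processor and OS.

```python
def pages_touched(array_bytes: int, stride_bytes: int, page_bytes: int = 4096) -> int:
    """Count the distinct pages hit when stepping through an array."""
    return len({offset // page_bytes for offset in range(0, array_bytes, stride_bytes)})


def accesses_per_page(array_bytes: int, stride_bytes: int, page_bytes: int = 4096) -> float:
    """Average number of accesses that share one address translation."""
    n_accesses = len(range(0, array_bytes, stride_bytes))
    return n_accesses / pages_touched(array_bytes, stride_bytes, page_bytes)


def exceeds_tlb(array_bytes: int, stride_bytes: int,
                tlb_entries: int = 64, page_bytes: int = 4096) -> bool:
    """True if the walk needs more translations than the (assumed) TLB holds."""
    return pages_touched(array_bytes, stride_bytes, page_bytes) > tlb_entries


if __name__ == "__main__":
    one_mib = 1 << 20
    print(accesses_per_page(one_mib, 64))    # 64-byte stride
    print(accesses_per_page(one_mib, 4096))  # page-sized stride
    print(exceeds_tlb(256 * 1024, 4096), exceeds_tlb(512 * 1024, 4096))
```

With a 64-byte stride, 64 consecutive accesses share one translation; with a page-sized stride every access needs a new one, so once the array spans more pages than the TLB has entries, each access pays for a TLB miss.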

And if you run lmbench simultaneously on cores that share a cache, you learn about the behavior of the shared cache.
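One way to set up such a run on Linux is to pin one copy of the benchmark to each core with `taskset -c` and start them together. The sketch below only builds and launches the commands. The core numbers and the benchmark command line are placeholders that you must supply, and which cores actually share a cache depends on the processor (on Linux, the `shared_cpu_list` files under `/sys/devices/system/cpu/cpu*/cache/` are one place to look).

```python
import subprocess
from typing import List, Sequence


def pinned_commands(cores: Sequence[int], benchmark_cmd: Sequence[str]) -> List[List[str]]:
    """Build one `taskset -c <core> <benchmark...>` command per core."""
    return [["taskset", "-c", str(core), *benchmark_cmd] for core in cores]


def run_together(commands: List[List[str]]) -> List[int]:
    """Start all commands at roughly the same time and wait for them all.

    Returns the exit code of each process, in the order given.
    """
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Placeholder values: cores 0 and 1 are ASSUMED to share a cache, and
    # "./my_latency_test" stands in for your actual benchmark invocation.
    for command in pinned_commands([0, 1], ["./my_latency_test"]):
        print(" ".join(command))
```

Comparing the latency curve from a solo run with the curves from the simultaneous run shows how the shared level behaves when its capacity is contested.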

Measuring memory latency is a great tool if you need to unravel the architecture of a new server!


Posted 10-15-2008 10:07 PM by d-field
