I received a new HPC Multi-core server today – Measuring memory BW - Reality Check: Server Insights

 

The lmbench memory latency benchmark gave us a lot of information about the new system. Next, we ran the STREAM memory bandwidth (BW) benchmark suite.

Before running HPC codes, we needed to ensure that hardware multi-threading is disabled, if the processor supports it. This feature makes each core on the server appear to the OS as 2 cores. For many applications, this technique increases throughput. But nearly every HPC application is already limited by memory BW, and a server does not have enough of it to share among twice as many hardware threads.
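One way to check whether multi-threading is still active is sketched below. It assumes a Linux server that exposes CPU topology through sysfs; the path and the helper names `count_cpus` and `multithreading_active` are our own illustration, not part of lmbench or STREAM. On other operating systems, or where the file is missing, the check returns None and the BIOS setting has to be inspected by hand.

```python
from pathlib import Path

# Assumption: Linux sysfs layout. Lists the logical CPUs that share core 0.
SIBLINGS = Path("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list")


def count_cpus(cpu_list):
    """Count logical CPUs in a Linux CPU list such as '0,8' or '0-3'."""
    total = 0
    for part in cpu_list.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            total += int(hi) - int(lo) + 1
        elif part:
            total += 1
    return total


def multithreading_active(path=SIBLINGS):
    """Return True or False if the file is readable, None if it is not."""
    try:
        return count_cpus(path.read_text()) > 1
    except OSError:
        return None


if __name__ == "__main__":
    print("hardware multi-threading active:", multithreading_active())
```

If more than one logical CPU shares core 0, the OS is seeing hardware threads rather than physical cores.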

We then measured memory BW with STREAM, the industry-standard benchmark.

There are four benchmarks in the STREAM suite: COPY, SCALE, ADD, and TRIAD. There are multiple ways to run each of these benchmarks. We started by running each benchmark serially (1 non-threaded copy of each benchmark), one at a time. This shows us the maximum amount of memory BW that one core can consume.
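For readers who have not seen the four kernels, here is a minimal sketch of what each one does and how a timing turns into a BW figure. The byte counts follow STREAM's convention for 8-byte doubles: 2 words moved per iteration for COPY and SCALE, 3 for ADD and TRIAD. The function names and the array size are our own choices for illustration. The real benchmark is compiled C or Fortran with arrays much larger than the caches, so the numbers this interpreted version prints say nothing about the hardware; it only shows the arithmetic.

```python
import time
from array import array

# Bytes moved per loop iteration for 8-byte doubles, per STREAM's convention.
BYTES_PER_ITER = {"Copy": 16, "Scale": 16, "Add": 24, "Triad": 24}


def copy(a, b, c, q):
    for i in range(len(a)):
        c[i] = a[i]


def scale(a, b, c, q):
    for i in range(len(a)):
        b[i] = q * c[i]


def add(a, b, c, q):
    for i in range(len(a)):
        c[i] = a[i] + b[i]


def triad(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def bandwidth_mb_s(kernel_name, n, seconds):
    """Convert one timed pass over n elements into MB/s (1 MB = 10**6 bytes)."""
    return BYTES_PER_ITER[kernel_name] * n / seconds / 1e6


def run_once(n=100_000, q=3.0):
    """Run the four kernels once, in STREAM's order, on arrays of n doubles."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    kernels = (("Copy", copy), ("Scale", scale), ("Add", add), ("Triad", triad))
    results = {}
    for name, kernel in kernels:
        start = time.perf_counter()
        kernel(a, b, c, q)
        results[name] = bandwidth_mb_s(name, n, time.perf_counter() - start)
    return results, a, b, c


if __name__ == "__main__":
    for name, mb_s in run_once()[0].items():
        print(f"{name:6s} {mb_s:10.1f} MB/s (interpreter-bound, not a hardware figure)")
```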

Next, we measured memory BW consumed by multiples of 2 cores, up to the maximum number of cores.  There are 2 obvious ways to do this:  run multiple simultaneous copies of the serial benchmark, or run a single benchmark that is multi-threaded using SMP parallelism (via OpenMP).  We usually run the SMP-parallel STREAM benchmarks.

We ran each SMP-parallel STREAM benchmark from 2 cores to the maximum, in 2 different arrangements:

- Packed: use all cores on each processor before going to the next processor

- Cyclic: alternate cores among the processors
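The two arrangements can be sketched as orderings of (processor, core) pairs. This is only an illustration of the placement logic under the assumption that every processor has the same number of cores; the function names are ours, and how the threads are actually pinned depends on the OS and the OpenMP runtime.

```python
def packed_order(processors, cores_per_processor):
    """Fill every core on processor 0, then processor 1, and so on."""
    return [(p, c) for p in range(processors) for c in range(cores_per_processor)]


def cyclic_order(processors, cores_per_processor):
    """Take one core from each processor in turn (round-robin)."""
    return [(p, c) for c in range(cores_per_processor) for p in range(processors)]


def processors_in_use(order, n_threads):
    """Number of distinct processors touched by the first n_threads threads."""
    return len({p for p, _ in order[:n_threads]})


if __name__ == "__main__":
    # Hypothetical server: 2 processors with 4 cores each.
    for n in (2, 4, 6, 8):
        packed = processors_in_use(packed_order(2, 4), n)
        cyclic = processors_in_use(cyclic_order(2, 4), n)
        print(f"{n} threads: packed uses {packed} processor(s), cyclic uses {cyclic}")
```

With 2 threads on this hypothetical server, Packed keeps both threads on one processor while Cyclic already has one thread on each.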

In Packed mode, the memory BW usually increases monotonically with the number of cores.

In Cyclic mode, the memory BW has a zig-zag pattern and often approaches the maximum memory BW of the system when only one core per processor is used.

By observing these measurements, it is possible to determine how the memory BW is allocated to the processors.  Each processor may have a unique path to the memory system and therefore an amount of memory BW that is (roughly) independent of the memory usage of the other processors.  Or, two or more processors may share a path to memory.

Normally, the maximum measurement for each benchmark occurs when all cores in the server are used. But we have seen strangely architected systems that reach their maximum memory BW with only a subset of the cores, after which memory BW declines as more cores are added.

One interesting bit of information is obtained by comparing the memory BW consumed by all the cores on one processor with the BW consumed by 1 core. This tells us the fraction of the processor's memory BW that a single core can consume. In many cases, one core can consume the entire memory BW available to the processor, and one core can often consume a significant fraction of the total system's memory BW. When applications requiring high memory BW are run on such a server, there is little to gain from running the application on more than one core per processor (or, in the extreme case, more than one core per server).
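Both comparisons are simple arithmetic on the Packed-mode results. The sketch below uses made-up numbers purely to show the calculation; they are not measurements from our server, and the function names are ours.

```python
def single_core_fraction(bw_by_cores, cores_per_processor):
    """Share of one processor's memory BW that a single core consumes."""
    return bw_by_cores[1] / bw_by_cores[cores_per_processor]


def peak_core_count(bw_by_cores):
    """Smallest core count at which the highest BW was measured."""
    best = max(bw_by_cores.values())
    return min(n for n, bw in bw_by_cores.items() if bw == best)


# Hypothetical Packed-mode TRIAD results in MB/s, keyed by core count,
# for one processor with 4 cores. Illustrative values only.
example = {1: 4000.0, 2: 5000.0, 4: 5000.0}

if __name__ == "__main__":
    print(f"one core uses {single_core_fraction(example, 4):.0%} of the processor's BW")
    print(f"BW peaks at {peak_core_count(example)} core(s)")
```

A fraction close to 1 means the remaining cores on that processor have almost no memory BW left to use.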

All these measurements may seem excessive, but memory BW is often the performance limiter in HPC applications. If an application developer knows the STREAM results, the application can be run in an optimal way on a server: using a subset of the cores, without demanding more memory BW per core than the system can provide.
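As a rough sketch of that sizing decision, assume the developer has an estimate of the memory BW the application needs per core. The function below then picks the largest core count whose combined demand still fits within the measured BW. The function name, the parameters, and the simplification that every core has the same demand are our own assumptions.

```python
import math


def max_useful_cores(available_bw, demand_per_core, total_cores):
    """Largest core count whose combined BW demand fits in available_bw.

    All BW values must use the same unit (for example MB/s).
    Always returns at least 1, since the application needs one core to run.
    """
    if demand_per_core <= 0:
        raise ValueError("demand_per_core must be positive")
    fits = math.floor(available_bw / demand_per_core)
    return max(1, min(total_cores, fits))


if __name__ == "__main__":
    # Hypothetical figures: 10000 MB/s measured, 3000 MB/s needed per core, 8 cores.
    print(max_useful_cores(10000, 3000, 8), "cores")
```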


Posted 10-21-2008 2:39 PM by d-field
