I received a new HPC Multi-core server today – Running on a subset of cores - Reality Check: Server Insights

In some situations, it is useful to leave some of the cores on a server unused. Most processors do not have enough memory bandwidth (BW) to feed a memory-BW-intensive code running on all cores, so such codes do not "scale" perfectly. There are two common ways to define scaling: with a serial-job throughput workload (multiple serial jobs running at once), and with a single parallel job.

For a throughput workload: if scaling is perfect, then an 8-core server can run 8 copies of a serial job in the same time as one serial job.

For a single parallel job: if scaling is perfect, then an 8-core server runs an 8-way-parallel job 8 times faster than a serial job.
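The two definitions of scaling can be written down as simple ratios. The sketch below is only an illustration: the function names and the timings are made up for this example, not taken from any benchmark.

```python
def throughput_efficiency(t_one_copy: float, t_n_copies: float) -> float:
    """Throughput scaling: 1.0 means N simultaneous serial copies finish
    in the same wall-clock time as a single copy running alone."""
    return t_one_copy / t_n_copies


def parallel_speedup(t_serial: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, n_cores: int) -> float:
    """Speedup divided by core count; 1.0 means perfect scaling."""
    return parallel_speedup(t_serial, t_parallel) / n_cores


# Hypothetical timings in seconds, chosen only to illustrate the formulas.
print(throughput_efficiency(100.0, 100.0))    # 8 copies take as long as 1 copy
print(parallel_speedup(800.0, 100.0))         # perfect 8-way-parallel run
print(parallel_efficiency(800.0, 200.0, 8))   # a 4x speedup on 8 cores
```

An efficiency below 1.0 in either measure means the cores are competing for a shared resource such as memory BW.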

Most HPC jobs cannot scale perfectly, so each added core delivers less than one core's worth of extra performance. Even so, in most cases the server completes more total work (jobs per day) using all of its cores than it does with some cores unused. So why would we consider leaving some cores idle? The primary reason is the cost of running licensed applications. Many HPC applications are licensed on a per-core basis, although the cost may not be linear with the number of cores. It is useful to compare the per-core job performance to the per-core license cost to determine the best performance-to-cost operating point.
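As a rough sketch of that comparison, the snippet below divides speedup by license cost for each core count. The speedups are the best-case figures quoted later in this post; the license prices are entirely hypothetical and only stand in for a real vendor price list.

```python
# Best-case speedups over serial quoted later in this post (cores -> speedup).
speedup = {1: 1.0, 2: 2.0, 4: 3.4, 8: 4.0}

# HYPOTHETICAL license prices (arbitrary units) that grow sub-linearly
# with core count; substitute your vendor's real price list.
license_cost = {1: 1.0, 2: 1.6, 4: 2.6, 8: 4.2}


def performance_per_cost(cores: int) -> float:
    """Speedup obtained per unit of license cost at a given core count."""
    return speedup[cores] / license_cost[cores]


# Pick the core count with the highest speedup per unit of license cost.
best = max(speedup, key=performance_per_cost)

for cores in sorted(speedup):
    print(f"{cores} cores: {performance_per_cost(cores):.2f} speedup per cost unit")
print("best operating point:", best, "cores")
```

With these made-up prices the best operating point is 4 cores, not 8. A different price list can easily move it, which is why the comparison is worth doing with real numbers.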

It may be useful to run on only a subset of the cores, but doing so correctly is difficult: you need to know the architecture of your server. Here is an example, an HP ProLiant server containing two Intel Xeon Harpertown quad-core processors.

-Each processor has a separate connection to the memory system.

-Each processor has 4 cores.  Each pair of cores shares a data cache.  The 4 cores share the processor's memory BW.

If you draw a picture of this, you will see that not all combinations of cores are equal in terms of cache size and memory BW resources.
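Instead of a picture, the same layout can be written as a small table in code. This is a minimal sketch: the core numbering (0 to 7, with neighbouring numbers sharing a cache) is an assumption made for illustration, and a real system may number its cores differently.

```python
from collections import Counter

# ASSUMED numbering for illustration: core -> (processor, shared cache).
TOPOLOGY = {
    0: (0, 0), 1: (0, 0),   # processor 0, cache 0
    2: (0, 1), 3: (0, 1),   # processor 0, cache 1
    4: (1, 2), 5: (1, 2),   # processor 1, cache 2
    6: (1, 3), 7: (1, 3),   # processor 1, cache 3
}


def describe(cores):
    """Return (cores used per processor, number of caches used by two selected cores)."""
    per_processor = Counter(TOPOLOGY[c][0] for c in cores)
    per_cache = Counter(TOPOLOGY[c][1] for c in cores)
    shared = sum(1 for n in per_cache.values() if n == 2)
    return dict(per_processor), shared


print(describe([0, 2, 4, 6]))   # spread over both processors, no cache shared
print(describe([0, 1, 2, 3]))   # all on one processor, both of its caches shared
```

Two selections with the same number of cores can differ in both values, which is exactly why they do not perform the same.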

Let's say we want to run a workload consisting of one parallel job, using only 4 cores in the server. The best performance is obtained using 2 cores on each processor, and selecting cores that do not share a cache (so that each selected core has full use of an entire shared cache). The next-best performance uses 2 cores per processor, selecting cores that share a cache. The third-best performance uses 3 cores on one processor and 1 core on the other processor. The worst performance is obtained using 4 cores on one processor.

If we want to run a single parallel job on only 2 cores in the server, there are three possible choices. The best performance uses 1 core per processor. The next-best performance uses 2 cores on one processor, selecting cores that do not share a cache. The third-best performance uses 2 cores on one processor, selecting cores that share a cache.
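The orderings above can be reproduced with a simple heuristic: first prefer the selection that puts the fewest cores on any one processor (less competition for memory BW), then the one with the fewest caches used by two selected cores. This is a sketch of that heuristic only, not a performance model, and it assumes cores 0-3 sit on one processor, cores 4-7 on the other, and cores 2k and 2k+1 share a cache.

```python
from collections import Counter


def contention_key(cores):
    """Sort key, lower is better: (most cores on one processor, contended caches)."""
    per_processor = Counter(c // 4 for c in cores)   # assumed: 4 cores per processor
    per_cache = Counter(c // 2 for c in cores)       # assumed: 2 cores per cache
    contended = sum(1 for n in per_cache.values() if n > 1)
    return (max(per_processor.values()), contended)


four_core = {
    "4 cores on one processor": [0, 1, 2, 3],
    "3 + 1": [0, 1, 2, 4],
    "2 + 2, sharing caches": [0, 1, 4, 5],
    "2 + 2, separate caches": [0, 2, 4, 6],
}
two_core = {
    "same processor, shared cache": [0, 1],
    "same processor, separate caches": [0, 2],
    "1 core per processor": [0, 4],
}

# Sort the selection names from least to most contention.
ranked_four = sorted(four_core, key=lambda name: contention_key(four_core[name]))
ranked_two = sorted(two_core, key=lambda name: contention_key(two_core[name]))
print(ranked_four)
print(ranked_two)
```

The heuristic happens to match the measured order for this server and application; other codes may weigh cache size and memory BW differently.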

How big is the performance difference based on these choices? The answer depends on the application, the specific input data, and other factors. But here is an example, for a moderately parallel HPC application. Running one of its standard performance benchmarks in parallel, the code runs 4 times faster using all 8 cores (8-way-parallel) than using 1 core (serial).

Running 4-way-parallel using the above choices of 4 cores, the speedup over serial is 3.4 (best case), 2.9 (next-best case), 2.4 (third-best case), or 2.3 (worst case).

Running 2-way-parallel using the above choices of 2 cores, the speedup over serial is 2.0, 1.8, or 1.5, in the same best-to-worst order.

Clearly we can lose a lot of performance if we do not select the cores carefully!
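To put numbers on "a lot", the sketch below takes the speedups quoted above and computes the parallel efficiency of each choice and how far it falls short of the best choice. Only the labels and the helper name `report` are my own; the speedups are the ones in this post.

```python
# Speedups over serial from the text, listed best case first.
four_way = {
    "2 + 2, separate caches": 3.4,
    "2 + 2, sharing caches": 2.9,
    "3 + 1": 2.4,
    "4 cores on one processor": 2.3,
}
two_way = {
    "1 core per processor": 2.0,
    "same processor, separate caches": 1.8,
    "same processor, shared cache": 1.5,
}


def report(speedups, n_cores):
    """Print and return (name, parallel efficiency, loss relative to the best choice)."""
    best = max(speedups.values())
    rows = []
    for name, s in speedups.items():
        efficiency = s / n_cores
        loss = 1 - s / best
        rows.append((name, efficiency, loss))
        print(f"{name:32s} speedup {s:.1f}  efficiency {efficiency:.0%}  below best {loss:.0%}")
    return rows


rows_four = report(four_way, 4)
rows_two = report(two_way, 2)
```

For this benchmark, the worst 4-core choice gives up roughly a third of the performance of the best one, and the worst 2-core choice gives up a quarter.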

 

This analysis would be different for different processors or different server configurations.
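On a Linux system, one way to find out how a particular server is laid out is to read the CPU topology files under `/sys/devices/system/cpu`. The sketch below assumes Linux and assumes that the cache directory `index2` is the shared L2 cache, which is common on x86 but should be confirmed from the `level` file next to it. The helper names are my own.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Expand a Linux CPU list string such as "0-1,4" into [0, 1, 4]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def read_topology(root="/sys/devices/system/cpu"):
    """Map each CPU number to (physical package id, CPUs sharing its index2 cache)."""
    topology = {}
    for path in glob.glob(os.path.join(root, "cpu[0-9]*")):
        cpu = int(re.search(r"cpu(\d+)$", path).group(1))
        try:
            with open(os.path.join(path, "topology", "physical_package_id")) as f:
                package = int(f.read())
            with open(os.path.join(path, "cache", "index2", "shared_cpu_list")) as f:
                sharing = parse_cpu_list(f.read())
        except OSError:
            continue  # files missing: offline CPU, non-Linux system, or no cache info
        topology[cpu] = (package, sharing)
    return topology


for cpu, (package, sharing) in sorted(read_topology().items()):
    print(f"cpu {cpu}: processor {package}, shares cache with {sharing}")
```

Once the layout is known, a chosen set of cores can be applied on Linux with the `taskset` utility or, from Python, with `os.sched_setaffinity`. On a system without these files the loop simply prints nothing.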


Posted 10-29-2008 8:57 PM by d-field
