Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...



  • Hi!

    Is Intel Xeon Scalable really better than the E5? In real life?

    I manage three small HPC clusters. We use them to run Ansys Fluent 19.2 simulations. Rough specs for the three clusters are as follows:

    • #1 - Purchased in 2013/2014, 128 cores, with nodes having two Intel Xeon E5-2643 v1 CPUs each
    • #2 - Purchased in 2016, 256 cores, with nodes having two Intel Xeon E5-2667 v4 CPUs each
    • #3 - Purchased March 2019, 544 cores, with nodes having two Intel Xeon Gold 6146 CPUs each

    All systems use Supermicro hardware.

    All three clusters have an InfiniBand network between the nodes and the head node, and scale very well. By scale I mean that job performance scales up almost linearly as we use more cores. Fast disk access is supposedly not a big concern for Ansys Fluent; core and memory speed matter more. All nodes have local SSD drives. All resulting data is written over NFS via ordinary Gb Ethernet to the head node, which has a RAID6 array with an SSD cache. I have never seen disk access become a bottleneck for the calculations. The only times I notice the system spending time on "system" rather than "user" is when someone accidentally starts a job over the Gb interconnect rather than InfiniBand. So it seems to me that the only bottleneck is the speed of the cores.
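    As an aside, a quick way to check the user/system split between two points in time is to diff two samples of the first line of /proc/stat. A minimal sketch (assuming the standard field order documented in proc(5); the helper name is mine):

```python
# Sketch: compute user vs. system CPU fractions from two /proc/stat samples.
# First-line layout: "cpu user nice system idle iowait irq softirq ..."

def cpu_fractions(sample1: str, sample2: str) -> dict:
    """Return the fraction of elapsed jiffies spent in user/system/idle."""
    def fields(sample):
        parts = sample.split()
        assert parts[0] == "cpu"
        return [int(x) for x in parts[1:8]]  # user nice system idle iowait irq softirq

    a, b = fields(sample1), fields(sample2)
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    return {
        "user": (delta[0] + delta[1]) / total,               # user + nice
        "system": (delta[2] + delta[5] + delta[6]) / total,  # system + irq + softirq
        "idle": (delta[3] + delta[4]) / total,               # idle + iowait
    }

# On a live node: read "/proc/stat" once, sleep a few seconds, read it again,
# and pass the two first lines to cpu_fractions().
```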

    Each time we have purchased a new cluster we have of course compared its performance against the old cluster, using jobs that fit on both of them, e.g. jobs of up to 128 cores, since the oldest cluster has 128 cores. In this way we feel we have been comparing them in a "fair" way.

    Cluster #2 is about 20% faster than cluster #1. The E5-2643 CPUs have a core clock of 3.3 GHz versus 3.2 GHz for the newer E5-2667 v4s. Now, I have explained to the users that the newer cluster (#2) is faster than the oldest one (#1) because of faster memory (1600 MHz for #1 versus 2400 MHz for #2) and a generally better and faster architecture. The InfiniBand interconnect is also faster (40 versus 56 Gb/s), but I don't think that matters. So even though the oldest cluster has the fastest cores, everything else is faster.

    Our newest cluster has CPUs with a 3.2 GHz core clock, a generally newer and better/faster architecture, and faster memory (2666 MHz). Based on the difference in speed between #1 and #2, I guessed that the newest cluster (#3) should be at least 20% faster than cluster #2, following the same reasoning: newer and better, with faster memory. Faster, even though the clock speed is the same as #2 (#2 has a lower clock speed than #1 but is still faster). And the 100 Gb/s InfiniBand in #3 might also contribute.

    But, no! Cluster #3 is actually slower than cluster #2... Cluster #2 is about 10% faster than the newer and presumably better cluster #3...

    So I wonder... Has anybody else seen this? Does anybody have real-world examples and cases like ours? Is Intel Scalable generally faster than E5 v4 in the real world? We find that our E5 v4 cluster is actually the fastest...

    If I have set something up the wrong way I would be delighted if someone could point it out for me. How can I check whether something is wrong? All three systems run the Rocks cluster distribution (CentOS with extras): #1 version 6.1, #2 version 6.2 and #3 version 7.0.

    I am both puzzled and disappointed right now... I hope you gurus can help us out!



  • @flomer said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    If I have set something up the wrong way I would be delighted if someone could point it out for me. How can I check whether something is wrong? All three systems run the Rocks cluster distribution (CentOS with extras): #1 version 6.1, #2 version 6.2 and #3 version 7.0.

    I don't have any experience with HPC, but based on the above it seems the Linux kernel version might be the issue. CentOS 6 ships a 2.6.32 kernel, and CentOS 7 a 3.10 kernel. Either test all three clusters on the same kernel line, or research whether there was a performance regression between those kernel versions.
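    One low-effort sanity check is to compare the kernel release strings (as reported by `uname -r` or Python's `platform.release()`) across the clusters. A sketch, assuming the usual `major.minor.patch-build` format (the helper name and example build strings are illustrative):

```python
# Sketch: parse kernel release strings so kernel lines compare naturally.

def kernel_tuple(release: str) -> tuple:
    """Turn e.g. '3.10.0-957.el7.x86_64' into (3, 10, 0)."""
    base = release.split("-")[0]
    return tuple(int(part) for part in base.split("."))

# The stock kernel lines mentioned above, e.g.:
centos6 = kernel_tuple("2.6.32-754.el6.x86_64")   # CentOS 6 line
centos7 = kernel_tuple("3.10.0-957.el7.x86_64")   # CentOS 7 line
```

    On each node, `kernel_tuple(platform.release())` gives a tuple you can compare directly, since Python tuples order element by element.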



  • @marcinozga said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    I don't have any experience with HPC, but based on the above it seems the Linux kernel version might be the issue. CentOS 6 ships a 2.6.32 kernel, and CentOS 7 a 3.10 kernel. Either test all three clusters on the same kernel line, or research whether there was a performance regression between those kernel versions.

    I too was wondering whether running different OS versions is the root of your issue.



  • @marcinozga said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    I don't have any experience with HPC, but based on the above it seems the Linux kernel version might be the issue. CentOS 6 ships a 2.6.32 kernel, and CentOS 7 a 3.10 kernel. Either test all three clusters on the same kernel line, or research whether there was a performance regression between those kernel versions.

    Managing cores above a certain number becomes difficult. Linus himself used to complain that managing more than around 16 cores required an entire core just for the scheduler. Things have improved a bit since then, but high core counts will always take more work to manage right.

    Running in an HPC environment, you'll also have to pay attention to things like program size (does it fit into L1/L2 cache) and dataset size (does it fit into L3 or available RAM).

    I'd suspect that even with the faster RAM, getting data in and out of each core could be slowing things down. Many more cores with only slightly faster RAM would be one choke point to investigate.
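    To put rough numbers on the cache side of that: the L3 sizes below are the published Intel ARK figures, and the per-core split is only a back-of-the-envelope estimate since L3 is shared (the two generations also differ in L2, roughly 256 KB per core on Broadwell versus 1 MB per core with a non-inclusive L3 on Skylake-SP):

```python
# Back-of-the-envelope: shared L3 divided by core count for the two CPUs.
# L3 totals are the published ARK figures; per-core share is an estimate.

specs = {
    "E5-2667 v4": {"l3_mb": 25.0, "cores": 8},
    "Gold 6146": {"l3_mb": 24.75, "cores": 12},
}

for name, s in specs.items():
    per_core = s["l3_mb"] / s["cores"]
    print(f"{name}: {per_core:.2f} MB L3 per core")
# → E5-2667 v4: 3.12 MB L3 per core
# → Gold 6146: 2.06 MB L3 per core
```

    So with all cores loaded, each 6146 core has roughly a third less L3 to work with, which matters for a cache-sensitive solver.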

    This is really one of the oddball use cases where servers are running, but not in a virtualized environment. That's what most server hardware is designed around these days. You could have any number of performance choke points.



  • @travisdh1 said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:


    Managing cores above a certain number becomes difficult. Linus himself used to complain that managing more than around 16 cores required an entire core just for the scheduler. They've improved things a bit, but high numbers of cores will always require more work to manage right.


    I don't think that's the problem here, as he's running jobs on 128 cores, so having more of them shouldn't matter as they will sit idle.



  • @marcinozga said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:


    I don't think that's the problem here, as he's running jobs on 128 cores, so having more of them shouldn't matter as they will sit idle.

    Uhm, if people are creating jobs that only use 128 cores on the new cluster, I know what his issue is, and it's not hardware!



    He wants to compare core-to-core performance, and there's nothing wrong with that. Newer CPUs should yield better performance, especially at the same or very close clock speeds. That's not the case for him.



  • @marcinozga said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    He wants to compare core-to-core performance, and there's nothing wrong with that. Newer CPUs should yield better performance, especially at the same or very close clock speeds. That's not the case for him.

    What's changed? He said the clock speed is the same, the RAM is faster and the InfiniBand is faster. If RAM speed and InfiniBand don't affect it, then why would you expect it to be faster? Of course, the OP said he believes RAM does affect it, so he might gain something from the faster RAM. So I agree: in general they should be the same, if not a tick faster, but definitely not slower.



  • @Dashrender said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:


    What's changed? He said the clock speed is the same, he said the RAM is faster and the Infiniband is faster. If RAM speed and infiniband don't affect it - then why would you expect it to be faster? Of course, the OP said he believed that RAM does affect it.. so he might gain something from the faster RAM... so I agree, in general, they should be the same, if not a tick faster, but definitely not slower.

    Not necessarily. There is obviously something hampering performance on the new CPU. What that could be, I don't have the time to burn reading through the CPU datasheets. It could be cache, or memory-bus issues where memory access has to take a longer path than optimal. (I dealt with that when working with SGI systems back in the day; they had special tools to tell the kernel which memory banks to prefer for which CPU.)



    If we compare the E5-2667 v4 and the Gold 6146, the technological difference is nowhere near as large as between the E5-2643 v1 and the E5-2667 v4.

    The Gold 6146 was introduced in mid 2017, the E5-2667 v4 in mid 2016. Both are built on 14 nm. Both use DDR4. The 6146 has 12 cores to the E5-2667 v4's 8, but its TDP doesn't scale to match, so it probably cannot run as many cores at turbo simultaneously as the E5-2667 v4.

    So all said and done, I'm not sure the 6146 actually is faster per core when all cores are under very high load. For a single-core task the 6146 should be faster, because then the CPU can run that core at its highest turbo frequency.

    There are also the Meltdown/Spectre mitigations, which have had a huge negative effect on performance. If the new systems carry those patches and the old ones don't, that could put the new systems at a big disadvantage.
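    On patched kernels you can read the mitigation status straight out of sysfs; a sketch (the vulnerabilities directory only exists on kernels that carry the patches, so its absence on the CentOS 6 nodes is itself informative; the function name is mine):

```python
# Sketch: summarize the kernel's Spectre/Meltdown mitigation status. Patched
# kernels expose one file per issue under this sysfs directory; unpatched
# kernels don't have the directory at all.

import os

def mitigation_report(vuln_dir="/sys/devices/system/cpu/vulnerabilities"):
    """Map vulnerability name -> kernel status line ({} if the dir is absent)."""
    report = {}
    if not os.path.isdir(vuln_dir):
        return report
    for name in sorted(os.listdir(vuln_dir)):
        with open(os.path.join(vuln_dir, name)) as f:
            report[name] = f.read().strip()
    return report

# Example usage on a node:
# for vuln, status in mitigation_report().items():
#     print(f"{vuln}: {status}")
```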

    Small things like BIOS settings can make a big difference when you are tuning for maximum performance. HPE and Dell, for instance, publish recommended BIOS settings for HPC applications that go against common IT practice but are what's needed for maximum performance.

    Have you been in contact with Supermicro and their HPC guys? I think you need experts to tune your cluster to maximum performance.



  • @Pete-S said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    There is also the whole meltdown/spectre problems that have had a huge negative effect on performance. So if the new systems have patches for this and the old one don't, it could put the new systems at a big disadvantage.

    Great point - all the more reason to test the same OS (version and patch level) on each platform, because the new one might have those updates while the older ones don't.



  • Thanks for your replies!

    It's tricky for me to change all three clusters to the same OS, since they are all in production... I barely managed to get the users to do some benchmarking before they started using the newest cluster full time... I guess I could take one node from each cluster, install CentOS 7 and test again, but then I will need another program to test with. Any ideas? Also, I cannot use CentOS 6, since I guess the newest CPUs would not be recognized.

    As for BIOS settings, I have followed the vendor's recommendations. As for help from Supermicro, there has been very little so far. They told me to check the tips in a paper by Intel and Ansys, but that did not give me much insight; it reads more like a promotional paper (try searching for "Higher Performance Across a Wide Range of ANSYS Fluent Simulations with the Intel Xeon Scalable Gold 6148 Processor"). The only comment I found useful was "The improved vectorization, which is available as a runtime option in Fluent 18.1, was used in these benchmarks." I am pretty sure that Fluent 19.2 is aware of the Scalable processor features.

    Anyway, since the core clock of the 6146 and the E5-2667 v4 is the same, I would think it should at least be equally fast. The faster memory (and we have populated all channels, as recommended) is supposed to be very important for Fluent, and this should make the 6146 cluster faster...

    Do you know of any test program I can use to check single-core and multi-core performance in a simple way from the command line? I could run it on one node of each cluster and compare. And do you know of any contact person or division at Supermicro that I can contact about this matter?
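    A minimal, portable sketch of such a single-core timing check is below; it only exercises the core (not memory bandwidth or the interconnect), so treat it as a rough first-order comparison. Purpose-built benchmarks like STREAM (memory bandwidth) or HPL/Linpack (whole node) would be far more representative:

```python
# Sketch: time a fixed amount of floating-point work on one core. Run the
# same script on one node of each cluster and compare wall times.

import time

def spin(n=2_000_000):
    """Fixed single-core floating-point loop; returns (seconds, checksum)."""
    t0 = time.perf_counter()
    acc = 0.0
    for i in range(n):
        acc += (i % 7) * 0.5
    return time.perf_counter() - t0, acc

if __name__ == "__main__":
    elapsed, _ = spin()
    print(f"single-core loop: {elapsed:.3f} s")
```

    For a crude multi-core version, launch one copy per core (e.g. pinned with `taskset`) and watch how the per-copy times change under full load.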



    @flomer I would say this is a post-sales matter, so I would go that route. You bought something and it is not performing to your expectations. Is it a hardware problem, a software configuration issue, or did you buy the wrong thing? It doesn't matter - you paid big money for it, so get the vendor to sort it out.



  • @Pete-S said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    @flomer I would say this is a post-sales matter so I would go that route. You bought something and it is not performing to your expectations. Is it a hardware problem, software configuration or did you buy the wrong thing? It doesn't matter - you paid big money for it so get the vendor to sort it out.

    Really? I get the feeling that @scottalanmiller would fully disagree with you - unless you knew the system could do what you needed/expected before you bought it. He's traveling so much, though, that he might not get back to this thread for days or more.



  • @flomer said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    But, no! Cluster #3 is actually slower than cluster #2... Cluster #2 is about 10% faster than the newer and presumably better cluster #3...
    So I wonder... Has anybody else seen this? Does anybody have real-world examples and cases like ours? Is Intel Scalable generally faster than E5 v4 in the real world? We find that our E5 v4 cluster is actually the fastest...

    Processor speeds are not set in stone; it depends on the workload. The only way to really know is to test before you buy, or to buy things that match in all but one way. The newer procs you have there have a smaller total cache with more cores, so far less cache per core.

    That one factor, depending on how your workload is configured, may make all the difference. More cores only matter if you are thread-bound and have spare cache overhead (in this case). Base frequency is the same, and turbo may or may not matter for you.

    So while for a totally average workload we might expect the newer, more expensive processor to be generally faster, for any given workload it is clearly an "it depends" comparison.



  • @Pete-S said in Performance of Intel Xeon Scalable 6146 versus E5-2667 v4 in the real world...:

    @flomer I would say this is a post-sales matter so I would go that route. You bought something and it is not performing to your expectations. Is it a hardware problem, software configuration or did you buy the wrong thing? It doesn't matter - you paid big money for it so get the vendor to sort it out.

    That's not at all how it works. Try buying a Chevy and telling the dealer, "I paid a lot of money and it doesn't haul eight people; you have to fix it." Clearly it was your job to evaluate your workload and determine your needs. And "a lot of money" is subjective.

    There is no one involved in this equation responsible for knowing the workload and performance characteristics except for the end user. Unless the procs aren't working correctly, the percentage of responsibility on the vendor would be zero.

