Thank you!
I just registered after finding this site while searching for an article you had written... I was a little surprised to see how easy it was to register. Nice!
Hello again, community!
I have used VMware for years now, but have avoided becoming an expert ;-). I have seen how good it is at consolidating many servers onto one host, effectively handing out lots of RAM without consuming all of it in real life. One of my hosts has granted about 220 GB to its VMs, while only 125 GB (of 128 GB physical) is actually in use. That is good.
Now I have started experimenting with XenServer, for several reasons. One of them is that an old server could not install the current version of ESX, but XenServer 6.5 installed fine. The server is more than 7 years old, but has 32 cores and 64 GB RAM, so it is too good to let go. Also, after adding a couple of other old servers, I am able to do live (storage) migration for free. This makes XenServer attractive.
But now I have run into memory trouble. The server I will use as an example has 64 GB RAM. One VM has been given 32 GB RAM and another 24 GB RAM. Trying to power on a third VM with 8 GB RAM fails, because I have used up all the RAM... Hm. That was an unpleasant surprise. Does this mean that XenServer cannot "overcommit" memory like ESX can? Or is there a setting I can change?
Hm... After making sure XenTools is installed in all the VMs on the host, I can actually see the RAM of the already-running VMs being reduced as I power on the last VM -- it works as expected now that all the VMs have XenTools installed. Thanks!
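For anyone who finds this later: the feature involved appears to be Dynamic Memory Control, which relies on the balloon driver in XenServer Tools, and the per-VM dynamic ranges can be inspected from dom0 with the xe CLI. A minimal sketch, assuming the memory-dynamic-min/max parameter names are as I remember them on 6.5 (please verify with `xe vm-list params=all`):

```python
#!/usr/bin/env python
# Sketch: list each VM's dynamic memory range from dom0 via the xe CLI.
# Assumption: the parameter names below exist on this XenServer version.
import subprocess

output = subprocess.check_output([
    "xe", "vm-list", "is-control-domain=false",
    "params=name-label,memory-dynamic-min,memory-dynamic-max,memory-static-max",
])
print(output.decode())
```

Overcommit then happens by ballooning each guest between its dynamic minimum and maximum, which is why the tools had to be installed in every VM before it started working.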
Hello!
I have a DL380 G7 file server running FreeNAS 9.2. The data that is shared (CIFS, NFS) sits on a RAID 6 made up of 6 x 3 TB MDL disks attached to a Smart Array P410i. Last week the array started to get really slow. I had to shut down multiple VMs and move them off the server for my users to be able to get any work done, and I had to move the VMs over several nights in a row... The read rate of the logical drive was about 3 MB/s, yet all the lights on the front of the machine were blinking green. I was really puzzled, and wondered if perhaps the battery for the cache had stopped working.
I have had no problems with this setup earlier (it has been running for at least 10 months), but I discovered now that FreeNAS/FreeBSD is not well supported by HP (or the other way around), so I could not get any information from inside FreeNAS -- no software to inspect the status of the RAID. Once all VMs and critical data were moved off the server, I could finally reboot it and run the HP SmartStart CD. I ran a short diagnostic test, and... one of the drives was marked as Failed, due to too many read/write errors. That explains it all. I replaced the disk, waited 12 (!) hours for the rebuild to finish, and now read/write is back up to 250-400 MB/s. All is well.
So, the question remains (and I have also sent this to my HP dealer):
Why was the failing drive not clearly marked as bad (via the LEDs) and kicked from the RAID?
Has anyone ever seen this behaviour?
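For next time, this is the kind of check I would like to have available from inside FreeNAS itself. A rough sketch, assuming the cciss_vol_status tool from FreeBSD ports can talk to the P410i through the ciss device (I have not verified this on FreeNAS 9.2):

```python
#!/usr/bin/env python
# Sketch: ask the Smart Array for its logical volume status and complain
# loudly if it does not report "OK". Assumes cciss_vol_status is installed
# and the controller shows up as /dev/ciss0.
import subprocess
import sys

proc = subprocess.Popen(["cciss_vol_status", "/dev/ciss0"],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output, _ = proc.communicate()
output = output.decode()
print(output)

if proc.returncode != 0 or "OK" not in output:
    sys.exit("RAID volume is not healthy -- check the controller!")
```

Run from cron, something like this would at least have sent a warning instead of letting the array limp along for a week.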
I live in Norway and it's getting late here. I will return to this thread tomorrow!
Dear community!
I have two old AMD servers, one with 8 quad-core AMD Opteron 8347 CPUs @ 1908 MHz and 64 GB RAM, and the other with 4 quad-core Opteron 8393 SE CPUs @ 3100 MHz and 32 GB RAM. I managed to add them to the same pool in XenServer 6.5, so I guess the CPUs are compatible (enough). I am curious as to what has been done to mask the differences between the CPUs. I live migrated a VM from one to the other (8347 to 8393), and I noticed that the VM was still reporting the old clock speed for its CPU... Does this mean that the faster CPU of the second server will look to the VMs as if it has the same specs as the older 8347, or will this change when I reboot the VM? Is it better, from a performance point of view, to keep the two servers separate rather than in a pool?
Hm. So you are saying that a VM that is migrated from the 1.9 GHz host to the 3.1 GHz host will actually run faster, even though it will not report the correct speed until a reboot? In that case it will be perfectly fine to pool the hosts together, for added flexibility and possibly some HA fun.
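To convince myself, I will check what the guest actually reports before the live migration, right after it, and once more after a reboot. A trivial sketch I can run inside a Linux VM (nothing XenServer-specific, it just reads /proc/cpuinfo):

```python
#!/usr/bin/env python
# Sketch: print the CPU model and clock speed the guest believes it has.
# Run before live migration, after it, and again after a reboot to compare.
with open("/proc/cpuinfo") as cpuinfo:
    for line in cpuinfo:
        if line.startswith(("model name", "cpu MHz")):
            print(line.strip())
        if line.strip() == "":
            break  # only the first core is needed
```

If the pooling works as described above, the label should stay at the 8347's values after the migration, while the actual runtime of a workload improves, and the label should only catch up after the reboot.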
Well, it's on our lab network, and I am playing with and learning XenServer at the moment. I might move some of our production servers from vSphere to XenServer if things work out nicely.
Yes. At the moment we have two different Essentials Plus environments, and I must say I am at times very frustrated by the limitations. I mean, why not 5 hosts instead of just three (6 CPUs)?? We are a rather small SMB-type department in a large organization, and I see that XenServer will give us things like live storage migration for free, whereas with vSphere I have to shut down the VMs... I am also playing with View, and thinking about virtualizing a few workstations that have powerful GPUs -- for that we need full licenses for vGPU to work, and that will be prohibitively expensive when we are only talking about 5 users... I sometimes feel that we are caught in the middle: we are a business, but have a small budget. Sigh...
Thanks for your replies!
It's tricky for me to change to the same OS for the three clusters, since they are all in production... I barely managed to get the users to do some benchmarking before they started using the newest cluster full time... I guess I can take one node from each cluster and install CentOS 7 and test again, but then I will need another program to test with. Any ideas? Also, I cannot use CentOS 6, since I guess the newest CPUs will not be recognized.
As for BIOS settings, I have followed the vendor's recommendations. When it comes to help from Supermicro, there has been very little so far. They told me to check out the tips in a paper written by Intel and Ansys, but that did not give me much insight. It reads more like a promotional paper; try searching for "Higher Performance Across a Wide Range of ANSYS Fluent Simulations with the Intel Xeon Scalable Gold 6148 Processor". The only comment I found useful was "The improved vectorization, which is available as a runtime option in Fluent 18.1, was used in these benchmarks." I am pretty sure that version 19.2 of Fluent is aware of the Scalable processor features.
Anyway, since the core speed of the 6146 and the E5-2667 v4 is the same, I would think the new cluster should at least be equally fast. The faster memory (and we have populated all channels, as recommended) is supposed to be very important for Fluent, and that should make the 6146 cluster faster...
Do you know of any test program I can use to measure single-core and multi-core performance in a simple way from the command line? I could run it on one node in each cluster and compare. And do you know of a contact person or division at Supermicro I could raise this with?
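To be concrete about what I mean by "simple from the command line", something along these lines would do. A rough sketch in pure Python, so the absolute numbers mean nothing; only the ratio between nodes running the same script is interesting (I gather Linpack or STREAM would be the more standard choices):

```python
#!/usr/bin/env python
# Rough single-core vs. all-core timing sketch. Run the same script on one
# node of each cluster and compare the timings between clusters.
import multiprocessing
import time

def burn(n):
    # Purely CPU-bound floating-point loop; no I/O, no memory pressure.
    total = 0.0
    i = 1
    while i < n:
        total += 1.0 / i
        i += 1
    return total

def timed(workers, work=5000000):
    pool = multiprocessing.Pool(workers)
    start = time.time()
    pool.map(burn, [work] * workers)
    pool.close()
    pool.join()
    return time.time() - start

if __name__ == "__main__":
    cores = multiprocessing.cpu_count()
    print("1 worker   : %.2f s" % timed(1))
    print("%d workers : %.2f s" % (cores, timed(cores)))
```

It will not say anything about memory bandwidth, which is supposedly what Fluent cares about, but it would at least show whether raw per-core throughput differs between the 6146 and the E5-2667 v4 nodes.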
Hi!
Is Intel Xeon Scalable really better than the E5? In real life?
I manage three small HPC clusters. We use them to run Ansys Fluent 19.2 simulations. All three are built on Supermicro hardware, and rough specs for the clusters are as follows:
Cluster #1: Xeon E5-2643 @ 3.3 GHz, 1600 MHz memory, 40 Gb InfiniBand (128 cores in total)
Cluster #2: Xeon E5-2667 v4 @ 3.2 GHz, 2400 MHz memory, 56 Gb InfiniBand
Cluster #3: Xeon Gold 6146 @ 3.2 GHz, 2666 MHz memory, 100 Gb InfiniBand
All three clusters have an InfiniBand network between the compute nodes and the head node, and they scale very well. By scale I mean that the performance of a job scales up almost linearly as we use more cores. Fast disk access is supposedly not a big concern for Ansys Fluent; the speed of the cores and the memory matters more. All nodes have local SSD drives. All resulting data is written over NFS via ordinary Gb Ethernet to the head node, which has a RAID 6 with an SSD cache. I have never been able to see that disk access is a bottleneck for the calculations. The only time I notice the system spending time on "system" rather than "user" is when someone accidentally starts a job over the Gb interconnect rather than InfiniBand. So it seems to me that the only bottleneck is the speed of the cores.
When we purchase a new cluster, we of course compare the performance of the new cluster against the old one, using jobs that fit on both of them, e.g. jobs of up to 128 cores, since the oldest cluster has 128 cores. In this way we feel that we have been comparing them in a "fair" way.
Cluster #2 is about 20% faster than cluster #1. The E5-2643 CPUs have a core speed of 3.3 GHz versus 3.2 GHz for the newer E5-2667 v4s. I have explained to the users that the newer cluster (#2) is faster than the oldest one (#1) because of faster memory (1600 MHz for #1 versus 2400 MHz for #2) and a generally better and faster architecture. The InfiniBand interconnect is also faster (40 versus 56 Gb), but I don't think that matters. So, even though the oldest cluster has the fastest cores, everything else is faster.
Our newest cluster has CPUs with a core speed of 3.2 GHz, a generally newer and better/faster architecture, and faster memory (2666 MHz). Based on the difference between #1 and #2, I guessed that the newest cluster (#3) should be at least 20% faster than cluster #2, by the same reasoning about being newer and better and having faster memory -- faster even though the clock speed is the same as #2 (#2 has a lower clock speed than #1 but is still faster). The 100 Gb InfiniBand on #3 might also contribute.
But no! Cluster #3 is actually slower than cluster #2... Cluster #2 is about 10% faster than the newer and presumably better cluster #3...
So, I wonder... Has anybody else seen this? Does anybody have real-world examples and cases like ours? Is Intel Scalable generally faster than E5 v4 in the real world? We find that our E5 v4 cluster is actually the fastest...
If I have set something up the wrong way, I would be delighted if someone could point it out to me. How can I check whether something is wrong? All three systems run the Rocks cluster distribution (CentOS with extras): #1 runs version 6.1, #2 version 6.2 and #3 version 7.0.
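One concrete thing I plan to rule out myself is CPU frequency scaling behaving differently on the Gold 6146 nodes than on the E5 nodes. A small sketch, assuming the nodes expose the usual cpufreq sysfs files (the exact paths and the availability of scaling_cur_freq may differ under Rocks 7 / CentOS 7):

```python
#!/usr/bin/env python
# Sketch: dump the scaling governor and current frequency for every core,
# to rule out a node sitting in a power-saving mode while a job is running.
import glob

for gov_path in sorted(glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor")):
    cpu = gov_path.split("/")[5]
    with open(gov_path) as f:
        governor = f.read().strip()
    freq_path = gov_path.replace("scaling_governor", "scaling_cur_freq")
    try:
        with open(freq_path) as f:
            cur_mhz = int(f.read()) // 1000
    except IOError:
        cur_mhz = -1  # some drivers do not expose scaling_cur_freq
    print("%s: governor=%s, current=%s MHz" % (cpu, governor, cur_mhz))
```

If the governors and frequencies look the same on all three clusters while a Fluent job is running, then at least that is off the list.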
I am both puzzled and disappointed right now... I hope you gurus can help us out!
Well, it's not so simple for us. We usually have no remote support arrangement, and it's up to the customer's IT dept. to take care of things once the system leaves our premises. After commissioning we often don't even have admin rights. It's really puzzling, but we often feel that we abandon the systems once the customer takes over. Our company does have service personnel on the customer's site regularly, but that is for other tasks, not to service our system. We have had cases where we have been told that "the machine beeps, it's been like that for 3 months"... I guess you could say that someone, somewhere has a job to do. I will try to investigate this when we get our next project. We are often a subcontractor, and we rarely speak to the end user or the IT staff supporting them in the early stages of a project.
@scottalanmiller I will try to investigate this and try to get in touch with our customer's IT dept. early in the process for our next project. It will be interesting to see what happens. It is already quite puzzling that most customers want us to deliver the hardware, rather than just providing a server to us, or asking us for a VM. I guess that should be an indication that the instrumentation part of the offshore business is a little "special" when it comes to these things.
Well, RAID is often a requirement, and will be specified in the functional description. The idea is of course that it adds to the uptime, and that a failed drive does not make the "instrument" unusable. Then again, when a drive fails it will most probably go unnoticed (the server is isolated and out of our reach); it might perhaps be noticed if service personnel from our company are on the premises for other reasons. We are quite appalled by how our customers treat some of this equipment (some never take backups), but it's difficult for us to change this. Our servers are often small add-ons to the tons of steel that make up the rest of these multi-million dollar deliveries. The value of our systems (production monitoring) is often not realized until after the field is in production.
But seriously, how would you go about installing the hypervisor? On an internal SATA-DOM or a USB stick? And who would administer the hypervisor? We are often only allotted a single IP address. I can see that introducing a hypervisor with an IP address of its own might make the IT guys freak out. Then again, perhaps not -- they might be all for it. I will try to investigate this on our next project.
OK, bad choice of words, then. Perhaps "backwards" would have been better ;-). Anyway, virtualization will introduce a layer that needs a little bit of training, and it might seem more complicated to our customers. Perhaps I will suggest we go virtual on our next project. I guess Hyper-V will be available for free with the Windows Server 2012 OS, or am I wrong?
@scottalanmiller
I agree ;-). But we are in the offshore industry, our machines are regarded as instrumentation, and our segment of the industry is... slow to change. I think that what you say might one day be the case, though. We have other customers that are moving to VMs these days, but that is for our software doing calculations only. In this particular instance we are also collecting data, and this is the subsea segment, not topside. The customers are more conservative on the subsea side.
Sounds good! I was thinking it would be like this, but thought it was better to hear whether anybody has experienced problems before promising to help the other guys out. Yes, it is Windows, but I assume the other company will have the necessary information to re-activate the OS on their own machine. Thank you for your reply!
Dear community!
We have a customer that has purchased our software. We deliver it on an HP DL380 Gen9 that I have set up, installing the OS and our software.
The customer has another supplier, located close to us, that has also supplied them with a piece of software on an identical type of server. That server was sent to Singapore for a test, and when it arrived at the destination, all 8 hard drives had disappeared... Now, they could of course have the machine returned to them, put in new drives, reinstall everything and ship it to the test site again, but this would add weeks and would be bad for our customer.
Instead, our customer suggested that this other supplier come to our premises with a set of drives, borrow our server to create the RAID on the drives, install the OS and put their software on. Then they can ship just the drives to Singapore.
We have no problem with this solution from an administrative point of view. It would be helpful to our customer, and this other supplier is not a competitor of ours.
Now, what I am wondering about is this: if I power down our server and remove all the drives from the machine, am I risking problems of any kind? When this other party is done borrowing our server (or, more precisely, our customer's machine), can I simply put our drives back in the machine and have everything be as before? Will our RAID controller be confused by having the other, new drives in the bays for a few hours? And will the drives that are shipped to Singapore actually work in the server over there?
If the above does not disrupt our server, this will be a win-win-win situation: no problem for us, very good for our customer, and very good for the other supplier.
Well, the main application itself is a "server" that is started automatically as a service. It gathers data and performs calculations based on the input. Data may be exported, but not always; it is stored in proprietary databases. The user interface comes up by way of Interactive Services Detection, and is a bit of a pain... The application is being rewritten as we speak and will use HTML and a browser for the GUI in the next version. BUT, the customer only allows RDP traffic to the server, not HTTP, so...