Dell Server Not Recognizing Memory
-
Here's a weird one. A new client with a Dell PE-R720XD SFF has 24 x 16GB sticks occupying every available slot in the server. As part of my inventory discovery work, I noticed that there are 6 slots that do not recognize the memory installed. I checked all modules and they are all the same Samsung ECC RDIMMs (16GB, 1333MHz), so it isn't a compatibility thing. I spent a few hours moving modules around, but the same slots fail to recognize memory regardless of which stick I put in them. It appears to be channel related because all unavailable slots are on Processor 2, Channels 0 and 1. Essentially, 2 channels on that second processor are not recognizing memory.
The weird thing is that the server is running "perfectly". I add the quotes because while there are no errors and all VMs are working well with no degradation in performance, there is obviously an issue.
B1 = Processor 2 Channel 0
B2 = Processor 2 Channel 1
B5 = Processor 2 Channel 0
B6 = Processor 2 Channel 1
B9 = Processor 2 Channel 0
B10 = Processor 2 Channel 1
To make sure I wasn't missing anything, I checked the manual and for 2 processor setups, the memory currently installed should work properly. I've also reseated every single module just in case.
There are absolutely no log entries indicating any issues with memory going back over a year and the server has been rebooted a number of times since I've been looking at the memory issue.
I've also run the Dell diagnostics utility on boot-up and everything checked out ok with a PASS on everything.
Before I start dismantling the server to diagnose, any thoughts as to what to test next?
These are the troublesome slots.
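For anyone following along, the slot list above can be grouped by channel with a quick sketch. The rule used here (slot number n sits on channel (n - 1) % 4) is an assumption that happens to match the six slots listed in this post; the name `channel_of` is made up for illustration, so check the owner's manual before relying on it for the other slots.

```python
from collections import defaultdict

# Slots that iDRAC reports as empty, taken from the list in the post.
missing_slots = ["B1", "B2", "B5", "B6", "B9", "B10"]


def channel_of(slot: str, channels_per_cpu: int = 4) -> int:
    """Assumed mapping: slot number n sits on channel (n - 1) % channels_per_cpu."""
    return (int(slot[1:]) - 1) % channels_per_cpu


# Group the missing slots by the channel they are assumed to sit on.
by_channel = defaultdict(list)
for slot in missing_slots:
    by_channel[channel_of(slot)].append(slot)

affected_channels = sorted(by_channel)
print(affected_channels)   # channels with at least one missing slot
print(dict(by_channel))    # missing slots grouped per channel
```

If every slot on a channel is missing, as is the case here for channels 0 and 1, the channel itself (or whatever feeds it) is a more likely suspect than any individual DIMM.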
-
@NashBrydges said in Dell Server Not Recognizing Memory:
Samsung ECC RDIMMs @ 16GB 1333MHz memory
Did you notice this in the manual?
NOTE: 16 GB quad-rank RDIMMs are not supported.
Are you able to determine the specific part number for these DIMMs?
-
@Danp said in Dell Server Not Recognizing Memory:
@NashBrydges said in Dell Server Not Recognizing Memory:
Samsung ECC RDIMMs @ 16GB 1333MHz memory
Did you notice this in the manual?
NOTE: 16 GB quad-rank RDIMMs are not supported.
Are you able to determine the specific part number for these DIMMs?
I'd check all the small numbers on the DIMMs.
It's possible that someone screwed up and didn't notice.
6x16GB of RAM that is not working is a total of 96GB RAM that is missing. That's a significant amount of the server's total RAM.
It's also possible that one CPU is faulty. That's extremely rare, but not impossible. I believe the DIMMs are connected directly to the CPU's internal memory controller.
It's a slightly odd memory configuration so it's not unlikely that it has been upgraded during its lifetime. Normally it's better to only use 8 DIMMs per CPU and, if you need more than 16x16GB, use 32GB LRDIMMs instead. You can't mix RDIMMs and LRDIMMs though, which is another way to screw up.
-
@NashBrydges said in Dell Server Not Recognizing Memory:
I've also run the Dell diagnostics utility on boot-up and everything checked out ok with a PASS on everything.
The diagnostics utility can't test what the CPU can't recognize or find. So it's of limited value.
-
@Danp I did, yeah, no quad-rank DIMMs.
-
@Pete-S That's what I also thought. I will have to spend some more time digging all the module numbers out tomorrow once I'm back there. There has to be something mismatched somewhere. Can't imagine anything else at this point.
-
@NashBrydges said in Dell Server Not Recognizing Memory:
The weird thing is that the server is running "perfectly". I add the quotes because while there are no errors and all VMs are working well with no degradation in performance, there is obviously an issue.
This is what is to be expected when the CPU doesn't recognize the memory.
What you have is one CPU with full memory bandwidth and 192GB of memory, and the other CPU with 96GB of memory and probably only half the memory bandwidth. So the server is less performant than it would normally have been.
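The numbers above can be checked with a bit of arithmetic. This is only a sketch: the 4 channels per CPU figure is an assumption based on the slot list in the first post, and treating bandwidth as proportional to the number of working channels is a simplification.

```python
DIMM_GB = 16
SLOTS_PER_CPU = 12        # 24 slots split across 2 CPUs
CHANNELS_PER_CPU = 4      # assumption: quad-channel, 3 slots per channel

missing_slots_cpu2 = 6    # B1, B2, B5, B6, B9, B10
dead_channels_cpu2 = 2    # channels 0 and 1 on processor 2

installed_gb = 2 * SLOTS_PER_CPU * DIMM_GB
cpu1_gb = SLOTS_PER_CPU * DIMM_GB
cpu2_gb = (SLOTS_PER_CPU - missing_slots_cpu2) * DIMM_GB
missing_gb = missing_slots_cpu2 * DIMM_GB

# Rough estimate: bandwidth scales with the number of working channels.
cpu2_bandwidth_fraction = (CHANNELS_PER_CPU - dead_channels_cpu2) / CHANNELS_PER_CPU

print(f"installed: {installed_gb} GB, usable: {cpu1_gb + cpu2_gb} GB, missing: {missing_gb} GB")
print(f"CPU1: {cpu1_gb} GB, CPU2: {cpu2_gb} GB at ~{cpu2_bandwidth_fraction:.0%} bandwidth")
```

So a quarter of the installed RAM is invisible, and anything scheduled on the second CPU is working with roughly half the memory bandwidth.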
-
@NashBrydges said in Dell Server Not Recognizing Memory:
@Pete-S That's what I also thought. I will have to spend some more time digging all the module numbers out tomorrow once I'm back there. There has to be something mismatched somewhere. Can't imagine anything else at this point.
If possible you should be prepared to swap the CPUs.
What kind of CPUs are in there? E5-26xx V2 something perhaps? V1 is probably more likely.
Troubleshooting time adds up quickly, so it might be time to consider what to do if the problem can't be solved easily, for example by inspecting and reseating the RAM.
The R720 is well past its expected life span at this point. It's very much a possibility that the server is on the verge of catastrophic failure and this is the first sign.
-
@Pete-S The modules have all been reseated and swapped around to other slots and still the same thing. The same 6 slots remain unidentified (or unoccupied according to iDRAC).
The CPUs are E5-2650 v1.
I've already had the conversation with the owner. Looks like we're going to keep things as they are since everything is operating normally (with the obvious missing RAM). We have good, tested backups and another server we can migrate the workload to in under an hour should something fail. He's unwilling to spend the cash on a new server, and a deep diagnosis would be pretty pricey given the cost of my time, so...status quo for now.
-
@NashBrydges Guess you can take a horse to water but you can't force him to drink.
-
@NashBrydges Did you try switching the positions of existing CPUs?