XenServer 6.5 - SR "Run out of space while coalescing."

anthonyh

I heard through @scottalanmiller (thanks again, BTW!) that there are some pretty active XenServer users here, so hopefully some light can be shed on this weird issue that hast just started happening for me.

I received the following alert multiple times from my XenServer pool today:

Field Value

Name: No space left on device
Priority: 3
Class: SR
Object UUID: 545839f5-e2fc-e972-9391-d5641a60a567
Timestamp: 20151017T15:06:49Z
Message UUID: 4cdd6b52-3343-fd64-bd37-e37a59b1a793
Pool name: MyPool
Body: Run out of space while coalescing.

I've checked the SR and according to XenCenter it is approximately 50% consumed:

I checked the SAN itself and it shows even less disk space consumed. In any rate, all signs show I have plenty of disk space.

Is there any way to get more detail behind what was going on when this happened? I looked at /var/SMlog and didn't see anything out of the ordinary, but maybe I need to look again.

Any ideas? Thanks!

scottalanmiller

Not a lot of reference posts out there. Here is one that I found as a starting point for research: http://discussions.citrix.com/topic/361389-run-out-of-space-while-coalescing/

scottalanmiller

Are there a lot of snaps in that pool?

scottalanmiller

Some interesting stuff in here: http://discussions.citrix.com/topic/347002-how-to-fix-vhd-coalesce-mess-in-xenserver-62

From the article, try some of these when this happens and let's see if there is something happening trying to clean things up:

tail (or cat, if you like) /var/log/SMlog | grep -i "exception"
tail (or cat, if you like) /var/log/SMlog | grep -i "coalesced"
tail (or cat, if you like) /var/log/SMlog | grep -i "kicking"
tail (or cat, if you like) /var/log/SMlog | grep "GC"

scottalanmiller

My guess is that there are many snaps and a ton of space is used during the coalesce phase and it then clears up before you are checking it so that it appears like there is a lot of space but it actually runs out at some point momentarily.

DustinB3403

Sorry did this ever get addressed? Did you check for any dis-associated vDisk that you can delete?

scottalanmiller

Oh yeah, I forgot that we were waiting for him to get back to us on this.

scottalanmiller

He had a thread on SW that did not get as much traffic but had two responses but went silent too...

http://community.spiceworks.com/topic/1242191-xenserver-6-5-sr-run-out-of-space-while-coalescing

He's not been on here for nine days.

scottalanmiller

Just checked, he went quiet on SW the same day, nine days ago. I sent him a PM, hope that all is okay.

scottalanmiller

I checked in, he's just been too busy with a new rollout to revisit this.

anthonyh

Hey all - So sorry for going dead! We implemented a new system that basically "runs the organization" and it's been a mad house. We consolidated 6 separate databases into one system, and even after a year of doing data migration tests, UAT, etc., it's still a bit of a mess (which we anticipated, fortunately). In any rate, I am here to update this thread on the issue...

It seems that the issue was caused by my backup software. Because the backup jobs were taking so long to process, for one magical moment, the software would have snapshots created for every single VM. This was the cause of the issue. After beefing up our backup server and adding SSDs for the metadata database (deduplication), backup windows are MUCH better, jobs are not overlapping as much as before, and this issue has not happened since.

Thanks everyone for the help! I'm hoping to be able to participate a bit more moving forward. We'll see how that goes...

scottalanmiller

Awesome, glad that you managed to find a solution!

JaredBusch

Looks like it was snapshots as @scottalanmiller was finding. Good to know.

anthonyh

@JaredBusch I suspected the same as well, but what was confusing was that they would be cleaned up by the time I would be able to troubleshoot. It seemed to be a very short moment where it would happen, as it seemed to happen long enough for XenCenter to alert, but not long enough to have any impact on the environment.

scottalanmiller

So many things use snapshots under the hood, space can grow and shrink pretty rapidly in confusing ways.