Xenserver Space Woes

jrc

Did you try using vgs and vhd-util to see whether hidden=1 is set on your troubled VHD?
There is also an XS6.5 update that was just released (well, it just showed up in my XC the other day), XS65ESP1029, with the description "fixes to Storage and Dom0 kernel modules".

I also just noticed your post yesterday in this thread about "Run out of space while coalescing".

I still think this has something to do with your Unitrends failures and snapshotting.
At this point I think all you can really do is reboot and/or delete those snapshots-but-not-snapshots, or just move the disk to another SR. However, before you do any of this, you should do something to save the data that could go missing.
I would robocopy /mir /copy:DATSO the shares on that virtual disk somewhere else, just in case you wipe out the virtual disk. I keep a 4TB USB3 drive hooked up to my desktop just for things like this.

Unfortunately I don't have a place to dump 2TB of home folders, and even if I did, I don't have the luxury of taking everyone's folder offline for the amount of time that would take.

The vgs and vhd-util tools did yield an entire list of VHDs and showed me their relationships, but I can't seem to correlate the 2 VHDs to any of the vhd-util output, though they do appear to be gone.

I am thinking that Unitrends is definitely behind this odd issue. So I have turned off all the backup jobs for now, and once the job currently running (backup of this VM with the 2TB drive) is complete, I'll run all these commands again and see what it looks like. Plus I'll update and reboot all the Xen hosts before I turn the backup jobs back on.

jrc

So this weekend I updated my Xen hosts and rebooted them both, so everything is 100% up to date. The reboot did not, however, fix my space issue. I am still using 2TB more than I should. I also turned off Unitrends for the last few days, so I am 100% sure that I do not have any snapshots on there, at least none that show up in XenCenter.

XenCenter reports that I am using 6173.8GB but only have 4115.7GB allocated, about 2TB more used than there should be.

Now when I run xe vdi-list is-a-snapshot=true I get:

uuid ( RO)                : 5535a3db-da4f-4211-afa8-077241f63221
          name-label ( RW): Staff Home
    name-description ( RW): VDI for staff home folders
             sr-uuid ( RO): 4558cecd-d90d-3259-7ea5-09478d0e386c
        virtual-size ( RO): 2,193,654,546,432
            sharable ( RO): false
           read-only ( RO): true

So I tried to delete this VDI with xe vdi-destroy uuid=5535a3db-da4f-4211-afa8-077241f63221, but I get:

This operation cannot be performed because the system does not manage this VDI
vdi: 5535a3db-da4f-4211-afa8-077241f63221 (Staff Home)

Running "Reclaim freed space" does not make a difference (and only takes about 5 seconds to run).

So any suggestions on where I can go from here? Moving this VM to other storage and then back is not really an option, since I can't take this VM down for the time it would take to move all 2TB (hours and hours); this is where all my users' home folders are.

DustinB3403

You'll have to change the VDI from read-only: true to false before you can edit it.

Is that disk not supposed to be there though?

jrc

@DustinB3403 said in Xenserver Space Woes:

You'll have to change the VDI from read-only: true to false before you can edit it.

Is that disk not supposed to be there though?

I am increasingly of the opinion that it should not be there. I have 2 VDIs called "Staff Home": one is the actual VDI connected to the VM, the other is this one. This VDI is also the exact size of the discrepancy between space used and space allocated. But I am completely open to any and all suggestions on how I can confirm that this VDI is in fact just wasted space.
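
The size comparison can be checked with a little arithmetic. This is only a rough sketch: it assumes XenCenter's "GB" figures are binary gigabytes (GiB), which nothing in the thread confirms, and it takes the used/allocated numbers and the virtual-size value from the posts above.

```python
# Sanity check: does the read-only VDI account for the space discrepancy?
# Assumption (not confirmed in the thread): XenCenter's "GB" are GiB (2**30 bytes).

GIB = 2 ** 30


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


used_gib = 6173.8              # "used" as reported by XenCenter
allocated_gib = 4115.7         # "allocated" as reported by XenCenter
virtual_size = 2193654546432   # virtual-size of the read-only "Staff Home" VDI

discrepancy_gib = used_gib - allocated_gib
vdi_gib = bytes_to_gib(virtual_size)

print(f"discrepancy: {discrepancy_gib:.1f} GiB")
print(f"VDI size:    {vdi_gib:.1f} GiB")
print(f"difference:  {discrepancy_gib - vdi_gib:.1f} GiB")
```

Under that assumption the two figures land within a few tens of GiB of each other, which is close enough to make the extra VDI a plausible suspect without proving it.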

Danp

Can you post the results of xe vdi-list showing both VDIs? I'm wondering if one VDI is acting as a base copy for the other.

jrc

@Danp said in Xenserver Space Woes:

Can you post the results of xe vdi-list showing both VDIs? I'm wondering if one VDI is acting as a base copy for the other.

Bad? One:

uuid ( RO): 5535a3db-da4f-4211-afa8-077241f63221
name-label ( RW): Staff Home
name-description ( RW): VDI for staff home folders
sr-uuid ( RO): 4558cecd-d90d-3259-7ea5-09478d0e386c
virtual-size ( RO): 2193654546432
sharable ( RO): false
read-only ( RO): true

Good one:

uuid ( RO): 6255caa0-e7d4-4d27-a257-b33aaf3a7507
name-label ( RW): Staff Home
name-description ( RW): VDI for staff home folders
sr-uuid ( RO): 4558cecd-d90d-3259-7ea5-09478d0e386c
virtual-size ( RO): 2193654546432
sharable ( RO): false
read-only ( RO): false

EDIT: Maybe I am barking up the wrong SR-VDI here, since I ran the vhd-util command and got:

vhd=VHD-f832866c-1bb4-48d5-81e7-4dd468b2618b capacity=2,193,654,546,432 size=2,197,689,466,880 hidden=1 parent=none
vhd=VHD-5535a3db-da4f-4211-afa8-077241f63221 capacity=2,193,654,546,432 size=14,424,211,456 hidden=1 parent=VHD-f832866c-1bb4-48d5-81e7-4dd468b2618b
vhd=VHD-6255caa0-e7d4-4d27-a257-b33aaf3a7507 capacity=2,193,654,546,432 size=2,197,945,319,424 hidden=0 parent=VHD-5535a3db-da4f-4211-afa8-077241f63221

That seems to imply that the "good" VDI is a child of the "bad" VDI, which is itself a child of the base VDI. That would seem to be "normal", but still, where is that extra 2TB going? And why can't I free it up?
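
One way to see where the space might be going is to walk the parent chain and add up what each VHD occupies. The sketch below parses the three lines pasted above; it assumes the size field is the space each VHD takes up on the SR, and the helper names (parse_scan, chain_to_root) are made up for illustration.

```python
# Parse the vhd-util scan lines quoted in the post and walk the parent chain.
import re

SCAN_OUTPUT = """\
vhd=VHD-f832866c-1bb4-48d5-81e7-4dd468b2618b capacity=2,193,654,546,432 size=2,197,689,466,880 hidden=1 parent=none
vhd=VHD-5535a3db-da4f-4211-afa8-077241f63221 capacity=2,193,654,546,432 size=14,424,211,456 hidden=1 parent=VHD-f832866c-1bb4-48d5-81e7-4dd468b2618b
vhd=VHD-6255caa0-e7d4-4d27-a257-b33aaf3a7507 capacity=2,193,654,546,432 size=2,197,945,319,424 hidden=0 parent=VHD-5535a3db-da4f-4211-afa8-077241f63221
"""

GIB = 2 ** 30


def parse_scan(text):
    """Turn each 'key=value ...' line into a dict keyed by VHD name."""
    vhds = {}
    for line in text.splitlines():
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        if "vhd" not in fields:
            continue
        vhds[fields["vhd"]] = {
            # The pasted numbers contain thousands separators; strip them.
            "capacity": int(fields["capacity"].replace(",", "")),
            "size": int(fields["size"].replace(",", "")),
            "hidden": fields["hidden"] == "1",
            "parent": None if fields["parent"] == "none" else fields["parent"],
        }
    return vhds


def chain_to_root(vhds, leaf):
    """Follow parent links from a leaf up to the base copy."""
    chain = [leaf]
    while vhds[chain[-1]]["parent"] is not None:
        chain.append(vhds[chain[-1]]["parent"])
    return chain


vhds = parse_scan(SCAN_OUTPUT)
leaf = next(name for name, v in vhds.items() if not v["hidden"])
chain = chain_to_root(vhds, leaf)

physical = sum(vhds[name]["size"] for name in chain)
virtual = vhds[leaf]["capacity"]

for name in chain:
    v = vhds[name]
    print(f"{name[:12]}  size={v['size'] / GIB:8.1f} GiB  hidden={v['hidden']}")
print(f"chain total: {physical / GIB:.1f} GiB for a {virtual / GIB:.1f} GiB disk")
print(f"overhead:    {(physical - virtual) / GIB:.1f} GiB")
```

Read this way, the hidden base copy and the visible leaf are each roughly the full size of the disk, so the chain as a whole takes about twice the virtual size until the leaf is coalesced into its parent.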

momurda

This issue is fascinating.
Here is an article from Citrix, the answer is probably here, though at this time it is a bit over my head.
http://support.citrix.com/article/CTX201296
This discusses coalescing, and reasons for failure and steps to troubleshoot and fix the coalescing issues.
There seem to be 8 possible reasons for the automatic coalesce failing.
/var/log/SMlog probably has more info about the problem according to this.
Also, are you able to move the SR (which will automatically get rid of snapshot chains), or export the VM, delete it, then import it?
I also think that any of these solutions require you to have sufficient free space on the SR.

jrc

@momurda said in Xenserver Space Woes:

This issue is fascinating.
Here is an article from Citrix, the answer is probably here, though at this time it is a bit over my head.
http://support.citrix.com/article/CTX201296
This discusses coalescing, and reasons for failure and steps to troubleshoot and fix the coalescing issues.
There seem to be 8 possible reasons for the automatic coalesce failing.
/var/log/SMlog probably has more info about the problem according to this.
Also, are you able to move the SR (which will automatically get rid of snapshot chains), or export the VM, delete it, then import it?
I also think that any of these solutions require you to have sufficient free space on the SR.

Browsing through /var/log/SMlog does not really show anything obvious. I can see where it is doing something with the three VDIs previously mentioned, but it looks like that was a success. Yet I continue to be using 2TB more than is virtually assigned.

I am going to dig through that support doc you linked and see if I can work anything out.

jrc

I think I may have worked it out. It would appear that the online coalesce for the VM in question keeps timing out on the specific VDI in question (the 6255... one). The Citrix article goes on to say this might be due to heavy load on the storage at the time it tries. I do not think this is the case here, but the suggested solution is to shut the VM down and do an offline coalesce with the command:

xe host-call-plugin host-uuid=<UUID of the pool master Host> plugin=coalesce-leaf fn=leaf-coalesce args:vm_uuid=<uuid of the VM you want to coalesce>

I am going to try this tonight and see what happens.
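
Since the command takes two long UUIDs, a small wrapper that validates them and prints the exact call before running anything can save a typo. This is a sketch only: coalesce_command and run_coalesce are hypothetical helper names, the UUIDs are placeholders, and the xe arguments are copied from the command above rather than taken from any other documentation.

```python
# Build (and optionally run) the offline leaf-coalesce call quoted above.
import subprocess
import uuid


def coalesce_command(host_uuid: str, vm_uuid: str) -> list:
    """Return the argv for the plugin call; raises ValueError on a malformed UUID."""
    # uuid.UUID() validates the format and normalizes it to lowercase.
    host = str(uuid.UUID(host_uuid))
    vm = str(uuid.UUID(vm_uuid))
    return [
        "xe", "host-call-plugin",
        f"host-uuid={host}",
        "plugin=coalesce-leaf",
        "fn=leaf-coalesce",
        f"args:vm_uuid={vm}",
    ]


def run_coalesce(host_uuid: str, vm_uuid: str, dry_run: bool = True) -> list:
    """Print the command; only execute it when dry_run is False."""
    argv = coalesce_command(host_uuid, vm_uuid)
    print(" ".join(argv))
    if not dry_run:
        # Per the thread, the VM should be shut down before this is run.
        subprocess.run(argv, check=True)
    return argv


# Placeholder UUIDs for illustration only; substitute the real ones.
run_coalesce("00000000-0000-0000-0000-000000000001",
             "00000000-0000-0000-0000-000000000002")
```

With dry_run left at its default the script only prints the command, so it can be checked by eye before anything touches the pool master.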

A side question: how does one work out 1. whether your storage is too slow, and 2. how many IOPS your storage is capable of?

momurda

In XenCenter, if your XenServer is up to date with all hotfixes, you can use the Performance tab in XC on the XS host to measure disk performance (read/write/total IOPS, queue length for each SR or virtual disk) and you should get accurate results. If you don't have the hotfixes installed, you probably will not get accurate results.

In general, longer queue lengths mean the disk can't keep up with what it is being asked to do.
You can also query performance from the CLI using iostat.

jrc

@momurda said in Xenserver Space Woes:

In XenCenter, if your XenServer is up to date with all hotfixes, you can use the Performance tab in XC on the XS host to measure disk performance (read/write/total IOPS, queue length for each SR or virtual disk) and you should get accurate results. If you don't have the hotfixes installed, you probably will not get accurate results.

In general, longer queue lengths mean the disk can't keep up with what it is being asked to do.
You can also query performance from the CLI using iostat.

Cool, I created a graph and added Disk IO Wait and Disk Queue size, but there appears to be no data (the hosts are completely up to date as of this weekend). I do note that on the standard Disk Performance graph there is not too much activity; over the last few days it's topped out at around 0.33MBps.

I guess I'll check in on it over the next few days and see what it looks like, but I don't think I'm having disk performance issues.

BRRABill

@momurda said

In XenCenter, if your Xenserver is up to date with all hotfixes,

Is it the hotfixes, or the XS Tools? I know the tools have to be installed to run some of the stuff. (Like memory.)

jrc

@BRRABill said in Xenserver Space Woes:

@momurda said

In XenCenter, if your Xenserver is up to date with all hotfixes,

Is it the hotfixes, or the XS Tools? I know the tools have to be installed to run some of the stuff. (Like memory.)

Good point. The tools are not up to date. So I'll need to update them tonight, though I am looking at historical data from before I applied SP1 and the other updates.

momurda

You can also throw some I/O at a disk by copying a large file or lots of small files to a VM (do it twice at the same time if you want to see if you max out) to test your IOPS. Or reboot a few VMs at the same time. My storage array hits 1500 IOPS or so before it starts to peak, IIRC from some tests I did back in the winter. Though I do wonder if some of that isn't bound by us using a Gb network rather than 10Gb.

[Image: iscsi iops for my XS001 Xenserver host]

This shows the last ten minutes of IOPS for all SRs attached to my XS001 host. The purple iscsi3 is an SR; I booted a VM that lives there that nobody ever uses.

jrc

So my IOPs seem to be jumping between 0 and 900k fairly quickly. But the Queue size seems to stay between 0 and 1, with the latency very low (near zero) as well. Network traffic is well under 1MBps. This is from the performance meters on the Xen master host.

scottalanmiller

@jrc said in Xenserver Space Woes:

So my IOPs seem to be jumping between 0 and 900k fairly quickly. But the Queue size seems to stay between 0 and 1, with the latency very low (near zero) as well. Network traffic is well under 1MBps. This is from the performance meters on the Xen master host.

Basically what that is telling me is that you have plenty of IOPS in reserve and you are never demanding more from it than it can provide. Those numbers are basically showing your storage as "idle" and ready for whatever you want to throw at it.
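
A rough way to see why a queue size of 0 to 1 with near-zero latency reads as idle is Little's Law: the average number of requests in flight equals the request rate times the average time each one takes. The sketch below is illustrative only; the sample readings are hypothetical and not taken from the graphs in this thread.

```python
# Little's Law: average outstanding I/Os = throughput (IOPS) x average latency (s).


def avg_queue_depth(iops: float, latency_ms: float) -> float:
    """Average number of requests in flight for a given rate and latency."""
    return iops * (latency_ms / 1000.0)


# Hypothetical readings, not taken from the thread's graphs.
samples = [
    ("quiet", 200, 0.5),
    ("busy", 1500, 2.0),
    ("swamped", 1500, 40.0),
]

for label, iops, latency_ms in samples:
    depth = avg_queue_depth(iops, latency_ms)
    print(f"{label:8s} {iops:5d} IOPS at {latency_ms:5.1f} ms -> queue depth {depth:.2f}")
```

The same request rate produces a much deeper queue once latency climbs, which is why queue length and latency together say more about storage headroom than an IOPS number on its own.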

jrc

@scottalanmiller said in Xenserver Space Woes:

Basically what that is telling me is that you have plenty of IOPS in reserve and you are never demanding more from it than it can provide. Those numbers are basically showing your storage as "idle" and ready for whatever you want to throw at it.

OK, so my gut on that was right. Then I need to work out why the leaf coalesce thingy is timing out, since it appears not to be a disk I/O thing.

jrc

I fixed it! Shut down the VM, then ran an offline coalesce and that did it:

xe host-call-plugin host-uuid=<Host UUID> plugin=coalesce-leaf fn=leaf-coalesce args:vm_uuid=<VM UUID>

It did take about 45 minutes, but once it was done the space was free. XenCenter is now happily reporting the used space as 4127GB and the virtually allocated space as 4115GB. It's not perfect, but I'll take it!

scottalanmiller

Awesome, glad that that fixed things.