This time I really thought I had it.
But hopefully I have finally fixed it this afternoon.
Today, right as people were leaving, this error happened again.
This was after I had increased the dom0 memory allocation. Only this time it didn't last 7 minutes, it lasted over 20, probably due to the increased memory available to dom0, as that was the only change I had made recently.
SMlog was full of SR_BACKEND_FAILUREs, timeouts, all that bad stuff previously mentioned.
So when it all started working again, I looked at SMlog as it normally appears in my environment.
Every 30 seconds there is some message:
'''
XS001 SM: [16965] sr_scan {'sr_uuid': 'cc37b853-066e-fbcb-f5c2-dcca47fd168b', 'subtask_of': 'DummyRef:|737fa116-27fc-0ad6-c923-335d7d645e68|SR.scan', 'args': [], 'host_ref': 'OpaqueRef:ff71ac2a-851d-36dc-43e4-6ea0708498e9', 'session_ref': 'OpaqueRef:7433c31d-3a94-75fa-316b-c0549ce51389', 'device_config': {'username': 'admin', 'type': 'cifs', 'SRmaster': 'true', 'cifspassword_secret': pw hash removed', 'location': '//10.1.0.10/iso'}, 'command': 'sr_scan', 'sr_ref': 'OpaqueRef:77fe45fb-7f66-b2d0-1aac-72e990bfa378'}
'''
I went back and looked through all the archived SMlog.x.gz logs.
This message has been happening for at least 7 months, every 30 seconds without fail. In fact, I thought it was a normal SMlog message, because it has been happening at least since my first day on the job. It is also always the same SR UUID that shows up: the CIFS ISO share SR for Xen that was made right after installation back in 2013 (not by me). I would guess these types of errors have been happening since 2013, almost like clockwork.
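For anyone who wants to check their own archives the same way, here is a rough Python sketch that counts sr_scan entries per SR UUID across plain and gzipped SMlog files. It assumes the logs live under /var/log and that the entries look like the message above. The function names and the regex are just mine for illustration, so adjust them to your setup.

```python
import gzip
import re
from collections import Counter
from pathlib import Path

# Matches an sr_scan entry and captures the SR UUID from its argument dict.
# The pattern is an assumption based on the message format quoted above.
SR_SCAN_RE = re.compile(r"sr_scan \{.*?'sr_uuid': '([0-9a-f-]+)'")


def count_sr_scans(lines):
    """Count sr_scan entries per SR UUID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = SR_SCAN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def read_log(path):
    """Yield lines from a plain or gzip-compressed log file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", errors="replace") as handle:
        yield from handle


if __name__ == "__main__":
    totals = Counter()
    # Assumed log location; change the directory if your logs are elsewhere.
    for log_path in sorted(Path("/var/log").glob("SMlog*")):
        totals.update(count_sr_scans(read_log(log_path)))
    for sr_uuid, n in totals.most_common():
        print(sr_uuid, n)
```

If one UUID dominates the output across every archive, that matches what I saw by eye: the same SR being scanned over and over.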
About ready to throw my fist through something at this point, and users are unhappy.
I decided to unplug and forget this SR, and recreate it as an NFS ISO share in XS rather than CIFS/SMB.
I have done that now, so all I can do is wait a few days to see if the storage errors occur again. I can say, however, that those SMlog messages don't show up every 30 seconds anymore.
In fact, the only messages showing up right now are Unitrends snapshotting and attaching itself to VDIs for backups, so at least it is 'back to normal' for now. Though normal is $#@!ed, apparently.
It also sheds light on why the previous guy would have reduced the memory allocation to dom0, as this seems to shorten these timeouts, while adding more memory lengthens them (allegedly; I will know by Monday). If this is the actual fix, it means I will have solved a multiyear problem that 3 other people in my position were unable to solve. I really hope this is it.
And does anybody actually know what that 'error' message in SMlog means? It doesn't contain the word error and doesn't say anything bad, it just lists some UUIDs.
Hopefully this horse is way past dead, and I can go clean my shillelagh.