Replacing the Dead IPOD, SAN Bit the Dust
-
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
-
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
-
@coliver said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
Could be an MSA. Bwahahaha.
-
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@coliver said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
Could be an MSA. Bwahahaha.
Some guy in the XS threads over the backup throughput problems is saying he's buying a brand new MSA for his new XS box...
-
@Dashrender said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@coliver said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
Could be an MSA. Bwahahaha.
Some guy in the XS threads over the backup throughput problems is saying he's buying a brand new MSA for his new XS box...
Sad trombone plays.
-
@Aconboy @scottalanmiller Looks like I would need 2 of 1150's all decked out in order to handle the processing power for the datacenter.
-
@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:
@Aconboy @scottalanmiller Looks like I would need 2 of 1150's all decked out in order to handle the processing power for the datacenter.
No, you need three. Just how the clusters work. The smallest SSD hybrid cluster is three 1150s. That gives you two units to handle the load and one to provide for N+1 failover. While things are not in a failure state, you get the power of the extra CPU and only drop 33% when a full node has failed.
With only two, you have to have the full CPU and RAM of all workloads in every node. WIth three, you need less per node. You can get a pretty extreme amount of power in a single 1150, I doubt you need to go anywhere close to decked out.
You can also grow transparently in the future. Start with three today. Later you need more, just add a fourth node and you get more disk capacity, more IOPS, more RAM, more CPU. You just plug it in and let things load balance to the new node.
-
An update and closing to this problem. I also posted about this on SW as well. My RAID6 was on the verge of failing. I had 2 disks die on me, got them swapped, and the SAN was attempting to rebuilt both of them at the same time while a 3rd and 4th disk was wanting to die as well. During the rewrites to the drives, the SAN would hit a bad sector and fail the rewrite, causing the SAN to go offline and taking the vm down with it. We eventually had to take the last pulled drive to a data restoration place with a spare and they were able to get the data off of one drive and onto another overnight. That cost us about $2,400, but when you're talking about millions of dollars of orders a week, $2.4k is a drop in the bucket.
I did call Dell support and they were kind enough to remote in and assess the situation, even escalating it to a Storage Engineer. The SE got in and was able to tell me what was going on. My firmware was 2nd from the latest version, both cards were working well, but was about to lose the RAID6 array.
Lessons learned:
- Make sure and double check that you have backups to business critical servers. Test them. Especially if you have the spare hardware doing nothing. If you have the hardware and are not testing your backups, you are doing yourself a serious injustice. Please refer back to Veeam's 3-2-1 rule when it comes to backup strategizing.
- Keep an eye on your SANs and keep them happy. Replace disks when needed and keep the firmware up to date. Replace your disks and have spares on the shelf.
- Management (if you are listening): Put your IT department on a 5-7 year refresh cycle. All machines are man-made. Man is fallible. Therefore, so are the machines that they make. Machines are going to fail eventually. Make sure that you have an architecture that is fault tolerant and able to be replaced on a 5-7 year cycle. Plus, keep a maintenance agreement with each of these manufacturers as long as you are on the equipment.
- Assess your design architecture. Are you currently using the Inverted Pyramid of Doom (IPOD)? If so, and management allows, get off of it. Go to Hyperconvergence. At your main data location, make sure that you have at least 3 nodes. 2 for load balancing and 1 for failover. At each of your additional sites, put in at least 2 nodes, 1 for production and 1 for a backup. Still keep to the 5-7 year refresh cycle.
My management has decided not to check out hyperconvergence, but are sticking with the IPOD scheme for now. We are going to be reutilizing one of our EQL's for replication of data from the Compellent SAN. However, I want to note in one of @scottalanmiller's videos that added complexity does not increase resiliency in the network, but adds more of Moore's Law saying that if it can fail, it will fail.
-
Also, the SAN in question has bee retired. We have 2 others in our datacenter that has their data pulled from them and the SANs in question have been taken offline. I'll go and pull them out of the datacenter tomorrow.
-
@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:
- Management (if you are listening): Put your IT department on a 5-7 year refresh cycle.
That's not at all the issue here. There are three real issues, none of them related to the age of the equipment, this could have happened on day one with new gear.
- Using low end gear that isn't designed for high reliability when highly reliable is needed (you said that $2,400 for data recovery was a drop in the bucket, and yet they chose gear that doesn't reflect that financial reality). Your SAN is around the home line, it's not something I would use in any production scenario.
- Using an appliance without support. This is way below the home line.
- Using an architecture that is designed to be ultra risky without benefit. (You addressed, this, just pointing it out again.)
Fix any of those three mistakes that the issue would have been avoided.
-
@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:
My management has decided not to check out hyperconvergence, but are sticking with the IPOD scheme for now. We are going to be reutilizing one of our EQL's for replication of data from the Compellent SAN. However, I want to note in one of @scottalanmiller's videos that added complexity does not increase resiliency in the network, but adds more of Moore's Law saying that if it can fail, it will fail.
Wow, so at this point, they are committed to the fact that their systems aren't valuable. If I was the CEO, this is where I'd be investigating to see what is going on, where is the money flowing for these SANs and why would someone be spending so much money to put the company at risk.
-
Moving to the Compellent definitely helps, but retains all of the core problems.
-
The big thing that I would ask.... why didn't they do a post mortem to determine what went wrong? This was pretty huge and it sounds like they learned nothing from it and are burying their heads in the sand.
There should be a team investigation to determine how things go so bad. Finding the technical issues that I listed above would get them to a proximate tech failure point. But then there needs to be questions asked of "how did those mistakes happen." There is a management decision making problem somewhere in management that sounds like it is being ignored completely. It's known to exist, but I'm guessing that no one is checking on it at all. How do they expect to improve as a company if they ignore these problems? Not only does that avoid improvement, but in a way it rewards bad practices. No risk to screwing over the company by doing a bad job, no one will even mention it, I'm guessing.
In a healthy environment, there should be a team probing to figure out how things got this bad. Was it because someone in management doesn't know tech but injected an opinion? Did a tech person make a mistake? Was someone not doing their job and hoping that a sales guy would do it for free for them? Did someone get a kickback (more common than you'd think.)
-
@scottalanmiller One of those was a technician mistake by neglecting the alerts of the SAN. As said before, the SAN was throwing errors of disk failures. 2 disks had already failed and was trying to rebuild off of spares that it had. During this rebuild, 2 other disks were also wanting to fail but the SAN controllers were not allowing for it to fail.
I'm trying to start better practices in myself by checking in on these systems on a daily basis to make sure there are no actions that would need to be taken before alerts leads to issues.
We're only a 4-man team covering these 3 locations. IT Manager (Boss), SysAdmin (Me), 2 other guys in helpdesk. Not trying to promote laziness or anything, but I also can't monitor systems 24/7 or I'll find myself divorced and crazy real quick. I suppose there is a way to have a system monitor other systems and alert me if certain conditions arise? I assume off of such things such as SNMPv3 or something? Any recommendations?
-
Can the SANs fire off email alerts or SNMP traps or anything?
-
@dafyre Typically yes, but the storage consultant advised that we not connect the storage to the house network as it posses a security issue. My thought process is that if they are already within the network then they are going to get to the data, then they are going to get through to the virtual environment anyways. If they are already in your network, then they are probably using either an admin account or a service account. Either way, they're getting in.
-
@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:
@dafyre Typically yes, but the storage consultant advised that we not connect the storage to the house network as it posses a security issue. My thought process is that if they are already within the network then they are going to get to the data, then they are going to get through to the virtual environment anyways. If they are already in your network, then they are probably using either an admin account or a service account. Either way, they're getting in.
Typical recommendations I've seen are for there to be a management VLAN, and a separate VLAN for the actual storage traffic... But as you say, when hackers get in, you have bigger problems anyhow.
My 2c worth would be to set up the email alerts anyway... it will save you this pain later on down the road. I'd set it up on any SAN you have that has the option, lol.
-
@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:
Can the SANs fire off email alerts or SNMP traps or anything?
Pretty much any device can do that.
-
@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:
Typical recommendations I've seen are for there to be a management VLAN, and a separate VLAN for the actual storage traffic... But as you say, when hackers get in, you have bigger problems anyhow.
Storage should always be a true physical SAN, not a VLAN SAN. VLAN is fine for security, but you want a physically separate SAN to make sure that the backplane does not get overloaded. It's performance and reliability why you keep the SAN separate physically.
-
@StrongBad said in Replacing the Dead IPOD, SAN Bit the Dust:
@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:
Typical recommendations I've seen are for there to be a management VLAN, and a separate VLAN for the actual storage traffic... But as you say, when hackers get in, you have bigger problems anyhow.
Storage should always be a true physical SAN, not a VLAN SAN. VLAN is fine for security, but you want a physically separate SAN to make sure that the backplane does not get overloaded. It's performance and reliability why you keep the SAN separate physically.
The recommendations I saw were to keep the actual SAN storage traffic separate from the rest of the network to improve performance and security.