Replacing the Dead IPOD, SAN Bit the Dust
-
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@Aconboy said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
looking at this thread, I would say that a Scale 1150 cluster would fit the bill nicely, and even with a single node for second site dr, he would still likely be under $35k all-in
That's what I was imagining. Might need slightly more than the baseline RAM, but even that might be enough with 2x 64GB nodes.
If they're not doing HA and all of that... why not get one beefier node rather than two smaller ones?
AKA Mainframe design.
Is it really mainframe design? Don't a lot of mainframes have tons of internal redundancies and failover components?
-
@Dashrender said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@Aconboy said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
looking at this thread, I would say that a Scale 1150 cluster would fit the bill nicely, and even with a single node for second site dr, he would still likely be under $35k all-in
That's what I was imagining. Might need slightly more than the baseline RAM, but even that might be enough with 2x 64GB nodes.
If they're not doing HA and all of that... why not get one beefier node rather than two smaller ones?
AKA Mainframe design.
Is it really mainframe design? Don't a lot of mainframes have tons of internal redundancies and failover components?
A "lot" of non-mainframes do, too. Those things are not what makes something a mainframe and lacking them is not what makes something else not a mainframe.
This is a "Mainframe Architecture", not a mainframe, meaning it is an architecture that is "Designed around a single highly reliable component" in contrast to other designs that rely on multiple components to make up for individual fragility.
-
gotcha.
-
I have to carve out an hour and a half to watch the two SAM presentations posted earlier in this thread...
-
Sounds like the business really wants something more robust, even if they didn't figure out how to do it the first time through, so going for something simple, but hyperconverged, seems like the obvious answer. Especially if it can come in way under the current expected budget.
-
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
-
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
-
@coliver said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
Could be an MSA. Bwahahaha.
-
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@coliver said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
Could be an MSA. Bwahahaha.
Some guy in the XS threads over the backup throughput problems is saying he's buying a brand new MSA for his new XS box...
-
@Dashrender said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
@coliver said in Replacing the Dead IPOD, SAN Bit the Dust:
@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:
Just saw another thread of someone who did the same thing.... depended on a black box SAN, let support lapse, and now is in tough shape: https://community.spiceworks.com/topic/1912628-emc-vnxe3100-troublesome-storage-pool-vmware-view-vdi
And they're using the VNXe line. Couldn't get in any worse shape.
Could be an MSA. Bwahahaha.
Some guy in the XS threads over the backup throughput problems is saying he's buying a brand new MSA for his new XS box...
Sad trombone plays.
-
@Aconboy @scottalanmiller Looks like I would need two 1150s, all decked out, in order to handle the processing power for the datacenter.
-
@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:
@Aconboy @scottalanmiller Looks like I would need 2 of 1150's all decked out in order to handle the processing power for the datacenter.
No, you need three. Just how the clusters work. The smallest SSD hybrid cluster is three 1150s. That gives you two units to handle the load and one to provide for N+1 failover. While things are not in a failure state, you get the power of the extra CPU and only drop 33% when a full node has failed.
With only two, you have to have the full CPU and RAM of all workloads in every node. With three, you need less per node. You can get a pretty extreme amount of power in a single 1150, so I doubt you need to go anywhere close to decked out.
You can also grow transparently in the future. Start with three today. Later you need more, just add a fourth node and you get more disk capacity, more IOPS, more RAM, more CPU. You just plug it in and let things load balance to the new node.
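The N+1 arithmetic above can be sketched in a few lines of Python. This is a rough illustration only, not a vendor sizing tool: the function names and the 128 GB figure are made up for the example, and real sizing would also have to account for CPU, storage, and hypervisor overhead.

```python
def per_node_share(total_load: float, nodes: int, tolerated_failures: int = 1) -> float:
    """Load each node must be able to carry so that the cluster still
    runs everything after `tolerated_failures` nodes are lost (N+1 by default)."""
    if nodes <= tolerated_failures:
        raise ValueError("need more nodes than tolerated failures")
    # The surviving nodes have to absorb the entire workload between them.
    return total_load / (nodes - tolerated_failures)


def capacity_lost(nodes: int, failed: int = 1) -> float:
    """Fraction of total cluster capacity that disappears when `failed` nodes die."""
    if not 0 <= failed <= nodes:
        raise ValueError("failed must be between 0 and nodes")
    return failed / nodes


# Hypothetical workload needing 128 GB of RAM in total.
two_node = per_node_share(128, nodes=2)    # every node must hold the full load
three_node = per_node_share(128, nodes=3)  # each node only needs half of it
drop = capacity_lost(nodes=3)              # roughly a third, as described above
```

With two nodes each box has to be sized for the whole workload, while with three each one only has to carry half of it, which is why the per-node spec can be smaller in the three-node cluster.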
-
An update and closing to this problem. I also posted about this on SW. My RAID6 was on the verge of failing. I had 2 disks die on me, got them swapped, and the SAN was attempting to rebuild both of them at the same time while a 3rd and a 4th disk were wanting to die as well. During the rewrites to the drives, the SAN would hit a bad sector and fail the rewrite, causing the SAN to go offline and taking the VM down with it. We eventually had to take the last pulled drive to a data restoration place with a spare, and they were able to get the data off of one drive and onto another overnight. That cost us about $2,400, but when you're talking about millions of dollars of orders a week, $2.4k is a drop in the bucket.
I did call Dell support and they were kind enough to remote in and assess the situation, even escalating it to a Storage Engineer. The SE got in and was able to tell me what was going on: my firmware was 2nd from the latest version and both cards were working well, but I was about to lose the RAID6 array.
Lessons learned:
- Make sure and double check that you have backups of business critical servers. Test them, especially if you have spare hardware doing nothing. If you have the hardware and are not testing your backups, you are doing yourself a serious injustice. Please refer back to Veeam's 3-2-1 rule when it comes to backup strategizing.
- Keep an eye on your SANs and keep them happy. Replace disks when needed, keep spares on the shelf, and keep the firmware up to date.
- Management (if you are listening): Put your IT department on a 5-7 year refresh cycle. All machines are man-made. Man is fallible. Therefore, so are the machines that they make. Machines are going to fail eventually. Make sure that you have an architecture that is fault tolerant and able to be replaced on a 5-7 year cycle. Plus, keep a maintenance agreement with each of these manufacturers as long as you are on the equipment.
- Assess your design architecture. Are you currently using the Inverted Pyramid of Doom (IPOD)? If so, and management allows, get off of it. Go to Hyperconvergence. At your main data location, make sure that you have at least 3 nodes. 2 for load balancing and 1 for failover. At each of your additional sites, put in at least 2 nodes, 1 for production and 1 for a backup. Still keep to the 5-7 year refresh cycle.
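For anyone who wants to sanity check a backup plan against the 3-2-1 rule mentioned in the lessons above (at least 3 copies of the data, on at least 2 different kinds of media, with at least 1 copy offsite), here is a minimal Python sketch. The `BackupCopy` type, its field names, and the sample inventory are all hypothetical and not taken from any backup product.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    name: str      # hypothetical label for the copy
    media: str     # e.g. "disk", "tape", "cloud"
    offsite: bool  # True if the copy lives away from the main site


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 offsite."""
    copies = list(copies)
    media_types = {c.media for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite


# Hypothetical inventory: production data, a local backup, and an offsite tape.
inventory = [
    BackupCopy("production SAN", media="disk", offsite=False),
    BackupCopy("local backup server", media="disk", offsite=False),
    BackupCopy("tape at second site", media="tape", offsite=True),
]
plan_ok = satisfies_3_2_1(inventory)
```

The check only counts copies. It says nothing about whether a restore actually works, so it does not replace the restore testing described above.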
My management has decided not to check out hyperconvergence, but is sticking with the IPOD scheme for now. We are going to be reutilizing one of our EQLs for replication of data from the Compellent SAN. However, I want to note a point from one of @scottalanmiller's videos: added complexity does not increase resiliency in the network, it just gives Murphy's Law (if it can fail, it will fail) more to work with.
-
Also, the SAN in question has been retired. We have 2 others in our datacenter that have had their data pulled from them, and those SANs have been taken offline. I'll go and pull them out of the datacenter tomorrow.
-
@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:
- Management (if you are listening): Put your IT department on a 5-7 year refresh cycle.
That's not at all the issue here. There are three real issues, none of them related to the age of the equipment. This could have happened on day one with new gear.
- Using low end gear that isn't designed for high reliability when high reliability is needed (you said that $2,400 for data recovery was a drop in the bucket, and yet they chose gear that doesn't reflect that financial reality). Your SAN is around the home line; it's not something I would use in any production scenario.
- Using an appliance without support. This is way below the home line.
- Using an architecture that is designed to be ultra risky without benefit. (You addressed this, just pointing it out again.)
Fix any one of those three mistakes and the issue would have been avoided.
-
@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:
My management has decided not to check out hyperconvergence, but is sticking with the IPOD scheme for now. We are going to be reutilizing one of our EQLs for replication of data from the Compellent SAN. However, I want to note a point from one of @scottalanmiller's videos: added complexity does not increase resiliency in the network, it just gives Murphy's Law (if it can fail, it will fail) more to work with.
Wow, so at this point, they are committed to the idea that their systems aren't valuable. If I were the CEO, this is where I'd be investigating to see what is going on: where is the money flowing for these SANs, and why would someone be spending so much money to put the company at risk?
-
Moving to the Compellent definitely helps, but retains all of the core problems.
-
The big thing that I would ask.... why didn't they do a post mortem to determine what went wrong? This was pretty huge and it sounds like they learned nothing from it and are burying their heads in the sand.
There should be a team investigation to determine how things got so bad. Finding the technical issues that I listed above would get them to a proximate tech failure point. But then there need to be questions asked of "how did those mistakes happen?" There is a decision making problem somewhere in management that sounds like it is being ignored completely. It's known to exist, but I'm guessing that no one is checking on it at all. How do they expect to improve as a company if they ignore these problems? Not only does that avoid improvement, but in a way it rewards bad practices. There is no risk to screwing over the company by doing a bad job; no one will even mention it, I'm guessing.
In a healthy environment, there should be a team probing to figure out how things got this bad. Was it because someone in management doesn't know tech but injected an opinion? Did a tech person make a mistake? Was someone not doing their job and hoping that a sales guy would do it for free for them? Did someone get a kickback (more common than you'd think)?
-
@scottalanmiller One of those was a technician mistake: neglecting the SAN's alerts. As said before, the SAN was throwing disk failure errors. 2 disks had already failed and the SAN was trying to rebuild onto the spares that it had. During this rebuild, 2 other disks were also wanting to fail, but the SAN controllers were not allowing them to fail.
I'm trying to build better practices in myself by checking in on these systems on a daily basis to make sure there are no actions that need to be taken before alerts lead to issues.
We're only a 4-man team covering these 3 locations: IT Manager (Boss), SysAdmin (me), and 2 other guys in helpdesk. Not trying to promote laziness or anything, but I also can't monitor systems 24/7 or I'll find myself divorced and crazy real quick. I suppose there is a way to have a system monitor other systems and alert me if certain conditions arise? I assume it would work off of something such as SNMPv3? Any recommendations?
-
Can the SANs fire off email alerts or SNMP traps or anything?
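If the SANs can't send alerts on their own, the "system that monitors other systems" idea could look roughly like the Python sketch below. It makes no real SNMP calls. It assumes some other piece (an SNMP poller, a vendor CLI wrapper, whatever is available) has already produced a mapping of disk name to state string, and the state names, SAN name, and addresses here are all hypothetical. It only decides whether an alert is needed and builds the email using the standard library.

```python
from email.message import EmailMessage
from typing import Dict, Optional

# Assumed state names; a real SAN will report its own vocabulary.
HEALTHY_STATES = {"online", "spare"}


def find_problem_disks(states: Dict[str, str]) -> Dict[str, str]:
    """Return only the disks whose state is not in HEALTHY_STATES."""
    return {disk: state for disk, state in states.items()
            if state.lower() not in HEALTHY_STATES}


def build_alert(san_name: str, problems: Dict[str, str],
                sender: str, recipient: str) -> Optional[EmailMessage]:
    """Build an alert email, or return None when there is nothing to report."""
    if not problems:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[SAN ALERT] {san_name}: {len(problems)} disk(s) need attention"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{disk}: {state}"
                              for disk, state in sorted(problems.items())))
    return msg


# Hypothetical poll result and addresses.
polled = {"disk0": "online", "disk1": "failed", "disk2": "Spare"}
alert = build_alert("san-01", find_problem_disks(polled),
                    "monitor@example.com", "sysadmin@example.com")
# Delivery would go through smtplib, e.g. smtplib.SMTP(host).send_message(alert),
# run from cron or a scheduler so nobody has to watch the console.
```

In practice an off-the-shelf monitoring system that receives SNMP traps or polls the arrays is likely a better fit than a hand-rolled script. The sketch is only meant to show the shape of the check.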