Where to find "best practice" for any given IT scenario
-
Discussing the ones you don't want simply means others will get confused. Leaving them out is a great way to avoid unnecessary conversations.
-
@scottalanmiller said:
@Carnival-Boy said:
RAID 5 works. I'd consider it not to be generally best practice on the grounds that it is slightly less reliable than RAID 10.
That's not at all why I consider it not a best practice. The reasons are complicated, far more than can be distilled to a quick statement. Also, RAID 5 does not work by most business definitions - it fails to provide the level of protection assumed or required. That's one piece.
It isn't that RAID 5 is slightly less reliable than RAID 10; it is that it is a full order of magnitude less reliable than RAID 6 which, in turn, is less reliable than RAID 10. But that isn't it either.
There are business scenarios where RAID 5, risky as it is, is certainly "safe enough". The issue becomes that when RAID 5 is "safe enough" there are other factors. It becomes too costly to make it safe enough, it fails to scale and remain safe, it isn't fast enough, etc. Given the combination of business factors around safety, speed, capacity and cost, no matter which one you optimize for, RAID 5 never comes out as the right one. It can be functional, but it can't ever be the right choice.
This, in turn, means that, knowing this, having RAID 5 in our decision matrix just increases the chances that we, as emotional humans, will get confused by the extra choice and make bad decisions. We do our best decision making when known bad choices are removed from consideration. If a computer were factoring in all of the issues and making a determination, having RAID 5 in the mix would not matter. But for humans, it really matters.
Isn't that pretty much what I said - it's slightly less reliable than RAID 10?
-
@Carnival-Boy It's not just a little bit, it's 10s if not 100s of times less reliable in a recovery situation.
The likelihood of recovering from a RAID 5 failure vs a RAID 10 failure is apples vs whales.
-
@Carnival-Boy said:
Isn't that pretty much what I said - it's slightly less reliable than RAID 10?
I would not use slightly when "order of magnitude" is involved
-
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
-
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math). From what I recall, 3.3 TB has like a 30% chance of hitting a URE, AKA total failure of the array. At something around 12TB there is statistically a 100% chance of hitting a URE (OK it might actually be 99.99%)
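For anyone who wants to check the URE arithmetic rather than recall it, here is a rough sketch. It assumes a very simple model: every bit read fails independently at the drive's spec-sheet rate, and one URE anywhere in the read counts as a failed rebuild. The 1-in-10^14 rate is a commonly quoted consumer-drive figure and is an assumption here, not a number from this thread; the function name is made up for illustration.

```python
import math

TB = 10**12  # decimal terabyte, the way drive vendors count it


def ure_probability(bytes_read, ure_rate_per_bit=1e-14):
    """Chance of at least one unrecoverable read error while reading bytes_read.

    Assumes each bit fails independently at ure_rate_per_bit, which is a
    simplification of how real drives behave.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 so tiny rates stay accurate
    return -math.expm1(bits * math.log1p(-ure_rate_per_bit))


for size_tb in (3.3, 12):
    p = ure_probability(size_tb * TB)
    print(f"{size_tb} TB read: {p:.1%} chance of at least one URE")
```

Under these assumptions the chance climbs with the amount of data read but never actually reaches 100%: 3.3 TB lands a little above 20% and 12 TB at roughly 60%. Change the assumed rate and the percentages move a lot, and the model ignores everything else that can go wrong during a rebuild.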
-
@scottalanmiller said:
@Dashrender said:
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
By providing it, you are presenting it as an option. Basically meaning you approved it. You can tell them your top choices, but if you include it in the list, it's approved to some degree or the conversation is confused.
What surprises me is how often IT people will present completely unreasonable options as options to management. If your car got a flat, would you offer to 1) fix the flat or 2) set the car on fire? No, you would not offer something ridiculous that isn't reasonable. But IT often does this to management.
The more likely scenario is that management will reject all provided solutions and ask why it can't be done cheaper. Of course it can be done cheaper, but with orders of magnitude more risk. What is the recommendation then?
-
Using RAID 6 or RAID 10 which are both safer than RAID 5.
-
@Dashrender said:
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math).
Cool. Facts are important here. A failure probability of 0.001% is 100 times higher than 0.00001%, so on those grounds it is orders of magnitude less reliable. But both are such tiny numbers that they could be ignored. That's where 'slightly' more reliable would also apply.
To go back to your car analogy, a Fiat is (probably) an order of magnitude less reliable than a Honda, but both are so reliable that you wouldn't necessarily say buying a Honda is best practice. Importing a car from North Korea might, however, be considered bad practice.
-
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
Can you even buy 300GB drives today?
Reliability should never be considered from a point of "already failed"; that misses part of the big picture. A RAID 5 array is more likely to experience a drive failure than RAID 10 as a starting point. We need to think about the total reliability, not the reliability from a single scenario.
Imagine this question to demonstrate why this is important:
"Which is more likely to survive a front end collision of 20mph, a Volvo C70 or a Ford Pinto?" You'd say the Volvo C70, of course.
But that assumes both cars HAVE had that accident. What if that wasn't the whole scenario? Let's ask again...
"Which is more likely to injure its passengers, a Volvo C70 driving 50mph on the highway or a Ford Pinto sitting idle in a garage?"
Suddenly the tables turn, because while one is more likely to survive an accident, the other is safer by avoiding the accident which is even more effective.
-
@Dashrender said:
@scottalanmiller said:
@Dashrender said:
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
By providing it, you are presenting it as an option. Basically meaning you approved it. You can tell them your top choices, but if you include it in the list, it's approved to some degree or the conversation is confused.
What surprises me is how often IT people will present completely unreasonable options as options to management. If your car got a flat, would you offer to 1) fix the flat or 2) set the car on fire? No, you would not offer something ridiculous that isn't reasonable. But IT often does this to management.
The more likely scenario is that management will reject all provided solutions and ask why it can't be done cheaper. Of course it can be done cheaper, but with orders of magnitude more risk. What is the recommendation then?
You say that it cannot be done cheaper while meeting goals. Ask them what goal they want to drop to reduce cost.
-
@Dashrender said:
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math). From what I recall, 3.3 TB has like a 30% chance of hitting a URE, AKA total failure of the array. At something around 12TB there is statistically a 100% chance of hitting a URE (OK it might actually be 99.99%)
Not that risky on the small SAS drives that are implied. But still riskier.
-
@Carnival-Boy said:
@Dashrender said:
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math).
Cool. Facts are important here. A failure probability of 0.001% is 100 times higher than 0.00001%, so on those grounds it is orders of magnitude less reliable. But both are such tiny numbers that they could be ignored. That's where 'slightly' more reliable would also apply.
An easy way to think of it: with RAID 10 you should expect to go a lifetime without hearing about anyone who has ever had this issue. With RAID 5 you should expect multiple complete failures in your career.
RAID 10 failure rates are less than 1 in 80,000 array years. RAID 5 is closer to 1 in 20.
There are so many factors that go into this from drives being more likely to fail, longer time for rebuilds, risk during rebuild, rebuild causing other drives to fail, risk of memory issues, etc.
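To put the "lifetime" versus "career" comparison into numbers, here is a back-of-envelope sketch. It reads both figures above as failures per array-year and treats every array-year as an independent trial. The 10 arrays and 30-year career are hypothetical values picked only for illustration, and the function names are made up. The RAID 10 figure is an upper bound, so its results are ceilings.

```python
def expected_failures(array_years, failures_per_array_year):
    """Average number of array failures over the given exposure."""
    return array_years * failures_per_array_year


def chance_of_any_failure(array_years, failures_per_array_year):
    """Chance of seeing at least one failure, treating years as independent."""
    return 1 - (1 - failures_per_array_year) ** array_years


ARRAYS = 10        # hypothetical: arrays one admin looks after
CAREER_YEARS = 30  # hypothetical: length of a career
array_years = ARRAYS * CAREER_YEARS

# Rates as quoted in the thread, read here as failures per array-year
RATES = {"RAID 5": 1 / 20, "RAID 10": 1 / 80_000}

for level, rate in RATES.items():
    expected = expected_failures(array_years, rate)
    any_failure = chance_of_any_failure(array_years, rate)
    print(f"{level}: {expected:.4f} expected failures, "
          f"{any_failure:.2%} chance of at least one")
```

With those assumed inputs, RAID 5 works out to about 15 expected failures over the career, while RAID 10 comes in at well under a 1% chance of ever seeing one.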
-
Based on using the different RAID types, of course.
-
Trying to eyeball the math, at 3.3TB of usable data, that RAID 5 array would fail way over 50% of the time with consumer class drives (like Red Pro). So with enterprise drives (like RE), which are 10x more reliable with regard to URE, we would expect rebuild risk from URE alone to be 5% or higher.
That is a one in twenty chance that the RAID 5 array would lose all of its data. This does not take into account secondary drive failure risk which is pretty big as well.
I would not put a one in twenty or maybe one in ten chance of failure on the same playing field as "so reliable no study can measure it completely." RAID 10 failures at 80,000 array years was only the known healthy rate; all that is known is that it is more reliable than that. Zero failures at 80,000 array years!
-
OK, RAID 5 isn't best practice. That's a relatively easy one. Give me some more examples where the term "best practice" might apply. I'm not convinced the term is that meaningful.
I'm having an extension built on my house at the moment, and I hear the term used quite a bit by my builders. There's building regulations that are legally required and there's ones that are best practice. For example, a shaver point should be located at least 30cm from the sink. That's not a legal requirement, but it's best practice. Smoke detectors should be mains powered not battery powered. Again, that's best practice rather than a legal requirement. These practices are pretty formal though - either by the manufacturer, or by the building regulators. I don't see much equivalence in the IT industry (sadly, as it would be super useful).
-
Best Practice: If data is valuable enough to be stored, it should be backed up.
-
@Carnival-Boy said:
OK, RAID 5 isn't best practice. That's a relatively easy one.
Actually it is a hard one. While avoiding RAID 5 is a well documented best practice among storage experts, the industry as a whole lacks that expertise and pushes RAID 5 heavily.
-
It's an easy one for anyone who hangs around the same forums you do
-
Another best practice: virtualize every workload (unless it is impossible to do so)