Where to find "best practice" for any given IT scenario
-
Best practices will essentially never involve technical details. They are called "practices" because they are generally about human processes, not technical details. Tech details change and exist in massive variety for a reason. If there were truly best technical processes, I would run out and create "Best Practices IT Gear" and sell pre-built, best-practice equipment at a premium, and no one would ever need to buy elsewhere or even work in IT. But I can't do that. There are things in IT we can "always avoid", like RAID 5 on spinning disks or not taking backups of important data, but there are very few things we "always do".
-
@DustinB3403 said:
@Carnival-Boy OK think of it like this, the software comparison is like comparing two cars, both do the exact same thing, get you from point A to Point B.
Both are reliable enough to work consistently.
I'm really not sure of your definition of best practice. I'm not sure what my definition is, but I'm sure it would be different to yours. Your example works for RAID too. RAID 5 and RAID 10 do the same thing and both are reliable enough to work consistently. RAID 5 works. I'd consider it not to be generally best practice on the grounds that it is slightly less reliable than RAID 10. Much like cars, they all work, but some are more reliable than others.
Another example. I currently use Google Drive as a secondary backup of my corporate data. That is definitely not best practice (in my definition, anyways :)). It is not designed for that and it is not massively reliable. But it works. No IT firm would recommend that solution to their clients (I'd hope), but I'm still happy to use it. Partly because I understand the risks.
Another example. I believe it is best practice to only buy HP or Dell PCs in a corporate environment. But as with your car example, other manufacturers are perfectly reliable and most of the time you'd have no problems with them. But I think people should only buy HP/Dell.
Hmmmn, actually maybe that's more of a rule of thumb than best practice.
-
@Carnival-Boy said:
RAID 5 works. I'd consider it not to be generally best practice on the grounds that it is slightly less reliable than RAID 10.
That's not at all why I consider it not a best practice. The reasons are complicated, far more than can be distilled to a quick statement. Also, RAID 5 does not work by most business definitions - it fails to provide the level of protection assumed or required. That's one piece.
It isn't that RAID 5 is slightly less reliable than RAID 10; it is that it is a full order of magnitude less reliable than RAID 6 which, in turn, is less reliable than RAID 10. But that isn't it either.
There are business scenarios where RAID 5, risky as it is, is certainly "safe enough". The issue is that when RAID 5 is "safe enough", other factors come into play: it becomes too costly to make it safe enough, it fails to scale and remain safe, it isn't fast enough, etc. Across the business factors of safety, speed, capacity and cost, no matter which one you optimize for, RAID 5 never comes out as the right one. It can be functional, but it can't ever be the right choice.
This, in turn, means that, knowing this, having RAID 5 in our decision matrix just increases the chances that we, as emotional humans, will get confused by the extra choice and make bad decisions. We do our best decision making when known bad choices are removed from consideration. If a computer were factoring in all of the issues and making the determination, having RAID 5 in the mix would not matter. But for humans, it really matters.
So not only is it never the right choice, but it could actively influence us to make bad choices too.
Best practice is to simply remove it from consideration to clarify the remaining choices.
-
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
-
@Dashrender said:
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
By providing it, you are presenting it as an option. Basically meaning you approved it. You can tell them your top choices, but if you include it in the list, it's approved to some degree or the conversation is confused.
What surprises me is how often IT people will present completely unreasonable options as options to management. If your car got a flat, would you offer to 1) fix the flat or 2) set the car on fire? No, you would not offer something ridiculous that isn't reasonable. But IT often does this to management.
-
Discussing the ones you don't want simply means others will get confused. Leaving them out is a great way to avoid unnecessary conversations.
-
@scottalanmiller said:
@Carnival-Boy said:
RAID 5 works. I'd consider it not to be generally best practice on the grounds that it is slightly less reliable than RAID 10.
That's not at all why I consider it not a best practice. The reasons are complicated, far more than can be distilled to a quick statement. Also, RAID 5 does not work by most business definitions - it fails to provide the level of protection assumed or required. That's one piece.
It isn't that RAID 5 is slightly less reliable than RAID 10; it is that it is a full order of magnitude less reliable than RAID 6 which, in turn, is less reliable than RAID 10. But that isn't it either.
There are business scenarios where RAID 5, risky as it is, is certainly "safe enough". The issue is that when RAID 5 is "safe enough", other factors come into play: it becomes too costly to make it safe enough, it fails to scale and remain safe, it isn't fast enough, etc. Across the business factors of safety, speed, capacity and cost, no matter which one you optimize for, RAID 5 never comes out as the right one. It can be functional, but it can't ever be the right choice.
This, in turn, means that, knowing this, having RAID 5 in our decision matrix just increases the chances that we, as emotional humans, will get confused by the extra choice and make bad decisions. We do our best decision making when known bad choices are removed from consideration. If a computer were factoring in all of the issues and making the determination, having RAID 5 in the mix would not matter. But for humans, it really matters.
Isn't that pretty much what I said - it's slightly less reliable than RAID 10?
-
@Carnival-Boy It's not just a little bit; it's 10s if not 100s of times less reliable in a recovery situation.
The likelihood of recovering from a RAID 5 failure vs a RAID 10 failure is apples vs whales.
-
@Carnival-Boy said:
Isn't that pretty much what I said - it's slightly less reliable than RAID 10?
I would not use "slightly" when "order of magnitude" is involved.
-
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
-
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math). From what I recall, 3.3 TB has like a 30% chance of hitting a URE, AKA total failure of the array. At something around 12TB there is statistically a 100% chance of hitting a URE (OK it might actually be 99.99%)
-
@scottalanmiller said:
@Dashrender said:
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
By providing it, you are presenting it as an option. Basically meaning you approved it. You can tell them your top choices, but if you include it in the list, it's approved to some degree or the conversation is confused.
What surprises me is how often IT people will present completely unreasonable options as options to management. If your car got a flat, would you offer to 1) fix the flat or 2) set the car on fire? No, you would not offer something ridiculous that isn't reasonable. But IT often does this to management.
The more likely scenario is that management will reject all provided solutions and ask why it can't be done cheaper. Of course it can be done cheaper, but with orders of magnitude more risk. What is the recommendation then?
-
Using RAID 6 or RAID 10, which are both safer than RAID 5.
-
@Dashrender said:
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math).
Cool. Facts are important here. A failure probability of 0.001% is 100 times higher than 0.00001%, so on those grounds it is two orders of magnitude less reliable. But both are such tiny numbers that they could be ignored. That's where 'slightly' more reliable would also apply.
To go back to your car analogy, a Fiat is (probably) an order of magnitude less reliable than a Honda, but both are so reliable that you wouldn't necessarily say buying a Honda is best practice. Importing a car from North Korea might, however, be considered bad practice.
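To separate the two comparisons being made here, a minimal Python sketch, using only the two percentages quoted in this post, computes both the relative gap (in orders of magnitude) and the absolute gap (in percentage points). The function name is hypothetical and chosen for illustration.

```python
import math

def orders_of_magnitude(higher: float, lower: float) -> float:
    """Return how many powers of ten separate two positive values."""
    return math.log10(higher / lower)

# Failure probabilities from the post, expressed in percent.
p_high = 0.001
p_low = 0.00001

relative_gap = orders_of_magnitude(p_high, p_low)  # a factor of 100
absolute_gap = p_high - p_low                      # in percentage points

print(f"relative gap: {relative_gap:.1f} orders of magnitude")
print(f"absolute gap: {absolute_gap:.5f} percentage points")
```

The relative gap is large while the absolute gap is tiny, which is the distinction the two sides of this exchange are arguing over.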
-
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
Can you even buy 300GB drives today?
Reliability should never be considered from a point of "already failed", that misses part of the big picture. A RAID 5 array is more likely to experience a drive failure than RAID 10 as a starting point. We need to think about the total reliability, not the reliability from a single scenario.
Imagine this question to demonstrate why this is important:
"Which is more likely to survive a front end collision of 20mph, a Volvo C70 or a Ford Pinto?" You'd say the Volvo C70, of course.
But that assumes both cars HAVE had that accident. What if that wasn't the whole scenario? Let's ask again...
"Which is more likely to injure its passengers, a Volvo C70 driving 50mph on the highway or a Ford Pinto sitting idle in a garage?"
Suddenly the tables turn, because while one is more likely to survive an accident, the other is safer by avoiding the accident which is even more effective.
-
@Dashrender said:
@scottalanmiller said:
@Dashrender said:
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
By providing it, you are presenting it as an option. Basically meaning you approved it. You can tell them your top choices, but if you include it in the list, it's approved to some degree or the conversation is confused.
What surprises me is how often IT people will present completely unreasonable options as options to management. If your car got a flat, would you offer to 1) fix the flat or 2) set the car on fire? No, you would not offer something ridiculous that isn't reasonable. But IT often does this to management.
The more likely scenario is that management will reject all provided solutions and ask why it can't be done cheaper. Of course it can be done cheaper, but with orders of magnitude more risk. What is the recommendation then?
You say that it cannot be done cheaper while meeting goals. Ask them what goal they want to drop to reduce cost.
-
@Dashrender said:
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math). From what I recall, 3.3 TB has like a 30% chance of hitting a URE, AKA total failure of the array. At something around 12TB there is statistically a 100% chance of hitting a URE (OK it might actually be 99.99%)
Not that risky on the small SAS drives that are implied. But still riskier.
-
@Carnival-Boy said:
@Dashrender said:
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math).
Cool. Facts are important here. A failure probability of 0.001% is 100 times higher than 0.00001%, so on those grounds it is two orders of magnitude less reliable. But both are such tiny numbers that they could be ignored. That's where 'slightly' more reliable would also apply.
Easy way to think of it is.... RAID 10 you should expect to go a lifetime without hearing about anyone who has ever had this issue. RAID 5 you should expect multiple complete failures in your career.
RAID 10 failure rates are less than 1 in 80,000 array years. RAID 5 is closer to 1 in 20.
There are so many factors that go into this from drives being more likely to fail, longer time for rebuilds, risk during rebuild, rebuild causing other drives to fail, risk of memory issues, etc.
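As a rough illustration of what those two rates imply over a career, here is a minimal Python sketch. It assumes, purely for illustration, that both figures are failures per array year, that failures arrive independently (a Poisson model), and that an admin is responsible for 10 arrays over 20 years (200 array years). None of those assumptions come from the thread, and the RAID 10 figure is only an upper bound on the rate. The function names are hypothetical.

```python
import math

def expected_failures(array_years: float, rate_per_array_year: float) -> float:
    """Expected number of array failures over the given exposure."""
    return array_years * rate_per_array_year

def p_at_least_one(array_years: float, rate_per_array_year: float) -> float:
    """Probability of one or more failures, assuming a Poisson process."""
    return -math.expm1(-expected_failures(array_years, rate_per_array_year))

# Hypothetical exposure: 10 arrays managed for 20 years (an assumption).
CAREER_ARRAY_YEARS = 10 * 20

# Rates quoted in the post, read here as failures per array year (an assumption).
RATES = {"RAID 5": 1 / 20, "RAID 10": 1 / 80_000}

for level, rate in RATES.items():
    mean = expected_failures(CAREER_ARRAY_YEARS, rate)
    prob = p_at_least_one(CAREER_ARRAY_YEARS, rate)
    print(f"{level}: expected failures {mean:.4f}, P(at least one) {prob:.4f}")
```

Under those assumptions the RAID 5 exposure works out to about ten expected failures and the RAID 10 exposure to about 0.0025, which lines up with the "multiple failures in your career" versus "never hear of one" framing.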
-
Based on using the different RAID types, of course.
-
Trying to eyeball the math, at 3.3TB of usable data, that RAID 5 array would fail way over 50% of the time with consumer class drives (like Red Pro). So with enterprise drives (like RE), which are 10x more reliable in regards to URE, we would expect rebuild risk from URE alone to be 5% or higher.
That is a one in twenty chance that the RAID 5 array would lose all of its data. This does not take into account secondary drive failure risk which is pretty big as well.
I would not put a one in twenty or maybe one in ten chance of failure on the same playing field as "so reliable no study can measure it completely." The RAID 10 figure of 80,000 array years was only the known healthy rate; all that is known is that it is more reliable than that. Zero failures at 80,000 array years!
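The URE part of this estimate can be sketched in a few lines of Python. This is a simplified model, not a full reliability calculation. It assumes each bit read fails independently at the drive's quoted URE rate, that 1 in 10^14 and 1 in 10^15 bits stand in for consumer and enterprise class drives (typical datasheet figures, assumed here rather than taken from the thread), that a RAID 5 rebuild must read every surviving disk in full, and that a RAID 10 rebuild reads only the failed disk's mirror partner. The disk count and size come from the 12 x 300GB example earlier in the thread; the function name is hypothetical.

```python
import math

def ure_probability(bytes_read: float, ure_rate_per_bit: float) -> float:
    """P(at least one URE) while reading bytes_read bytes, bits independent."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-ure_rate_per_bit))

DISKS = 12          # from the 12 x 300GB example
DISK_BYTES = 300e9  # 300GB per disk

raid5_read = (DISKS - 1) * DISK_BYTES  # every surviving disk is read in full
raid10_read = DISK_BYTES               # only the mirror partner is read

# Assumed URE rates per bit read (not stated in the thread).
for label, rate in (("1 in 1e14", 1e-14), ("1 in 1e15", 1e-15)):
    p5 = ure_probability(raid5_read, rate)
    p10 = ure_probability(raid10_read, rate)
    print(f"URE rate {label}: RAID 5 {p5:.1%}, RAID 10 {p10:.2%}")
```

Under this model the 12-disk RAID 5 rebuild comes out at roughly 23% for the 10^-14 rate and about 2.6% for 10^-15, against roughly 2.4% and 0.24% for RAID 10. The same function gives about 62% for a 12TB read at 10^-14. The absolute values depend heavily on the assumed rate and on the independence assumption, and secondary drive failure during the rebuild is not modeled at all.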