Safe to have a 48TB Windows volume?



  • With lots of double disk failures, the real thing you need to look at is the disks you have or the environment they are in. RAID 5 carries huge risk, but that risk shouldn't come primarily from double disk failures. The fact that double disk failures are what led you away from RAID 5 should have been a red flag that something else was wrong. A double disk failure can happen to anyone, of course, but lots of them indicates a trend that isn't RAID related.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    Some examples of things we have math to tell us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.

    We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after the first was replaced and a rebuild initiated. We'll stick with RAID 6, thanks.

    EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?

    See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?

    The fact that you and possibly your org has actually studied things is important to the discussion.

    I've published about it and speak about it all the time. The study was massive. And took forever. As you can imagine.

    One of the reasons we adopted Storage Spaces as a platform was the auto-retire and rebuild into free pool space via parallel rebuild. With 2TB and larger drives becoming ever more common at the time, rebuilds on the RAID controller were taking a very long time.

    Parallel rebuild into free pool space rebuilds that dead disk's data across all members in the pool. So a 2TB drive gets done in minutes instead of hours or days. Plus, once the dead drive's contents are rebuilt into free pool space, then so long as there is more free pool space available (size of disk + ~150GB), another disk failure can happen while still maintaining two-disk resilience (Dual Parity or 3-Way Mirror).

    RAID can't do that for us.
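    The "minutes instead of hours" claim comes down to simple arithmetic. A rough sketch with assumed per-drive throughput and pool size (illustrative numbers, not Storage Spaces measurements):

    ```shell
    # A traditional rebuild writes the whole replacement drive through one
    # target disk; a pool rebuild spreads the writes across all remaining
    # members, dividing the wall-clock time accordingly.
    awk 'BEGIN {
      size_mb  = 2e6    # 2TB failed drive, in MB
      mb_s     = 150    # assumed sustained throughput per drive, MB/s
      members  = 12     # assumed remaining drives sharing the rebuild
      serial   = size_mb / mb_s / 3600   # hours through one target drive
      parallel = serial * 60 / members   # minutes spread across the pool
      printf "single-target: %.1f h, pooled: %.0f min\n", serial, parallel
    }'
    ```

    The speed-up holds only as long as the pool has enough members and enough free space to absorb the failed drive's contents.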



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    Some examples of things we have math to tell us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.

    We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after the first was replaced and a rebuild initiated. We'll stick with RAID 6, thanks.

    EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?

    See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?

    The fact that you and possibly your org has actually studied things is important to the discussion.

    I've published about it and speak about it all the time. The study was massive. And took forever. As you can imagine.

    One of the reasons we adopted Storage Spaces as a platform was the auto-retire and rebuild into free pool space via parallel rebuild. With 2TB and larger drives becoming ever more common at the time, rebuilds on the RAID controller were taking a very long time.

    RAID 1 / 10 rebuilds are generally... acceptable. But if you were choosing RAID 5 or 6, then the rebuild times are absurd, and we've known that for a long time. It simply takes that much work to rebuild an array of that nature.

    But this goes back to the earlier discussion: if you were using math rather than emotions before moving to RAIN, it seems it would have kept you on RAID 1 or 10 all along.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    With lots of double disk failures, the real thing you need to look at is the disks you have or the environment they are in. RAID 5 carries huge risk, but that risk shouldn't come primarily from double disk failures. The fact that double disk failures are what led you away from RAID 5 should have been a red flag that something else was wrong. A double disk failure can happen to anyone, of course, but lots of them indicates a trend that isn't RAID related.

    One was environment. The site had the HVAC above the ceiling tiles all messed up, with primary paths not capped. So the air return did not work, and the summer A/C output stayed above the ceiling tiles, as did the heat in the winter. The server closet could easily hit 40C during the winter. There were no spare circuits available anywhere in the leased space, so we couldn't even get a portable A/C unit in there.

    We experienced four, count them, four catastrophic failures at that site. The owners knew why, but we were helpless against it. So we built out a highly available system using two servers, third-party products, and a really good backup set (BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).

    There's statistics. Then there's d*mned statistics. 😉



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    Parallel rebuild into free pool space rebuilds that dead disk's data across all members in the pool. So a 2TB drive gets done in minutes instead of hours or days. Plus, once the dead drive's contents are rebuilt into free pool space, then so long as there is more free pool space available (size of disk + ~150GB), another disk failure can happen while still maintaining two-disk resilience (Dual Parity or 3-Way Mirror).

    RAID can't do that for us.

    Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.

    Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    Parallel rebuild into free pool space rebuilds that dead disk's data across all members in the pool. So a 2TB drive gets done in minutes instead of hours or days. Plus, once the dead drive's contents are rebuilt into free pool space, then so long as there is more free pool space available (size of disk + ~150GB), another disk failure can happen while still maintaining two-disk resilience (Dual Parity or 3-Way Mirror).

    RAID can't do that for us.

    Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.

    Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.

    We're seeing the same thing in solid state now too. As SSD vendors deliver larger and larger capacity devices, the write speeds suddenly become a limiting factor. Go figure. :S



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    Parallel rebuild into free pool space rebuilds that dead disk's data across all members in the pool. So a 2TB drive gets done in minutes instead of hours or days. Plus, once the dead drive's contents are rebuilt into free pool space, then so long as there is more free pool space available (size of disk + ~150GB), another disk failure can happen while still maintaining two-disk resilience (Dual Parity or 3-Way Mirror).

    RAID can't do that for us.

    Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.

    Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.

    We're seeing the same thing in solid state now too. As SSD vendors deliver larger and larger capacity devices, the write speeds suddenly become a limiting factor. Go figure. :S

    Yes, it's unlikely RAID will ever make a big comeback. The scale of storage in the future simply makes device-centric protection ineffective long term.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    (BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).

    What is BUE?



  • @FATeknollogee said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    (BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).

    What is BUE?

    BackUp Exec from Symantec



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?

    The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table. The number of times the same thing happened in a RAID 1 setting was also a factor.



  • @FATeknollogee said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    (BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).

    What is BUE?

    Sorry, I should have spelled the acronym out. It's Backup Exec: at one time by Colorado, when it was an awesome product, then by Symantec, when things went downhill from there.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?

    The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.

    Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.



  • @travisdh1 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?

    The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.

    Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.

    Concur, but the stress is a lot more distributed.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @travisdh1 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?

    The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.

    Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.

    Concur, but the stress is a lot more distributed.

    I think you're missing my point. Rebuilding a RAID1 or 10 array stresses 1 drive. Rebuilding a RAID 5 or 6 array stresses every single drive in the array. Stressing 1 drive vs stressing X drives where we know X is always greater than 1.
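    The one-versus-X point can be made concrete by comparing how much data each scheme must read to rebuild one failed 2TB drive (assumed 8-drive arrays, illustrative only):

    ```shell
    # RAID 1/10 rebuilds read only the surviving mirror partner in full;
    # parity RAID rebuilds read every surviving member in full to recompute
    # the lost drive's contents.
    awk 'BEGIN {
      drive_tb = 2                        # capacity of the failed drive
      drives   = 8                        # assumed array width
      raid10   = drive_tb * 1             # read from the one mirror partner
      raid6    = drive_tb * (drives - 1)  # read from every surviving member
      printf "RAID 10 reads %d TB from 1 drive; RAID 6 reads %d TB across %d drives\n", raid10, raid6, drives - 1
    }'
    ```

    The wider the array, the more total data a parity rebuild has to pull, which is why this effect worsens as arrays grow.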



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?

    The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table. The number of times the same thing happened in a RAID 1 setting was also a factor.

    The stress is tiny, so small that the industry hasn't recognized it as a known stress. There has to be some, but it is very small.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @travisdh1 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?

    The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.

    Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.

    Concur, but the stress is a lot more distributed.

    That's not correct. It's not distributed, it is multiplied. It is distributed in the sense that it impacts every drive in the array, yes. But it is more for each drive than for any one drive in the RAID 10. So dramatically so that it's an industry wide recognized stress, while the other is not.

    It's also important to remember that that stress has far more chance of breaking something when there are many drives in many different states to affect, and vastly more time over which to affect them; if any one of them fails, you have issues, whereas with RAID 10 there is only one drive you are worried about.

    You should not downplay this, it's a very real risk factor.



  • @travisdh1 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @travisdh1 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?

    The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.

    Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.

    Concur, but the stress is a lot more distributed.

    I think you're missing my point. Rebuilding a RAID1 or 10 array stresses 1 drive. Rebuilding a RAID 5 or 6 array stresses every single drive in the array. Stressing 1 drive vs stressing X drives where we know X is always greater than 1.

    And stresses the one very little, instead of all quite a lot.



  • I'm not convinced about the "stress" for a rebuild.

    The drive doesn't mechanically fail because it's built to work under high load. It can't be thermal because drives would slow down if they get too hot. It's not electrical because the drive doesn't die. So what is it then? Isn't it just that we encounter an unrecoverable read error because we are statistically bound to have one happen at some point? And because we are reading a lot of bits, it's more probable to happen during the rebuild.
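    The "statistically bound" point can be put into rough numbers. A sketch using assumed figures (the commonly quoted consumer-class rate of one URE per 1e14 bits read, and a 12TB total read during a rebuild), not measured data:

    ```shell
    # Poisson approximation: P(at least one URE) = 1 - exp(-bits_read * rate)
    awk 'BEGIN {
      ure_rate = 1e-14            # assumed: 1 URE per 1e14 bits (spec-sheet figure)
      tb_read  = 12               # assumed: total data read during the rebuild, in TB
      bits     = tb_read * 8e12   # TB -> bits (1 TB = 8e12 bits)
      printf "P(at least one URE) = %.2f\n", 1 - exp(-bits * ure_rate)
    }'
    ```

    Under these assumptions the chance of hitting at least one URE over the whole rebuild is roughly 60%, which is why the per-bit math alone already makes large parity rebuilds look risky, before any mechanical stress is considered.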



  • Just a side note but if you get a double failure on an array all is not lost.

    Thing is, most drives are not dead; they have some bad blocks, they get kicked out of the array, and when you don't have enough redundancy the array is shut down.

    So you remove the bad drive and clone it to a new drive on another machine, for instance using dd conv=sync,noerror, which means the drive will be cloned as well as it can be and any bad blocks are overwritten with zeroes. Now you can put the new drive back in the array and rebuild the array.

    It will rebuild fine, but the file or files that had bad blocks in them will be corrupted and will have to be restored from backup. However, the vast majority of the files will be fine.

    In theory you could clone both failed drives. It's unlikely that both drives have bad blocks in the same location, so theoretically speaking it is very likely that all your data is intact. You would need to use the data from the cloning process to know which blocks were bad on each drive, and the rebuild process would have to take that information into consideration when rebuilding. Or, come to think of it, if you could tell the array that all drives are fine and then do a data scrub, the array would be repaired correctly.
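    A minimal sketch of the cloning idea above, with placeholder device names (/dev/sdX for the failing drive, /dev/sdY for its replacement); the runnable part below operates on an ordinary file so nothing real is touched:

    ```shell
    # On a real pair of drives the command would be something like:
    #   dd if=/dev/sdX of=/dev/sdY bs=64K conv=sync,noerror status=progress
    # conv=noerror keeps dd going past read errors; conv=sync pads each short
    # read out with zeroes so the copy stays block-aligned.

    # Harmless demonstration of the same invocation on a plain file:
    dd if=/dev/urandom of=source.img bs=4096 count=256 2>/dev/null
    dd if=source.img of=clone.img bs=4096 conv=sync,noerror 2>/dev/null
    cmp -s source.img clone.img && echo "clone OK"
    ```

    For the two-drive recovery theory, GNU ddrescue is usually the better tool, since it logs exactly which blocks were unreadable, which is the information a rebuild would need to reconcile the two clones.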



  • @Pete-S said in Safe to have a 48TB Windows volume?:

    Just a side note but if you get a double failure on an array all is not lost.

    Thing is, most drives are not dead; they have some bad blocks, they get kicked out of the array, and when you don't have enough redundancy the array is shut down.

    So you remove the bad drive and clone it to a new drive on another machine, for instance using dd conv=sync,noerror, which means the drive will be cloned as well as it can be and any bad blocks are overwritten with zeroes. Now you can put the new drive back in the array and rebuild the array.

    It will rebuild fine, but the file or files that had bad blocks in them will be corrupted and will have to be restored from backup. However, the vast majority of the files will be fine.

    In theory you could clone both failed drives. It's unlikely that both drives have bad blocks in the same location, so theoretically speaking it is very likely that all your data is intact. You would need to use the data from the cloning process to know which blocks were bad on each drive, and the rebuild process would have to take that information into consideration when rebuilding. Or, come to think of it, if you could tell the array that all drives are fine and then do a data scrub, the array would be repaired correctly.

    If it's a parity RAID, you don't have files on the drive. It's all parity. You get a URE and it's over. If any of that parity data is gone, you can't restore some random parity bit from backup.



  • @Obsolesce said in Safe to have a 48TB Windows volume?:

    @Pete-S said in Safe to have a 48TB Windows volume?:

    Just a side note but if you get a double failure on an array all is not lost.

    Thing is, most drives are not dead; they have some bad blocks, they get kicked out of the array, and when you don't have enough redundancy the array is shut down.

    So you remove the bad drive and clone it to a new drive on another machine, for instance using dd conv=sync,noerror, which means the drive will be cloned as well as it can be and any bad blocks are overwritten with zeroes. Now you can put the new drive back in the array and rebuild the array.

    It will rebuild fine, but the file or files that had bad blocks in them will be corrupted and will have to be restored from backup. However, the vast majority of the files will be fine.

    In theory you could clone both failed drives. It's unlikely that both drives have bad blocks in the same location, so theoretically speaking it is very likely that all your data is intact. You would need to use the data from the cloning process to know which blocks were bad on each drive, and the rebuild process would have to take that information into consideration when rebuilding. Or, come to think of it, if you could tell the array that all drives are fine and then do a data scrub, the array would be repaired correctly.

    If it's a parity RAID, you don't have files on the drive. It's all parity. You get a URE and it's over. If any of that parity data is gone, you can't restore some random parity bit from backup.

    If you mean recovering as well as possible from a double failure, I wasn't talking about copying files. I was talking about cloning drives, bit by bit.



  • @Pete-S said in Safe to have a 48TB Windows volume?:

    I'm not convinced about the "stress" for a rebuild.

    It's easy to say. But decades of industry knowledge have accepted it as a known stress. I would not go against that without solid reasoning.

    Parity rebuilds quite obviously do a lot more work than mirrored copies, over a longer period of time, so additional stress both per drive and across the array is just common sense. That the evidence was strong enough to make this common knowledge is exactly what you'd expect, given the obvious additional wear and tear that hits all at once.



  • @Pete-S said in Safe to have a 48TB Windows volume?:

    The drive doesn't mechanically fail because it's built to work under high load. It can't be thermal because drives would slow down if they get too hot. It's not electrical because the drive doesn't die. So what is it then?

    This statement makes no sense. The theory here is that drives don't experience stress because they are "designed to work under high load." This is simply untrue. Are they "designed to work under high load"? Sure, whatever that means. But that doesn't change the fact that higher loads create higher stress.

    My car's engine is "designed to work up to the red line." Whatever that means. But running the car near the red line creates a lot more stress than running it at lower speeds, so engines run near the red line regularly die much earlier than ones that are not, and they are most likely to die at their highest RPMs.

    Basically, you are missing two key facts. The first is that "designed to work under high load" is a meaningless phrase; it tells us nothing other than that some stressful situation is planned for, but failures are planned for too, so it is not useful for determining anything else. The second is that higher stress is higher stress, regardless of what the drive is intended to handle.

    Otherwise, drive failures would be impossible, since no drive is "designed to fail." Since drives do fail, we know that your statement doesn't demonstrate what you were saying.



  • @Pete-S said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @Pete-S said in Safe to have a 48TB Windows volume?:

    Just a side note but if you get a double failure on an array all is not lost.

    Thing is, most drives are not dead; they have some bad blocks, they get kicked out of the array, and when you don't have enough redundancy the array is shut down.

    So you remove the bad drive and clone it to a new drive on another machine, for instance using dd conv=sync,noerror, which means the drive will be cloned as well as it can be and any bad blocks are overwritten with zeroes. Now you can put the new drive back in the array and rebuild the array.

    It will rebuild fine, but the file or files that had bad blocks in them will be corrupted and will have to be restored from backup. However, the vast majority of the files will be fine.

    In theory you could clone both failed drives. It's unlikely that both drives have bad blocks in the same location, so theoretically speaking it is very likely that all your data is intact. You would need to use the data from the cloning process to know which blocks were bad on each drive, and the rebuild process would have to take that information into consideration when rebuilding. Or, come to think of it, if you could tell the array that all drives are fine and then do a data scrub, the array would be repaired correctly.

    If it's a parity RAID, you don't have files on the drive. It's all parity. You get a URE and it's over. If any of that parity data is gone, you can't restore some random parity bit from backup.

    If you mean recovering as well as possible from a double failure, I wasn't talking about copying files. I was talking about cloning drives, bit by bit.

    That can work. But often fails. How often have you seen a parity array recover from that?

    And when you do see it happen, how long does that typically take? Time to recover is a factor that is often overlooked in reliability equations. But in the real world, especially with parity, we often have to discuss with customers "given how long this is expected to take to recover, would it not be better to have just lost the data and started over?"

    When downtime is days, or weeks, or even months to "protect against data loss", often you just have to give up because it took too long.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Pete-S said in Safe to have a 48TB Windows volume?:

    I'm not convinced about the "stress" for a rebuild.

    It's easy to say. But decades of industry knowledge have accepted it as a known stress. I would not go against that without solid reasoning.

    Parity rebuilds quite obviously do a lot more work than mirrored copies, over a longer period of time, so additional stress both per drive and across the array is just common sense. That the evidence was strong enough to make this common knowledge is exactly what you'd expect, given the obvious additional wear and tear that hits all at once.

    Yes, there are more drives involved as well, which increases the risk. But is a second failure really statistically more likely to happen during the rebuild process than if you read the same amount of data from the same number of drives?

    Have there been so many double failures during rebuilds that there is actual data on this? The only thing I've seen has been calculations involving just the number of bits and the likelihood of a URE.



  • @Pete-S said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Pete-S said in Safe to have a 48TB Windows volume?:

    I'm not convinced about the "stress" for a rebuild.

    It's easy to say. But decades of industry knowledge have accepted it as a known stress. I would not go against that without solid reasoning.

    Parity rebuilds quite obviously do a lot more work than mirrored copies, over a longer period of time, so additional stress both per drive and across the array is just common sense. That the evidence was strong enough to make this common knowledge is exactly what you'd expect, given the obvious additional wear and tear that hits all at once.

    Yes, there are more drives involved as well, which increases the risk. But is a second failure really statistically more likely to happen during the rebuild process than if you read the same amount of data from the same number of drives?

    I'm not the one that has done the study on this. But it is accepted as commonly known to be statistically significant. So yes, absolutely, 100%... the industry believes that you get a much higher failure rate during that time.



  • @Pete-S said in Safe to have a 48TB Windows volume?:

    Have there been so many double failures during rebuilds that there is actual data on this? The only thing I've seen has been calculations involving just the number of bits and the likelihood of a URE.

    It's high enough that it is considered to be true industry wide. It cannot be calculated, because it's not a simple equation like a URE. There are many factors, and mechanical stress is a far more subtle thing. UREs occur at a nearly stable rate (though not perfectly stable), and we use a common number for the failure rate under ideal conditions.

    Mechanical wear increase rates are insanely hard to get math on. The same goes for the URE increase under the same stress. The problem is, both increase enough to be a real problem.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Pete-S said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @Pete-S said in Safe to have a 48TB Windows volume?:

    Just a side note but if you get a double failure on an array all is not lost.

    The thing is that most failed drives are not dead; they have some bad blocks, they get kicked out of the array, and when you don't have enough redundancy the array is shut down.

    So you remove the bad drive and clone it to a new drive on another machine, for instance using dd conv=sync,noerror, which means that the drive will be cloned as well as it can be and any unreadable blocks are written out as zeroes. Now you can put the new drive back in the array and rebuild the array.
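    What conv=sync,noerror does can be illustrated with a small Python sketch (file paths here are placeholders; real drive cloning would run dd or ddrescue against the block devices as root):

    ```python
    import os

    def clone_best_effort(src_path, dst_path, block=512):
        """Mimic `dd conv=sync,noerror`: copy block by block, and where a
        block cannot be read, write zeroes instead of aborting.

        Sketch only: operates on regular files; real device cloning needs
        dd/ddrescue, since getsize() does not work on block devices.
        """
        bad_blocks = []
        size = os.path.getsize(src_path)
        nblocks = (size + block - 1) // block
        with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
            for i in range(nblocks):
                try:
                    src.seek(i * block)
                    chunk = src.read(block)
                except OSError:
                    bad_blocks.append(i)   # unreadable sector: note it and move on
                    chunk = b""
                # conv=sync pads short or failed reads to a full block of zeroes
                dst.write(chunk.ljust(block, b"\x00"))
        return bad_blocks
    ```

    The returned list of bad block numbers is exactly the information you would want later, to know which files were touched by the zeroed regions.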

    It will rebuild fine but the file or files that had bad blocks in them will be corrupted and will have to be restored from backup. However the vast majority of the files will be fine.

    In theory you could clone both failed drives. And it's unlikely that both drives have bad blocks in the same location. So theoretically speaking, it is very likely that all your data is intact. You would need to use the data from the cloning process to know which blocks were bad on each drive, and then the rebuild process would have to take that information into consideration when rebuilding. Or, come to think of it, if you could tell the array that all drives are fine and then run a data scrub, the array would be repaired correctly.
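    The "bad blocks in different locations" intuition can be illustrated with RAID 5's XOR parity: as long as at most one member is unreadable at any given stripe offset, that stripe is fully reconstructable. A toy sketch (not any vendor's implementation):

    ```python
    from functools import reduce

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    # A 3-data + 1-parity stripe.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # Lose any single member at this stripe offset...
    lost_index = 1
    survivors = [blk for i, blk in enumerate(data) if i != lost_index] + [parity]

    # ...and XOR of the survivors reconstructs it exactly.
    recovered = xor_blocks(survivors)
    assert recovered == data[lost_index]
    ```

    Two cloned drives with zeroed blocks at different offsets never lose two members of the same stripe, which is why a scrub over both clones could in principle repair everything.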

    If it's a parity RAID, you don't have files on the drive. It's all parity. You get a URE and it's over. If any of that parity data is gone, you can't restore some random parity bit from backup.

    If you mean recovering as good as possible from a double failure I wasn't talking about copying files. I was talking about cloning drives - bit by bit.

    That can work. But often fails. How often have you seen a parity array recover from that?

    And when you do see it happen, how long does that typically take? Time to recover is a factor that is often overlooked in reliability equations. But in the real world, especially with parity, we often have to discuss with customers "given how long this is expected to take to recover, would it not be better to have just lost the data and started over?"

    When downtime is days, or weeks, or even months to "protect against data loss", often you just have to give up because it took too long.

    I've only had to do it once myself. RAID 5 with six drives, I think, with 4 TB drives, so I guess a 20 TB array. The second drive failed during rebuild. I cloned the second failed drive and rebuilt again. There was a problem with the backup, so it was more a question of getting as much as possible up again and then figuring things out.

    But speed is interesting and that should be one of the primary factors when coming up with a strategy on how to backup and restore.
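    For a rough sense of that time factor, a quick sketch (the 150 MB/s sustained throughput is an assumption for illustration; real rebuilds are often throttled well below a drive's sequential speed):

    ```python
    def rebuild_hours(array_tb, mb_per_s):
        """Best-case time to stream through an array once, in hours."""
        total_mb = array_tb * 1_000_000
        return total_mb / mb_per_s / 3600

    # The 20 TB array from the anecdote above, at a generous 150 MB/s:
    print(f"{rebuild_hours(20, 150):.0f} hours")   # roughly a day and a half

    # The 48 TB volume from the thread title, same throughput:
    print(f"{rebuild_hours(48, 150):.0f} hours")   # nearly four days
    ```

    And that is a single pass at full speed; a clone-then-rebuild recovery streams the data more than once, which is where the days-to-weeks figures come from.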



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Pete-S said in Safe to have a 48TB Windows volume?:

    The drive doesn't mechanically fail, because it's built to work under high load. It can't be thermal, because drives would slow down if they got too hot. It's not electrical, because the drive doesn't die. So what is it then?

    This statement makes no sense. The theory here is that drives don't experience stress because they are "designed to work under high load." This is simply untrue. Are they "designed to work under high load"? Sure, whatever that means. But that doesn't change the fact that higher loads create higher stress.

    My car's engine is "designed to work up to the red line." Whatever that means. But running the car near the red line creates a lot more stress than running it at slower speeds. So engines run near the red line regularly die much earlier than ones that are not. And they are most likely to die when at their highest RPMs.

    Basically you are missing two key facts. The first is that "designed to work under high load" is a meaningless phrase; it tells us nothing other than that some stressful situation is planned for, but failures are planned for, too. So that is not useful in determining anything else. And the second is that higher stress is higher stress, regardless of what the drive is intended to handle.

    Otherwise, drive failures would be impossible, since no drive is "designed to fail." Since drives do fail, we know that your statement doesn't indicate what you were saying.

    If we look at the mechanical stress, a rebuild is not so bad, because it's sequential reading and writing. The disks are spinning regardless of whether you are reading, writing, or doing nothing, so that makes no difference. And the arms don't have to move all over the place when you're doing sequential block operations.

    But what I was implying is that the second drive doesn't fail for a mechanical reason because that would be the end of the drive. Isn't that your experience as well, that you end up with bad blocks but not dead drives?



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @FATeknollogee said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    (BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).

    What is BUE?

    Sorry, I should have broken the acronym out. It's Backup Exec: at one time by Colorado, when it was an awesome product, then by Symantec, when things went downhill from there.

    Now they are Veritas, just in case you were wondering.

