Safe to have a 48TB Windows volume?
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
The only answer that has any value to me at this point is this: Are the backups taken test-restored to bare metal or bare hypervisor? Has your hyper-scale whatever been tested to fail over without data loss?
I think this is a terrible approach. It leads to creating systems that mathematically or statistically we'd expect to fail. If this were our true thought process, we'd skip tried and true systems like RAID because we wouldn't trust them (even with studies that show how reliable they are), on the grounds that we were connecting them to some unethical SAN vendor who made up false reliability stats and hides all failures from the public to trick us. We can't allow an emotional reaction to salespeople trying to trick us with clearly false data to lead us into doing something dangerous.
There is a lot of real, non-vendor information out there in the industry. And a lot of just common sense. And some real studies on reliability that are actually based on math. We don't have to be blind or emotional. With good math, observation, elimination of marketing information, logic, and common sense... we can have a really good starting point. Are we still partially blind? Of course. But can we start from an educated point with a low level of risk? Absolutely.
Basically, just because you can still have an accident doesn't mean that you shouldn't keep wearing your seatbelt and avoiding potholes.
-
Some examples of things where we have math to tell us what's good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know, by collecting anecdotes and knowing total sales figures, that the failure rate from the observed failures alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.
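To make that second point concrete, here is a minimal sketch of the reasoning (the failure count and sales figure are invented placeholders, not real P2000 data): the failures we can actually see put a hard floor under the fleet-wide rate, no matter how well the unobserved units behaved.

```python
# Lower bound on a product's fleet-wide failure rate from anecdotes alone.
# Both numbers below are invented placeholders, not real P2000 figures.

observed_failures = 40        # failures we have personally seen or collected
total_units_sold = 100_000    # generous upper estimate of units ever sold

# Even if every unit we did NOT observe were perfect, the fleet-wide
# failure rate cannot be lower than the failures we actually witnessed.
floor_rate = observed_failures / total_units_sold
print(f"Failure-rate floor: {floor_rate:.4%}")      # 0.0400%

# Compare against a durability expectation of, say, five nines per unit:
acceptable_rate = 1 - 0.99999
print(f"Acceptable rate:    {acceptable_rate:.4%}")  # 0.0010%
print("Fails the bar on observed data alone:", floor_rate > acceptable_rate)
```
-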
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are ReFS and hold a lot of large virtual disks.
ReFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production level with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever, reputable scenarios in correct use cases.
The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means that no matter how many people have "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready as a baseline for normal people.
But even systems that lose data 90% of the time work perfectly for 10% of people.
The problem I have with this perspective is that some of us have more direct contact with folks who have had their SAN storage blow up on them, yet nothing gets seen in public. One that did make the news is the Australian Government's very public SAN blow-out a few years ago.
There is no solution out there that's perfect. None. Nada. Zippo. Zilch.
All solutions blow up, have failures, lose data, and outright stop working.
Thus, in my mind citing up-time, reliability, or any other such statistic is a moot point. It's essentially useless.
Not at all. Reliability stats are SUPER important. There's a ton of value there. When we are dealing with systems expecting durability like this, those stats tell us a wealth of information. You can't dismiss the only data we have on reliability. It's far from useless.
BackBlaze is probably the only vendor I can think of that has told the drive vendors to take a flying leap and published what I consider to be real reliability statistics.
There are vendors, VMware for vSAN and Nutanix come to mind, that have specific NDAs in place that block any mention of their product's reliability and performance.
Drive vendors also have a similar clause, but note BackBlaze.
Other than BackBlaze, the reliability statistics that I consider reliable are the ones we have based on all of the solution sets we've built and deployed or worked with over the years. Those numbers tell a pretty good story. But so, too, do the statistics that come about as a result of the aforementioned panicked phone call.
Anything else in the public sphere has about the same weight as CRN, PCMag, Consumer Reports, or any other marketing-fluff outlet.
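As a sketch of the seven-nines point from the quoted exchange above (the deployment count is an arbitrary illustration): at that durability level nearly everyone has a good experience, which is exactly why positive anecdotes carry no information.

```python
# Why "it works great for me" is uninformative at high durability targets.
# The deployment count is an arbitrary illustrative assumption.

annual_loss_prob = 1 - 0.9999999    # seven nines durability ~ 1e-7 per year
deployments = 1_000_000

expected_losses = annual_loss_prob * deployments
print(f"Expected data-loss events per year: {expected_losses:.2f}")   # ~0.10

# Even a product a thousand times worse than that bar still gives
# 99.99% of its users a "good experience" in any given year:
bad_product_loss_prob = annual_loss_prob * 1000
print(f"Happy users on the bad product: {1 - bad_product_loss_prob:.2%}")
```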
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things where we have math to tell us what's good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know, by collecting anecdotes and knowing total sales figures, that the failure rate from the observed failures alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.
We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and initiating a rebuild. We'll stick with RAID 6, thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both like, WTF?
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
BackBlaze is probably the only vendor I can think of that has told the drive vendors to take a flying leap and published what I consider to be real reliability statistics.
There are vendors, VMware for vSAN and Nutanix come to mind, that have specific NDAs in place that block any mention of their product's reliability and performance.
Drive vendors also have a similar clause, but note BackBlaze.
Other than BackBlaze, the reliability statistics that I consider reliable are the ones we have based on all of the solution sets we've built and deployed or worked with over the years. Those numbers tell a pretty good story. But so, too, do the statistics that come about as a result of the aforementioned panicked phone call.
Anything else in the public sphere has about the same weight as CRN, PCMag, Consumer Reports, or any other marketing-fluff outlet.
Needing someone else to do a study for you is part of the issue. I myself have done the largest RAID study I've ever heard of (over 80,000 array years).
Sure, there are loads of things we have to be blind to. But there is a ton that we know, and a ton that we can reasonably extrapolate.
We have a lot more information than people give us credit for. But people tend to focus on the lack of big vendors doing big studies, which sadly are just impossible to have happen. We expect reliability rates so high that often you can't study them on products, ever. We simply don't make and run products long enough for even the vendors to know.
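As a sketch of what a study on that scale can actually claim (the zero-failure count here is assumed purely for illustration), the classical "rule of three" gives the upper confidence bound:

```python
import math

# What ~80,000 array-years of observation lets us claim, assuming
# (purely for illustration) that zero array losses were observed.

array_years = 80_000

# Rule of three: with zero events in n independent trials, the 95% upper
# confidence bound on the per-trial event rate is approximately 3/n.
upper_bound = 3 / array_years
print(f"95% upper bound on annual array-loss rate: {upper_bound:.6%}")

# Roughly how many "nines" of annual durability that demonstrates:
print(f"Demonstrated nines: {-math.log10(upper_bound):.1f}")   # ~4.4
```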
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and initiating a rebuild. We'll stick with RAID 6, thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both like, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.
Had you used RAID 6, it might have failed too, possibly worse, and we'd be having the opposite conversation about how you can never trust RAID 6.
Bottom line, using individual anecdotes for answers is the one thing we know is bad to do.
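A small simulation shows why (both loss rates here are invented placeholders, with RAID 10 deliberately modeled as the safer option): a noticeable share of shops will still witness a RAID 10 loss while their RAID 6 record stays clean, and the anecdote alone cannot distinguish that from RAID 6 actually being safer.

```python
import random

# Anecdotes vs. math: both loss rates are invented placeholders, and
# RAID 10 is deliberately modeled as MORE reliable than RAID 6 here.
random.seed(42)

P_LOSS_RAID10 = 0.001   # assumed annual array-loss probability
P_LOSS_RAID6 = 0.003    # assumed annual array-loss probability
SHOPS = 100_000
TRIALS = 10 * 5         # 10 arrays of each type per shop, over 5 years

burned = 0
for _ in range(SHOPS):
    saw_r10_loss = any(random.random() < P_LOSS_RAID10 for _ in range(TRIALS))
    saw_r6_loss = any(random.random() < P_LOSS_RAID6 for _ in range(TRIALS))
    if saw_r10_loss and not saw_r6_loss:
        burned += 1

# Shops with a RAID 10 horror story and a spotless RAID 6 record,
# despite RAID 10 being safer by construction in this model:
print(f"{burned / SHOPS:.1%} of shops")   # ~4%
```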
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.
Apparently he needed the 3-drive RAID 10 pairs that other guy was running.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
The fact that you and possibly your org have actually studied things is important to the discussion.
We've had enough double disk failures over time to have influenced the decision to drop RAID 5. The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls in line with what we've experienced failure-wise across the board.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
The fact that you and possibly your org have actually studied things is important to the discussion.
I've published about it and speak about it all the time. The study was massive. And took forever. As you can imagine.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls in line with what we've experienced failure-wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional, unless your discovery was that data loss simply didn't affect you and that increasing risk was okay in order to save money by needing fewer disks?
-
With lots of double disk failures, the real thing you need to be looking at is the disks that you have or the environment that they are in. RAID 5 carries huge risk, but it shouldn't primarily be from double disk failures. The fact that that is what led you away from RAID 5 should have been a red flag that something else was wrong. A double disk failure can happen to anyone, of course, but lots of them indicates a trend that isn't RAID related.
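A rough sketch of why that is (every drive statistic below is an illustrative assumption): at ordinary failure rates, a second outright drive death inside a rebuild window is rare; on a large RAID 5 array the dominant rebuild risk is an unrecoverable read error, so a pattern of genuine double-death failures points at the drives or their environment.

```python
# Second-drive death vs. URE during a RAID 5 rebuild.
# Every drive statistic below is an illustrative assumption.

AFR = 0.03              # annual failure rate per drive (3%)
REBUILD_HOURS = 24      # length of the rebuild window
SURVIVORS = 7           # drives left in an 8-drive RAID 5 after one death
DRIVE_TB = 8
URE_RATE = 1e-14        # unrecoverable read errors per bit read

# Chance that any surviving drive dies outright during the rebuild window:
p_one = AFR * REBUILD_HOURS / (365 * 24)
p_second_death = 1 - (1 - p_one) ** SURVIVORS
print(f"P(second drive death during rebuild): {p_second_death:.4%}")  # ~0.06%

# Chance of at least one URE while reading every surviving drive in full:
bits_read = SURVIVORS * DRIVE_TB * 1e12 * 8
p_ure = 1 - (1 - URE_RATE) ** bits_read
print(f"P(URE during rebuild):                {p_ure:.1%}")           # ~99%
```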
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
I've published about it and speak about it all the time. The study was massive. And took forever. As you can imagine.
One of the reasons we adopted Storage Spaces as a platform was the auto-retire and rebuild into free pool space via parallel rebuild. With 2TB and larger drives becoming all the more common at that time, rebuilds on the RAID controller were taking a very long time.
Parallel rebuild into free pool space rebuilds that dead disk across all members in the pool, so 2TB gets done in minutes instead of hours or days. Plus, once that dead drive's contents are rebuilt into free pool space, as long as there is more free pool space to be had (size of disk + ~150GB), another disk failure can happen and the pool still maintains its two-disk resilience (Dual Parity or 3-Way Mirror).
RAID can't do that for us.
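A minimal sketch of that free-pool-space rule (the function and the worst-case ordering are my own approximation of the described behavior, not the actual Storage Spaces algorithm; the ~150GB slack figure comes from the post above):

```python
# Approximation of the free-pool-space rule described above; this is a
# sketch of the idea, not the actual Storage Spaces allocation algorithm.

SLACK_GB = 150  # per-rebuild headroom, per the rule of thumb above

def failures_absorbable(disk_sizes_gb: list[int], free_pool_gb: int) -> int:
    """Count successive whole-disk failures the free pool can rebuild."""
    absorbed = 0
    # Worst case: assume the largest disks fail first.
    for disk in sorted(disk_sizes_gb, reverse=True):
        needed = disk + SLACK_GB
        if free_pool_gb < needed:
            break
        free_pool_gb -= needed
        absorbed += 1
    return absorbed

# Example: twelve 2TB (2,000GB) disks with 5TB of free pool space can
# absorb two failures back to back while keeping resilience intact.
print(failures_absorbable([2000] * 12, free_pool_gb=5000))   # -> 2
```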
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
One of the reasons we adopted Storage Spaces as a platform was the auto-retire and rebuild into free pool space via parallel rebuild. With 2TB and larger drives becoming all the more common at that time, rebuilds on the RAID controller were taking a very long time.
RAID 1 / 10 rebuilds are generally... acceptable. But if you were choosing RAID 5 or 6, then the rebuilds are absurd. We've known that for a long time; it just takes so much work to rebuild a system of that nature.
But this goes back to the earlier discussion: if you were using math rather than emotion before moving to RAIN, it seems it would have kept you on RAID 1 or 10 all along.
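A back-of-the-envelope sketch of the rebuild arithmetic behind both posts (all throughputs and sizes are illustrative assumptions):

```python
# Back-of-the-envelope rebuild times; all throughputs are assumptions.

DISK_TB = 2
MIRROR_WRITE_MBPS = 150       # one replacement disk, sequential write
PARITY_EFFECTIVE_MBPS = 50    # parity rebuild slowed by reads and live I/O
POOL_SURVIVORS = 11           # surviving members in a 12-disk pool
PER_MEMBER_MBPS = 100         # parallel-rebuild share each member absorbs

def rebuild_hours(tb: float, mbps: float) -> float:
    return tb * 1_000_000 / mbps / 3600   # TB -> MB, then seconds -> hours

print(f"RAID 1/10 mirror copy:   {rebuild_hours(DISK_TB, MIRROR_WRITE_MBPS):.1f} h")
print(f"RAID 5/6 parity rebuild: {rebuild_hours(DISK_TB, PARITY_EFFECTIVE_MBPS):.1f} h")

# Parallel rebuild spreads the dead disk's contents across the whole pool:
parallel = rebuild_hours(DISK_TB, PER_MEMBER_MBPS * POOL_SURVIVORS)
print(f"Pool parallel rebuild:   {parallel * 60:.0f} min")
```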
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
With lots of double disk failures, the real thing you need to be looking at is the disks that you have or the environment that they are in. RAID 5 carries huge risk, but it shouldn't primarily be from double disk failures. The fact that that is what led you away from RAID 5 should have been a red flag that something else was wrong. A double disk failure can happen to anyone, of course, but lots of them indicates a trend that isn't RAID related.
One was environment. The site had the HVAC above the ceiling tiles all messed up, with primary paths not capped. So the air return did not work; in the summer the A/C stayed above the ceiling tiles, and in the winter the heat did as well. The server closet during the winter could easily hit 40C. There were no more circuits available anywhere in the leased space, so we couldn't even get a portable A/C in there.
We experienced four, count them, four catastrophic failures at that site. The owners knew why, but we were helpless against it. So we built out a highly available system using two servers, third-party products, and a really good backup set (BUE failed us so we moved to StorageCraft ShadowProtect, which has been flawless to date).
There's statistics. Then there's d*mned statistics.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
Parallel rebuild into free pool space rebuilds that dead disk across all members in the pool, so 2TB gets done in minutes instead of hours or days. Plus, once that dead drive's contents are rebuilt into free pool space, as long as there is more free pool space to be had (size of disk + ~150GB), another disk failure can happen and the pool still maintains its two-disk resilience (Dual Parity or 3-Way Mirror).
RAID can't do that for us.
Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.
Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.
Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.
We're seeing the same thing in solid-state now too. As SSD vendors deliver larger and larger capacity devices, the write speeds all of a sudden become a limiting factor. Go figure. :S
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
We're seeing the same thing in solid-state now too. As SSD vendors deliver larger and larger capacity devices, the write speeds all of a sudden become a limiting factor. Go figure. :S
Yes, RAID is unlikely to ever make a large comeback. The scale of storage in the future simply makes device-centric protection ineffective long term.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
(BUE failed us so we moved to StorageCraft ShadowProtect, which has been flawless to date).
What is BUE?
-
@FATeknollogee said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
(BUE failed us so we moved to StorageCraft ShadowProtect, which has been flawless to date).
What is BUE?
Backup Exec, from Symantec.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls in line with what we've experienced failure-wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional, unless your discovery was that data loss simply didn't affect you and that increasing risk was okay in order to save money by needing fewer disks?
The statistics for double disk failures. The rebuild throws an extra amount of stress on the failed RAID 10 drive's mirror partner, which places an extra amount of risk on the table. The number of times the same thing happened in a RAID 1 setting was also a factor.
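For what it's worth, that added stress can be sketched numerically (the baseline rate and the stress multiplier are assumptions): even under a heavy wear penalty, the exposure window of a mirror copy is short, so the per-rebuild risk stays small and has to be weighed against the longer, whole-array exposure of a parity rebuild.

```python
# Risk to a RAID 10 mirror partner during its rebuild window.
# The baseline failure rate and stress multiplier are assumptions.

AFR = 0.03             # assumed annual failure rate per drive
STRESS_FACTOR = 10     # assumed wear multiplier on the partner while rebuilding
REBUILD_HOURS = 4      # a mirror rebuild is one sequential disk copy

window_fraction = REBUILD_HOURS / (365 * 24)
p_partner_dies = AFR * STRESS_FACTOR * window_fraction
print(f"P(partner dies during rebuild): {p_partner_dies:.4%}")   # ~0.0137%
```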