Safe to have a 48TB Windows volume?
-
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
-
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to get brought to production levels with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever, reputable scenarios in correct use cases.
-
https://docs.microsoft.com/en-us/windows-server/storage/refs/refs-overview
All I need from it is g2g.
-
Running chkdsk on that volume can take days.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
If you're talking about why 2019 (and Windows 10 1809) were pulled, that data loss has nothing to do with REFS. Additionally, REFS was removed from Windows 10 for all versions except workstation.
-
@Dashrender said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
If you're talking about why 2019 (and Windows 10 1809) were pulled, that data loss has nothing to do with REFS. Additionally, REFS was removed from Windows 10 for all versions exception workstation.
I never said it did. Why would it need to be? There are issues with Microsoft and storage in general, problems with ReFS in general, and problems with 2019 in regards to storage. What more do you need to be wary?
-
@Obsolesce said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
It's been great in my experience. Though, I am using it in such a way the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you seen results in data loss. I haven't seen that anywhere, so only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever, reputable scenarios in correct use cases.
The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means no matter how many people are having "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready to normal people as a baseline.
But even systems that lose data 90% of the time, work perfectly for 10% of people.
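To put rough numbers on that point, here is a minimal sketch. It assumes failures are independent and that every deployment loses data in a given year with the same fixed probability. Reading "seven nines" as an annual loss probability of 1e-7 is an assumption, and the other rates are made up for illustration; none of them are measurements of ReFS or any other product.

```python
def share_with_clean_record(annual_loss_probability: float, years: int) -> float:
    """Fraction of deployments expected to see no data loss over `years`.

    Assumes independent years with a constant loss probability (illustrative only).
    """
    return (1.0 - annual_loss_probability) ** years


# Hypothetical annual loss probabilities, chosen only to show the effect.
scenarios = [
    ("seven nines (assumed 1e-7 per year)", 1e-7),
    ("1% per year", 0.01),
    ("20% per year", 0.20),
]

for label, p in scenarios:
    share = share_with_clean_record(p, years=3)
    print(f"{label}: {share:.4%} of deployments see no loss in 3 years")
```

Under these assumptions, even the 20% per year case leaves roughly half of all deployments with a spotless three-year record, which is why a pile of "works fine for me" reports cannot separate a good filesystem from a bad one.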
-
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS is supported for production workloads on Storage Spaces Direct and Storage Spaces. With the Server 2019 ReFS generation Microsoft has relented to some degree and stated that ReFS can be used on SAN, but for archival purposes only. No workloads on SAN. Period.
There are a lot of features within ReFS that need to reach in a lot deeper thus the Storage Spaces/Storage Spaces Direct requirement.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
It's been great in my experience. Though, I am using it in such a way the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you seen results in data loss. I haven't seen that anywhere, so only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever, reputable scenarios in correct use cases.
The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means no matter how many people having "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready to normal people as a baseline.
But even systems that lose data 90% of the time, work perfectly for 10% of people.
The problem I have with this perspective is that some of us have more direct contacts with folks that have had their SAN storage blow up on them but nothing gets seen in the public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.
There is no solution out there that's perfect. None. Nada. Zippo. Zilch.
All solutions blow up, have failures, lose data, and outright stop working.
Thus, in my mind citing up-time, reliability, or any other such statistic is a moot point. It's essentially useless.
The reality for me is, and maybe my perspective is coloured by the fact that I've been on so many calls over the years with the other end being at their wit's end with a solution that has blown-up on them, no end of marketing fluff promoting a product as being five nines or whatever has an ounce/milligram of credibility to stand on. None.
The only answer that has any value to me at this point is this: Are the backups taken test restored to bare-metal or bare-hypervisor? Has your hyper-scale whatever been tested to failover without data loss?
The answer to the first question is a percentage I'm interested in and could probably guess. We all know the answer to the second question as there have been many public cloud data loss situations over the years.
[/PONTIFICATION]
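For the file-level part of a test restore, a minimal sketch of one way to check the result: hash every file in the source tree and in the restored tree and report the differences. The function names here are hypothetical, and this only covers file contents. It says nothing about whether a bare-metal or bare-hypervisor restore actually boots, or about permissions and other metadata.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(source: Path, restored: Path) -> dict[str, list[str]]:
    """Report files missing from, extra in, or changed in the restored tree."""
    src, dst = tree_digests(source), tree_digests(restored)
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "extra": sorted(dst.keys() - src.keys()),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

Calling `compare_restore(Path(source_dir), Path(restored_dir))` returns three lists; a clean test restore leaves all of them empty, provided the source did not change between the backup and the comparison.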
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
It's been great in my experience. Though, I am using it in such a way the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you seen results in data loss. I haven't seen that anywhere, so only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever, reputable scenarios in correct use cases.
The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means no matter how many people having "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready to normal people as a baseline.
But even systems that lose data 90% of the time, work perfectly for 10% of people.
The problem I have with this perspective is that some of us have more direct contacts with folks that have had their SAN storage blow up on them but nothing gets seen in the public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.
There is no solution out there that's perfect. None. Nadda. Zippo. Zilch.
All solutions blow up, have failures, lose data, and outright stop working.
Thus, in my mind citing up-time, reliability, or any other such statistic is a moot point. It's essentially useless.
Not at all. Reliability stats are SUPER important. There's a ton of value. When we are dealing with systems expecting durability like this, those stats tell us a wealth of information. You can't dismiss the only data we have on reliability. It's far from useless.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
The reality for me is, and maybe my perspective is coloured by the fact that I've been on so many calls over the years with the other end being at their wit's end with a solution that has blown-up on them, no end of marketing fluff promoting a product as being five nines or whatever has an ounce/milligram of credibility to stand on. None.
Agreed, but that's why knowing that stuff can't be five nines, based on the stats we've collected, is so important.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
The only answer that has any value to me at this point is this: Are the backups taken test restored to bare-metal or bare-hypervisor? Has your hyper-scale whatever been tested to failover without data loss?
I think this is a terrible approach. This leads to creating systems that mathematically or statistically we'd expect to fail. If this was our true thought process, we'd skip tried and true systems like RAID, because we'd not trust them (even with studies that show how reliable they are) because we are connecting them to some unethical SAN vendor who made false reliability stats and hides all failures from the public to trick us. We can't allow an emotional reaction to having sales people try to trick us with clearly false data lead us to do something dangerous.
There is a lot of real, non-vendor, information out there in the industry. And a lot of just common sense. And some real studies on reliability that are actually based on math. We don't have to be blind or emotional. With good math, observation, elimination of marketing information, logic, and common sense... we can have a really good starting point. Are we still partially blind? Of course. But can we start from an educated point with a low level of risk? Absolutely.
Basically, just because you can still have an accident doesn't mean that you shouldn't keep wearing your seatbelt and avoid hitting pot holes.
-
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
It's been great in my experience. Though, I am using it in such a way the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you seen results in data loss. I haven't seen that anywhere, so only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever, reputable scenarios in correct use cases.
The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means no matter how many people having "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready to normal people as a baseline.
But even systems that lose data 90% of the time, work perfectly for 10% of people.
The problem I have with this perspective is that some of us have more direct contacts with folks that have had their SAN storage blow up on them but nothing gets seen in the public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.
There is no solution out there that's perfect. None. Nadda. Zippo. Zilch.
All solutions blow up, have failures, lose data, and outright stop working.
Thus, in my mind citing up-time, reliability, or any other such statistic is a moot point. It's essentially useless.
Not at all. Reliability stats are SUPER important. There's ton of value. When we are dealing with systems expecting durability like this, those stats tell us a wealth of information. You can't dismiss the only data we have on reliability. It's far from useless.
BackBlaze is probably the only vendor I can think of that has told the drive vendors to take a flying leap and published what I consider to be real reliability statistics.
There are vendors, VMware for vSAN and Nutanix come to mind, that have specific NDAs in place that block any mention of their product's reliability and performance.
Drive vendors also have a similar clause but note BackBlaze.
Other than BackBlaze, the reliability statistics that I can find reliable are the ones that we have based on all of the solution sets we've built and deployed or worked with over the years. Those numbers tell a pretty good story. But, so too do the statistics that come about as result of the aforementioned panicked phone call.
Anything else in the public sphere has about the same weight as CRN, PCMag, ConsumerReports, or any other marketing fluff type.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.
We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after we replaced the first and a rebuild had started. We'll stick with RAID 6, thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@Obsolesce said in Safe to have a 48TB Windows volume?:
@jim9500 said in Safe to have a 48TB Windows volume?:
Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?
I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.
REFS on 2019 is what I would wait for, for bare file storage.
Are you on 2019 now or looking to move off of a Windows file server?
ReFS has a bad track record. It's got a future, but has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently, even if ReFS is supposed to get brought to production levels with 2019, 2019 is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.
It's been great in my experience. Though, I am using it in such a way the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you seen results in data loss. I haven't seen that anywhere, so only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever, reputable scenarios in correct use cases.
The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means no matter how many people having "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready to normal people as a baseline.
But even systems that lose data 90% of the time, work perfectly for 10% of people.
The problem I have with this perspective is that some of us have more direct contacts with folks that have had their SAN storage blow up on them but nothing gets seen in the public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.
There is no solution out there that's perfect. None. Nadda. Zippo. Zilch.
All solutions blow up, have failures, lose data, and outright stop working.
Thus, in my mind citing up-time, reliability, or any other such statistic is a moot point. It's essentially useless.
Not at all. Reliability stats are SUPER important. There's ton of value. When we are dealing with systems expecting durability like this, those stats tell us a wealth of information. You can't dismiss the only data we have on reliability. It's far from useless.
BackBlaze is probably the only vendor I can think of that has told the drive vendors to take a flying leap and published what I consider to be real reliability statistics.
There are vendors, VMware for vSAN and Nutanix come to mind, that have specific NDAs in place that block any mention of their product's reliability and performance.
Drive vendors also have a similar clause but note BackBlaze.
Other than BackBlaze, the reliability statistics that I can find reliable are the ones that we have based on all of the solution sets we've built and deployed or worked with over the years. Those numbers tell a pretty good story. But, so too do the statistics that come about as result of the aforementioned panicked phone call.
Anything else in the public sphere has about the same weight as CRN, PCMag, ConsumerReports, or any other marketing fluff type.
Needing someone else to do a study for you is part of the issue. I myself have done the largest RAID study I've ever heard of (over 80,000 array years.) And we don't need a third party to do studies of some SAN systems, for example.
Sure, there are loads of things we have to be blind to. But there is a ton that we know, and a ton that we can reasonably extrapolate.
We have a lot more information than people give us credit for. But people tend to focus on the lack of big vendors doing big studies, which sadly are just impossible to have happen. We expect reliability rates so high that often you can't study them on products, ever. We simply don't make and run products long enough for even the vendors to know.
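A rough sketch of why that is, using the standard zero-failure confidence bound (the "rule of three"). It assumes independent unit-years with a constant annual failure probability, and it treats "seven nines" as a maximum annual failure rate of 1e-7. Both are simplifying assumptions made for illustration, not figures from any study mentioned in this thread.

```python
import math


def unit_years_needed(max_annual_failure_rate: float, confidence: float = 0.95) -> int:
    """Failure-free unit-years needed to show the rate is at or below the target.

    Solves (1 - p)^n <= 1 - confidence for n, assuming independent unit-years.
    """
    return math.ceil(
        math.log(1.0 - confidence) / math.log1p(-max_annual_failure_rate)
    )


def upper_bound_after_clean_run(unit_years: float, confidence: float = 0.95) -> float:
    """Upper confidence bound on the annual failure rate after zero observed failures."""
    return 1.0 - (1.0 - confidence) ** (1.0 / unit_years)


# Hypothetical targets, from a loose 1% per year down to an assumed seven nines.
for target in (1e-2, 1e-5, 1e-7):
    print(f"target {target:g} per year -> {unit_years_needed(target):,} clean unit-years")
```

Under these assumptions the seven nines target needs on the order of thirty million failure-free unit-years, which is more than most products will ever accumulate in the field before they are replaced.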
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.
We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.
Had you used RAID 6, it might have failed too, possibly worse, and we'd be having the opposite conversation about how you can never trust RAID 6.
Bottom line, using individual anecdotes for answers is the one thing we know is bad to do.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.
We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.
Apparently he needed the 3-drive RAID 10 pairs that the other guy was running.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.
We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
The fact that you, and possibly your org, have actually studied things is important to the discussion.
We've had enough double disk failures over time to have influenced the decision to drop RAID 5. The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls in line with what we've experienced failure-wise across the board.