Safe to have a 48TB Windows volume?



  • @jim9500 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    What's the air-gap to protect against an encryption event if any?

    What's the air-gap to protect against an encryption event if any?

    My backup server has access to the rest of the network, but it pulls the backups to itself rather than having backups pushed to it. The rest of the network can't directly write to it. My backups happen weekly, so my hope is that I would recognize what was happening to my live network before it was backed up.

    I have been contemplating doubling my backup storage space to make sure I have enough space to store older file revisions in a ransomware situation.

    Might be a good idea. Although at that size, encryption would take a very long time.
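    Back-of-the-envelope, that holds up. A minimal sketch of the arithmetic (the sustained throughput figure is an assumption, and real ransomware often encrypts only part of each file, which is much faster):

```python
# Rough estimate of full-volume encryption time at an assumed sustained
# read+write rate. Treat this as an upper-bound illustration only.
volume_tb = 48
throughput_mb_s = 200  # hypothetical sustained throughput

total_mb = volume_tb * 1024 * 1024   # TiB -> MiB
seconds = total_mb / throughput_mb_s
days = seconds / 86400
print(f"~{days:.1f} days to rewrite {volume_tb} TB at {throughput_mb_s} MB/s")
```

    Roughly three days at that rate, so weekly pull backups would plausibly capture at least one clean revision before a slow full-volume encryption completed.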



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    It seems like I remember Scott Miller talking about combining enterprise hardware + SAS/SATA Controller + Linux for storage requirements vs proprietary hardware raid controller.

    @Donahue - Yes. I have a similar setup offsite backup several miles away for disaster recovery / hardware failure etc. I know raid != backups.

    What's the air-gap to protect against an encryption event if any?

    LOL. I like that term. "Encryption Event"

    It implies, quite correctly, that many of those problems are not exactly malware. Many are just bad system design.

    Indeed. We've "heard" of cloud vendors that have lost both their own and their tenants' environments due to an encryption event, which implies improper setup and procedures.

    As far as the backup server pulling the data onto itself goes, one needs to make sure no credentials are saved anywhere. All it takes is one lazy tech doing so and the baddies are in. Rotating that password regularly would help stem that risk.

    Gostev (Veeam) has a regular newsletter and mentioned that offlining the backup server, having it fire up to do its pulls and then shut itself back down once done, would be one way of maintaining an air-gap.

    EDIT: Setting the "do not allow passwords to be saved" policy for Remote Desktop in the local Group Policy editor would work too.



  • @jim9500 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    What's the air-gap to protect against an encryption event if any?

    What's the air-gap to protect against an encryption event if any?

    My backup server has access to the rest of the network, but it pulls the backups to itself rather than having backups pushed to it. The rest of the network can't directly write to it. My backups happen weekly, so my hope is that I would recognize what was happening to my live network before it was backed up.

    I have been contemplating doubling my backup storage space to make sure I have enough space to store older file revisions in a ransomware situation.

    Is it a backup or just a copy? If it's a backup (thinking something like Veeam here), then keeping multiple backup copies on the backup server won't need, say, double the space for two full copies; it will need roughly the amount of typical change between backups. Though I'd go for twice that difference, so you can take a backup, add the second backup, add a third, then delete the second, etc. That way you end up with two 'copies' on the backup server at all times.
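    The sizing logic above can be sketched numerically (the change rate and sizes here are hypothetical inputs, not Veeam figures):

```python
# Space needed for one full backup plus incrementals, keeping two
# usable restore points at all times. All inputs are assumptions.
full_tb = 48.0
change_rate = 0.05   # assumed fraction of data changed per backup cycle
restore_points = 2

incremental_tb = full_tb * change_rate
# One extra increment of headroom so a new backup can complete
# before the oldest increment is deleted.
required_tb = full_tb + incremental_tb * (restore_points + 1)
print(f"~{required_tb:.1f} TB repository, vs {full_tb * 2:.0f} TB for two fulls")
```

    Under these assumptions the repository needs only a little over the size of one full, not double it.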



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    NTFS has improved a lot over the years. This is definitely a big volume for NTFS to handle. ZFS is better designed for volumes of this size.

    You are correct: with your triple-mirrored (and hot spare!) setup, it's your filesystem, not your array, that you have to worry about. You have definitely managed to shift the risk from the RAID to the FS.

    This isn't insanely big, but having Windows manage storage always gives me a little moment of pause. Storage is not their strong suit and has weakened, rather than improved, in recent years. ReFS has had issues, the recent releases have had their own issues even with NTFS, and their software RAID has had big-time issues (you aren't using that here, so it's not applicable either). This is just generally an area that Microsoft struggles with and doesn't tend to see as critical, so they seem to mostly pooh-pooh reliability concerns to focus on other areas.

    If I were doing storage this large, I would almost certainly be using XFS on hardware RAID, based on your setup. XFS is faster than NTFS and pretty much bulletproof.

    I agree. Last place I worked we did 96TB arrays on RAID 10 with XFS.



  • @scottalanmiller I somehow missed this reply. This is the answer I was looking for. The great news is that my hardware will likely stay (almost) the same when I need to upgrade.



  • @Dashrender said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    What's the air-gap to protect against an encryption event if any?

    What's the air-gap to protect against an encryption event if any?

    My backup server has access to the rest of the network, but it pulls the backups to itself rather than having backups pushed to it. The rest of the network can't directly write to it. My backups happen weekly, so my hope is that I would recognize what was happening to my live network before it was backed up.

    I have been contemplating doubling my backup storage space to make sure I have enough space to store older file revisions in a ransomware situation.

    Is it a backup or just a copy?

    There isn't a difference. Backups are just decoupled copies.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @DustinB3403 said in Safe to have a 48TB Windows volume?:

    Doesn't ntfs have a limit of 16TB per volume?

    NTFS volume limit is 256TB in older systems.

    NTFS has an 8PB volume limit in modern ones.

    The one caveat to NTFS volumes as far as size goes is the 64TB limit for Volume Shadow Copy snapshots, and a lot of products rely on VSS.
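    That ceiling is easy to check for programmatically before a volume grows past it. A minimal sketch (the 64 TB figure is the VSS snapshot limit noted above; the volume path is a placeholder):

```python
import shutil

# Flag volumes too large for Volume Shadow Copy snapshots.
VSS_LIMIT_BYTES = 64 * 2**40  # 64 TiB

def vss_snapshot_ok(volume_path: str) -> bool:
    """True if the volume's total size fits under the VSS snapshot limit."""
    return shutil.disk_usage(volume_path).total <= VSS_LIMIT_BYTES
```

    So a 48TB volume clears the limit today, but doubling it later would silently break VSS-based backup products.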



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @DustinB3403 said in Safe to have a 48TB Windows volume?:

    Doesn't ntfs have a limit of 16TB per volume?

    NTFS volume limit is 256TB in older systems.

    NTFS has an 8PB volume limit in modern ones.

    The one caveat to NTFS volumes as far as size goes is the 64TB limit for Volume Shadow Copy snapshots, and a lot of products rely on VSS.

    Major caveat there!



  • @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?



  • @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever; reputable scenarios in correct use cases.





  • Running a chkdsk on that volume can take days.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    If you're talking about why 2019 (and Windows 10 1809) were pulled, that data loss has nothing to do with REFS. Additionally, REFS was removed from Windows 10 for all editions except Workstation.



  • @Dashrender said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    If you're talking about why 2019 (and Windows 10 1809) were pulled, that data loss has nothing to do with REFS. Additionally, REFS was removed from Windows 10 for all editions except Workstation.

    I never said it did. Why would it need to? There are issues with Microsoft and storage in general, problems with ReFS in general, and problems with 2019 in regards to storage. What more do you need to be wary of?



  • @Obsolesce said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever; reputable scenarios in correct use cases.

    The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means that no matter how many people have "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready as a baseline.

    But even systems that lose data 90% of the time work perfectly for 10% of people.
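    To put numbers on that (the fleet size is a hypothetical): at seven-nines durability, even a large observed fleet should produce almost no loss events, so individual success stories are exactly what you'd see whether or not the product actually meets the bar.

```python
# At seven nines, expected loss events are tiny even across a big fleet,
# so "it worked for me" carries almost no information either way.
nines = 7
loss_prob_per_volume_year = 10**-nines   # 1e-7
fleet_volume_years = 100_000             # hypothetical observation window

expected_losses = fleet_volume_years * loss_prob_per_volume_year
print(f"Expected loss events: {expected_losses:.2f}")
```

    It's only when failures show up anyway, against odds like these, that the anecdotes become meaningful.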



  • @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS is supported for production workloads on Storage Spaces Direct and Storage Spaces. With the Server 2019 generation of ReFS, Microsoft has relented to some degree and stated that ReFS can be done on a SAN, but for archival purposes only. No workloads on SAN. Period.

    There are a lot of features within ReFS that need to reach a lot deeper into the storage stack, thus the Storage Spaces/Storage Spaces Direct requirement.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever; reputable scenarios in correct use cases.

    The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means that no matter how many people have "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready as a baseline.

    But even systems that lose data 90% of the time work perfectly for 10% of people.

    The problem I have with this perspective is that some of us have direct contacts with folks who have had their SAN storage blow up on them, yet nothing is ever seen in public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.

    There is no solution out there that's perfect. None. Nada. Zippo. Zilch.

    All solutions blow up, have failures, lose data, and outright stop working.

    Thus, in my mind, citing uptime, reliability, or any other such statistic is a moot point. It's essentially useless.

    The reality for me is, and maybe my perspective is coloured by the many calls I've been on over the years with the other end at their wit's end over a solution that has blown up on them, that no amount of marketing fluff promoting a product as five nines or whatever has an ounce/milligram of credibility to stand on. None.

    The only answer that has any value to me at this point is this: Are the backups being test-restored to bare metal or bare hypervisor? Has your hyper-scale whatever been tested to fail over without data loss?

    The answer to the first question is a percentage I'm interested in and could probably guess. We all know the answer to the second question as there have been many public cloud data loss situations over the years.

    [/PONTIFICATION]



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever; reputable scenarios in correct use cases.

    The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means that no matter how many people have "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready as a baseline.

    But even systems that lose data 90% of the time work perfectly for 10% of people.

    The problem I have with this perspective is that some of us have direct contacts with folks who have had their SAN storage blow up on them, yet nothing is ever seen in public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.

    There is no solution out there that's perfect. None. Nada. Zippo. Zilch.

    All solutions blow up, have failures, lose data, and outright stop working.

    Thus, in my mind, citing uptime, reliability, or any other such statistic is a moot point. It's essentially useless.

    Not at all. Reliability stats are SUPER important. There's a ton of value there. When we are dealing with systems expecting durability like this, those stats tell us a wealth of information. You can't dismiss the only data we have on reliability. It's far from useless.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    The reality for me is, and maybe my perspective is coloured by the many calls I've been on over the years with the other end at their wit's end over a solution that has blown up on them, that no amount of marketing fluff promoting a product as five nines or whatever has an ounce/milligram of credibility to stand on. None.

    Agreed, but that's why knowing that stuff can't be five nines, based on the stats we've collected, is so important.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    The only answer that has any value to me at this point is this: Are the backups being test-restored to bare metal or bare hypervisor? Has your hyper-scale whatever been tested to fail over without data loss?

    I think this is a terrible approach. It leads to creating systems that mathematically or statistically we'd expect to fail. If this were our true thought process, we'd skip tried-and-true systems like RAID, not trusting them (even with studies that show how reliable they are) just because some unethical SAN vendor made up false reliability stats and hid all failures from the public to trick us. We can't allow an emotional reaction to sales people trying to trick us with clearly false data to lead us into doing something dangerous.

    There is a lot of real, non-vendor information out there in the industry. And a lot of just common sense. And some real studies on reliability that are actually based on math. We don't have to be blind or emotional. With good math, observation, elimination of marketing information, logic, and common sense... we can have a really good starting point. Are we still partially blind? Of course. But can we start from an educated point with a low level of risk? Absolutely.

    Basically, just because you can still have an accident doesn't mean that you shouldn't keep wearing your seatbelt and avoiding potholes.



  • Some examples of things the math tells us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know, by collecting anecdotes and knowing the total sales figures, that the failure rate among the units we've observed alone is too high for the entire set of products made, and we can safely assume that the number we have not observed is vastly higher. Observation alone tells us that the reliability is not high enough for any production use.
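    The reasoning in the P2000 example can be made concrete (all counts here are hypothetical): failures you happen to hear about, divided by total units sold, put a floor under the true failure rate, since unreported failures only push it higher.

```python
# Anecdotes as a lower bound: observed failures can only undercount.
observed_failures = 30        # failures heard about directly (assumed)
total_units_sold = 10_000     # from vendor sales figures (assumed)

floor_rate = observed_failures / total_units_sold
print(f"Failure rate is at least {floor_rate:.2%} of all units")
```

    Even the floor here is orders of magnitude above any "nines" durability target, before counting a single unreported failure.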



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever; reputable scenarios in correct use cases.

    The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means that no matter how many people have "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready as a baseline.

    But even systems that lose data 90% of the time work perfectly for 10% of people.

    The problem I have with this perspective is that some of us have direct contacts with folks who have had their SAN storage blow up on them, yet nothing is ever seen in public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.

    There is no solution out there that's perfect. None. Nada. Zippo. Zilch.

    All solutions blow up, have failures, lose data, and outright stop working.

    Thus, in my mind, citing uptime, reliability, or any other such statistic is a moot point. It's essentially useless.

    Not at all. Reliability stats are SUPER important. There's a ton of value there. When we are dealing with systems expecting durability like this, those stats tell us a wealth of information. You can't dismiss the only data we have on reliability. It's far from useless.

    BackBlaze is probably the only vendor I can think of that has told the drive vendors to take a flying leap and published what I consider to be real reliability statistics.

    There are vendors, VMware for vSAN and Nutanix come to mind, that have specific NDA clauses in place blocking any mention of their product's reliability and performance.

    Drive vendors have a similar clause, but note BackBlaze.

    Other than BackBlaze's, the only reliability statistics I find credible are the ones based on all of the solution sets we've built, deployed, or worked with over the years. Those numbers tell a pretty good story. But so too do the statistics that come about as a result of the aforementioned panicked phone calls.

    Anything else in the public sphere carries about the same weight as CRN, PCMag, Consumer Reports, or any other marketing-fluff outlet.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    Some examples of things the math tells us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know, by collecting anecdotes and knowing the total sales figures, that the failure rate among the units we've observed alone is too high for the entire set of products made, and we can safely assume that the number we have not observed is vastly higher. Observation alone tells us that the reliability is not high enough for any production use.

    We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after the first was replaced and a rebuild had initiated. We'll stick with RAID 6, thanks.

    EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened, and we were both, WTF?



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @Obsolesce said in Safe to have a 48TB Windows volume?:

    @jim9500 said in Safe to have a 48TB Windows volume?:

    Have any of you used 48TB Windows volumes? Any resources on risk analysis vs ZFS?

    I have two that are close to 60 TB. But they are REFS and hold a lot of large virtual disks.

    REFS on 2019 is what I would wait for, for bare file storage.

    Are you on 2019 now or looking to move off of a Windows file server?

    ReFS has a bad track record. It's got a future, but it has been pretty lacking and presents a bit of risk. Microsoft has had a disastrous track record with storage recently; even if ReFS is supposed to be brought up to production quality with 2019, 2019 itself is questionably production ready. Remember... data loss is why it was pulled out of production in the first place.

    It's been great in my experience. Though, I am using it in such a way that the risk is worth the benefits... replication and backup repositories. It's been 100% solid. And like I said, it's all huge files stored on it, and probably not the use case that you've seen result in data loss. I haven't seen that anywhere, so I'm only taking your word for it unless you have links for me to do some reading. Not dumb stuff from Tom's or whatever; reputable scenarios in correct use cases.

    The problem with storage is that we expect durability of something like seven nines as a "minimum" for being production ready. That means that no matter how many people have "good experiences" with it, that tells us nothing. It's the people having issues with it that matter. And ReFS lacks the stability, safety, and recoverability necessary for it to be considered production ready as a baseline.

    But even systems that lose data 90% of the time work perfectly for 10% of people.

    The problem I have with this perspective is that some of us have direct contacts with folks who have had their SAN storage blow up on them, yet nothing is ever seen in public. One that does come to mind is the Australian Government's very public SAN blow-out a few years ago.

    There is no solution out there that's perfect. None. Nada. Zippo. Zilch.

    All solutions blow up, have failures, lose data, and outright stop working.

    Thus, in my mind, citing uptime, reliability, or any other such statistic is a moot point. It's essentially useless.

    Not at all. Reliability stats are SUPER important. There's a ton of value there. When we are dealing with systems expecting durability like this, those stats tell us a wealth of information. You can't dismiss the only data we have on reliability. It's far from useless.

    BackBlaze is probably the only vendor I can think of that has told the drive vendors to take a flying leap and published what I consider to be real reliability statistics.

    There are vendors, VMware for vSAN and Nutanix come to mind, that have specific NDAs in place that block any mention of their product's reliability and performance.

    Drive vendors also have a similar clause but note BackBlaze.

    Other than BackBlaze, the reliability statistics that I can find reliable are the ones that we have based on all of the solution sets we've built and deployed or worked with over the years. Those numbers tell a pretty good story. But, so too do the statistics that come about as result of the aforementioned panicked phone call.

    Anything else in the public sphere carries about the same weight as CRN, PCMag, Consumer Reports, or any other marketing fluff.

    Needing someone else to do a study for you is part of the issue. I myself have done the largest RAID study I've ever heard of (over 80,000 array years), and we don't need a third party to do studies of some SAN systems, for example.
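    To illustrate what a study of that scale can establish, here is a rough "rule of three" sketch. The zero-failure count is hypothetical, purely for illustration, and is not a claim about the actual study's results:

    ```python
    # "Rule of three" sketch: what ~80,000 array-years of observation can bound.
    # The failure count is hypothetical (assume zero array losses were seen);
    # it is NOT a claim about the actual study's results.

    array_years = 80_000
    failures = 0  # assumed for illustration

    if failures == 0:
        # With zero events in n trials, the 95% upper confidence bound on the
        # per-array-year failure rate is approximately 3/n.
        upper_95 = 3 / array_years
        print(f"95% upper bound on annual array-loss rate: {upper_95:.2e}")
    ```

    Even with no failures at all, a sample that large can only bound the annual loss rate to roughly 4 in 100,000, which shows both how much such a study tells us and why vendors almost never accumulate enough field time to demonstrate the durability levels we expect.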

    Sure, there are loads of things we have to be blind to. But there is a ton that we know, and a ton that we can reasonably extrapolate.

    We have a lot more information than people give us credit for. But people tend to focus on the lack of big vendors doing big studies, which sadly are just not feasible. We expect reliability rates so high that often you can't study them on products, ever. We simply don't make and run products long enough for even the vendors to know.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    Some examples of things we have math to tell us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.

    We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after the first was replaced and a rebuild initiated. We'll stick with RAID 6, thanks.

    EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?

    See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?

    And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.

    Had you used RAID 6, it might have failed too, possibly worse, and we'd be having the opposite conversation about how you can never trust RAID 6.

    Bottom line, using individual anecdotes for answers is the one thing we know is bad to do.



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    Some examples of things we have math to tell us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.

    We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after the first was replaced and a rebuild initiated. We'll stick with RAID 6, thanks.

    EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?

    See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?

    And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.

    Apparently he needed the three-drive RAID 10 pairs that other guy was running. 😉



  • @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    Some examples of things we have math to tell us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.

    We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after the first was replaced and a rebuild initiated. We'll stick with RAID 6, thanks.

    EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?

    See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?

    The fact that you, and possibly your org, have actually studied things is important to the discussion.

    We've had enough double-disk failures over time to have influenced the decision to drop RAID 5. The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls in line with what we've experienced, failure-wise, across the board.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    @PhlipElder said in Safe to have a 48TB Windows volume?:

    @scottalanmiller said in Safe to have a 48TB Windows volume?:

    Some examples of things we have math to tell us are good or bad...

    RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
    Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.

    We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after the first was replaced and a rebuild initiated. We'll stick with RAID 6, thanks.

    EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?

    See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?

    The fact that you, and possibly your org, have actually studied things is important to the discussion.

    I've published about it and speak about it all the time. The study was massive and took forever, as you can imagine.



  • @PhlipElder said in Safe to have a 48TB Windows volume?:

    The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls in line with what we've experienced, failure-wise, across the board.

    What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?

    How can it be non-emotional unless your discovery was that data loss simply didn't affect you, and that increasing risk was okay in order to save money by needing fewer disks?
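    The kind of math being asked about can be sketched with a toy second-failure model. Every rate and rebuild window below is an assumption chosen for illustration, and the model deliberately omits UREs, rebuild stress, and correlated failures, which dominate real-world comparisons, so this shows the method rather than delivering a verdict:

    ```python
    import math

    # Toy model of the "second failure during rebuild" comparison.
    # All numbers are assumptions; UREs, rebuild stress, and correlated
    # failures are ignored, so this is a sketch of the method, not a verdict.

    AFR = 0.02  # assumed annual failure rate per drive

    def p_drive_fails(hours: float) -> float:
        """Probability a single drive fails within `hours` (exponential model)."""
        return 1 - math.exp(-AFR * hours / (24 * 365))

    # RAID 10 (8 drives): after one failure, only the dead drive's mirror
    # partner is fatal. Mirror rebuild is a straight copy -- assume 6 hours.
    p_raid10 = p_drive_fails(6)

    # RAID 6 (8 drives): after one failure we still tolerate one more; loss
    # needs two further failures among the 7 survivors during a longer parity
    # rebuild (assume 48 hours). Binomial tail for >= 2 failures:
    p1 = p_drive_fails(48)
    p_raid6 = sum(math.comb(7, k) * p1**k * (1 - p1) ** (7 - k)
                  for k in range(2, 8))

    print(f"RAID 10 loss probability after first failure: {p_raid10:.2e}")
    print(f"RAID 6  loss probability after first failure: {p_raid6:.2e}")
    ```

    The point of such a model is that the comparison hinges on which events can actually kill the array and how long the exposure window is, not on any single observed failure. Changing the assumed rebuild times or adding URE exposure can swing the result either way, which is exactly why one anecdote settles nothing.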

