Responding to "This BS called URE" from Synology Forums
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.
If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real-world testing.
I want to confirm - you've seen situations where a 4 TB drive has been filled, then read back 4 times and it fails - regularly?
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
So what does this URE really mean? If you read a single sector on a disk 10^14 times, statistically the vast majority of disks will start to fail. Now it's a statistic, that means you are dealing with a bell curve that will have some disks failing well before that point, and some disks failing well after that point, but the vast majority of disks given a sizeable number of tests will start to fail around that point.
He claims a bell curve. But atomic deterioration does not really have a traditional bell curve. It's far more predictable than that implies. But even if we get a more or less traditional bell, that doesn't change anything statistically. That's all part of the math we are already using.
-
@dashrender said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.
If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real-world testing.
I want to confirm - you've seen situations where a 4 TB drive has been filled, then read back 4 times and it fails - regularly?
Absolutely, everyone has. It's so common even non-IT people are used to it.
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
But your disk has billions upon billions of sectors, and each and every one of them has its own URE. So that is why your disk does not fail and seems to keep on working day after day.
Right, disk failures are by each sector, and that's all. Hence why people just live with them and don't bother protecting against them in most cases. A single sector failure is pretty low risk to a normal computer user. This is all exactly as all URE discussions have said. He's not revealing something new, just pointing out the obvious. The drive doesn't fail, one sector gets a URE out of billions that get read.
The idea that a disk would fail from a URE is a weird injection that he has added here to make people think that the unknown other party is insane. But no URE discussion anywhere assumes that a disk will fail. That's why disk failure and URE are two different failure conditions entirely.
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
So then you get to your RAID scenario. There you have multiple disks, each with their own URE. And again you can't add a statistic up to give you a smaller number. For instance, if given the hypothetical statistic that you had a 1 in 10 chance of a single hard drive failing, that would not mean that if you put 10 drives into a RAID that one of those drives was defective.
This goes totally off of the rails. No one is discussing hard drives failing or defects. UREs are not defects. Hitting a URE is not a drive failure. The URE rate is the rate at which perfectly good, healthy drives have sector-level errors from which recovery is not possible. Hitting a URE is part of drive usage, it is not a defect. It IS an error, but storage error rates are a rate, not a failure of the overall product. It would be like saying your spark plug misfires once every billion cycles, then keeps right on going. That does not mean your engine has failed or that the spark plug is defective. There is just a 1/x chance that it won't fire that one time. That's just how all things work.
Now if you have a 1/10 chance of a hard drive currently having a URE on it (it doesn't work that way at all, UREs aren't an on-disk artefact, but let's just humor him) and you put ten of those drives into an array, then you expect, with high confidence, that there is a URE lurking somewhere in that array. Obviously. This isn't hard math at that point. He seems really lost here.
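A quick sketch of that arithmetic, using his hypothetical 1-in-10 figure (which, as noted above, is not how UREs actually behave on disk):

```python
# Hypothetical 1-in-10 per-drive chance, taken from the quoted scenario.
# This is purely illustrative; UREs are per-read events, not something
# that sits on a drive.
p_single = 1 / 10
n_drives = 10

# Chance that no drive in the array is affected, then its complement.
p_none = (1 - p_single) ** n_drives
p_at_least_one = 1 - p_none

print(f"chance at least one drive is affected: {p_at_least_one:.1%}")  # ~65.1%
```

So even under his own hypothetical, roughly two out of three such ten-drive arrays would carry the condition somewhere.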
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
So how in any reasonable logic thinking way could Robin Harris get the idea that because a hard drive manufacturer publishes a URE of 10^14 that having 5 drives of 3tb would mean that in a rebuild you will fail? YOU CANNOT ADD STATISTICS TOGETHER!
He repeats "you cannot add statistics together" in the hopes that in doing so, we will think that someone did. But no one did, he's playing to the reader's emotions.
Robin Harris has covered this extensively and all of Robin's numbers give a chance of failure. Exactly the thing that the OP here is alluding to wanting. He just pretends that he didn't get it. Either he's just trying to be a jerk and trick people, or he doesn't see statistical failure as a chance but as a sure thing. Robin always presents the math as "and this is the chance that you will hit the error", always. All of Robin's papers with this math are linked in the MangoLassi RAID link list for reference.
So the real question is: how can you, with any reasonable logic, not see how likely the URE is to be hit with the incredibly large drive sizes that we have today?
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
Stop worrying about someone's misunderstanding of simple mathematics and just start using the devices that you have.
He ends with "don't worry about being IT professionals, don't bother to protect your data, just trust your vendors to magically provide protection that you didn't ask for, pay for, nor did they claim to give."
It's super weird to say to trust the devices here, when the device makers are the ones warning us of the risks!
-
In the second response, Charles Hooper references my paper on the same:
Roadkill401,
Is this the issue that you are describing?
http://www.smbitjournal.com/2012/05/when-no-redundancy-is-more-reliable/

"What happens that scares us during a RAID 5 resilver operation is that an unrecoverable read error (URE) can occur. When it does the resilver operation halts and the array is left in a useless state – all data on the array is lost. On common SATA drives the rate of URE is 10^14, or once every twelve terabytes of read operations. That means that a six terabyte array being resilvered has a roughly fifty percent chance of hitting a URE and failing."
I have a degree in mathematics - but I have been focused on computer technology for roughly the last 20 years (so my mathematics skills are a bit rusty).
I believe that you are correct to a degree. In the above quoted example, there is not a roughly 50 percent chance of hitting a URE and having the array fail during the rebuild (resilver). Just as it is possible to roll a six sided die 10 times and never have the number six come up on top - it is a problem of probability, not straight addition and division. Also keep in mind that a drive's actual URE statistic does not remain constant through the life of the drive - the actual URE statistic decays as the drive ages.
Let's use a simple example that I have posted on the Synology forums before. Consider a four drive RAID 5 (SHR) array composed of 2TB drives. When one drive fails, that RAID 5 array has roughly 48,000,000,000,000 data bits that must be read successfully without a URE for the array to rebuild successfully when the failed drive is replaced. Using just the URE statistic provided by drive manufacturers, drives in this RAID 5 array with a one URE in 10^14 rating have a roughly 38.1% chance of failing to successfully rebuild when the failed drive is replaced. Here is the equation:
(1 - (99,999,999,999,999 / 100,000,000,000,000) ^ 48,000,000,000,000) = 0.380979164

As you stated, drives are read a sector at a time, not a bit at a time. Most drives are now offered with 4KB sector sizes, rather than the older 512 byte sector size, so the drives with a one URE in 10^14 bit rating actually have a one URE in 3,051,757,813 4KB sector rating. In the same four drive RAID 5 array, there are roughly 1,464,843,750 4KB sectors in the non-failed drives. Again there is a roughly 38.1% chance of failing to successfully rebuild when the failed drive is replaced. Here is the equation:
(1 - (3,051,757,812 / 3,051,757,813) ^ 1,464,843,750) = 0.381216604

For comparison, a four drive RAID 10 array composed of 2TB drives has a roughly 14.8% chance of failure during the rebuild:
(1 - (99,999,999,999,999 / 100,000,000,000,000) ^ 16,000,000,000,000) = 14.77%

For arrays with a larger number of drives the difference between RAID 10 and RAID 5 (SHR) becomes even more significant, because only a single other drive in a RAID 10 array must be fully read error free, while in a RAID 5 array all other drives in the array must be fully read error free.
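Charles's figures can be checked directly. A minimal sketch, assuming the vendor's 1-in-10^14 bit URE rate and a 2 TB drive holding 1.6 × 10^13 bits:

```python
URE_RATE = 1e-14  # vendor-quoted chance of a URE per bit read

def rebuild_failure_prob(bits_read: float) -> float:
    """Chance of at least one URE while reading the given number of bits."""
    return 1 - (1 - URE_RATE) ** bits_read

BITS_PER_DRIVE = 2e12 * 8  # a 2 TB drive, expressed in bits

# RAID 5: all three surviving drives must be read without error.
raid5 = rebuild_failure_prob(3 * BITS_PER_DRIVE)
# RAID 10: only the failed drive's single mirror partner is read.
raid10 = rebuild_failure_prob(BITS_PER_DRIVE)

print(f"RAID 5 rebuild failure chance:  {raid5:.1%}")   # ~38.1%
print(f"RAID 10 rebuild failure chance: {raid10:.1%}")  # ~14.8%
```

The output matches the 38.1% and roughly 14.8% figures quoted above.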
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
So this guy trying to claim he knows something about math and computers makes a thread on the Synology forums seven years ago, and @Dashrender found it today. It seems like it never got any great responses, it is bad to leave this kind of misinformation out there, and doing these analyses is always good, so let's break it down. (Since moving from SW to ML, the amount of "correcting dumbassery and people trying to mislead others" has all but disappeared, so we don't get to do this much.)
Why not reply to the original post on Synology? It doesn't make sense to address it here.
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
I believe that you are correct to a degree. In the above quoted example, there is not a roughly 50 percent chance of hitting a URE and having the array fail during the rebuild (resilver). Just as it is possible to roll a six sided die 10 times and never have the number six come up on top - it is a problem of probability, not straight addition and division.
So yes and no. Let's start with the die example. There are the statistics, and there is the "chance". There is a "chance" that you will roll a single die a billion times and never get a six. Yes. Obviously. We all know that. But statistically, you will get it pretty quickly.
Quick stats math...
(5/6) ^ 10 = 9,765,625 / 60,466,176 ≈ 0.16 chance of NOT having it happen, so
1 - 0.16 = 0.84, or an 84% chance of hitting the "die URE" of a six in ten rolls.
That's right, it's not lower than 50% in the dice example, it's higher, a lot higher. Yes, there is still a chance, a decent one, that you won't hit it. But the chances are if you roll a die ten times that you will get a six. Very good chances.
The "roughly 50%" number was based on statistics math. It just happens to be that at around the 50% chance mark additive numbers and statistical numbers are pretty close. They diverge as you leave the top of the bell in either direction, but they are pretty much on top of each other right in the middle. So because the writers here likely don't know statistical math, all they can do is see that additive math would have gotten us into the same ballpark.
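For the ten-roll case from the quote, the exact probability and the naive additive estimate can be compared side by side, which also shows why "adding statistics" breaks down (the additive figure sails past 100%):

```python
# Exact chance of at least one six in ten rolls of a fair die.
p_no_six = (5 / 6) ** 10
p_at_least_one = 1 - p_no_six

# The naive "additive" estimate that the thread warns against.
additive = 10 * (1 / 6)

print(f"exact:    {p_at_least_one:.1%}")  # ~83.8%
print(f"additive: {additive:.1%}")        # ~166.7%, which is nonsense
```

The two only land in the same ballpark near the 50% mark; further out, the additive shortcut falls apart entirely.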
-
@irj said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
So this guy trying to claim he knows something about math and computers makes a thread on the Synology forums seven years ago, and @Dashrender found it today. It seems like it never got any great responses, it is bad to leave this kind of misinformation out there, and doing these analyses is always good, so let's break it down. (Since moving from SW to ML, the amount of "correcting dumbassery and people trying to mislead others" has all but disappeared, so we don't get to do this much.)
Why not reply to the original post on Synology? It doesn't make sense to address it here.
I tried, they don't let you. Since they codified it in their archives, I wanted to make sure it was addressed somewhere, at least.
-
@irj said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
So this guy trying to claim he knows something about math and computers makes a thread on the Synology forums seven years ago, and @Dashrender found it today. It seems like it never got any great responses, it is bad to leave this kind of misinformation out there, and doing these analyses is always good, so let's break it down. (Since moving from SW to ML, the amount of "correcting dumbassery and people trying to mislead others" has all but disappeared, so we don't get to do this much.)
Why not reply to the original post on Synology? It doesn't make sense to address it here.
Trying again. I tried to put all this there but the "comment" buttons didn't do anything. Maybe they were having an issue. Let's see...
-
No luck, even when signed in the "comment" and "reply" fields appear to be disabled. Which makes sense, this is their legacy forum.
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
As you stated, drives are read a sector at a time, not a bit at a time. Most drives are now offered with 4KB sector sizes, rather than the older 512 byte sector size, so the drives with a one URE in 10^14 bit rating actually have a one URE in 3,051,757,813 4KB sector rating. In the same four drive RAID 5 array, there are roughly 1,464,843,750 4KB sectors in the non-failed drives. Again there is a roughly 38.1% chance of failing to successfully rebuild when the failed drive is replaced. Here is the equation:
(1 - (3,051,757,812 / 3,051,757,813) ^ 1,464,843,750) = 0.381216604

This is a little confusing. While bigger sectors do mean bigger potential failures, it does not change the failure rate overall, because URE is measured in bit reads, not sector reads.
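This point can be verified: restating the same 1-in-10^14 bit rate per 4 KB sector and compounding over sector reads lands on the same ~38.1% figure, so the sector-size framing changes the units but not the risk. A sketch:

```python
BITS_READ = 48_000_000_000_000  # bits read rebuilding the four-drive RAID 5

# Per-bit formulation of the rebuild-failure chance.
per_bit = 1 - (1 - 1e-14) ** BITS_READ

# Per-sector formulation: same rate, restated per 4 KB (32,768-bit) sector.
SECTOR_BITS = 4096 * 8
p_sector = 1 - (1 - 1e-14) ** SECTOR_BITS  # URE chance per sector read
sectors = BITS_READ / SECTOR_BITS          # ~1.46 billion sector reads
per_sector = 1 - (1 - p_sector) ** sectors

print(f"per-bit:    {per_bit:.4f}")    # ~0.3812
print(f"per-sector: {per_sector:.4f}")  # ~0.3812
```

Both formulations are the same rate expressed at different granularity, which is why the two equations in the quoted post agree to three decimal places.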
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
@dashrender said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.
If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real-world testing.
I want to confirm - you've seen situations where a 4 TB drive has been filled, then read back 4 times and it fails - regularly?
Absolutely, everyone has. It's so common even non-IT people are used to it.
I wasn't thinking.. of course, I probably have run into this on a single drive and didn't realize what the issue was - a single file failed.. not generally a huge deal.
-
@dashrender said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
@dashrender said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.
If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real-world testing.
I want to confirm - you've seen situations where a 4 TB drive has been filled, then read back 4 times and it fails - regularly?
Absolutely, everyone has. It's so common even non-IT people are used to it.
I wasn't thinking.. of course, I probably have run into this on a single drive and didn't realize what the issue was - a single file failed.. not generally a huge deal.
Right, for most people, it happens most often in the used portion of a drive. Humans rarely read an entire drive full of data. If we image a drive, most of the space is never read back until after it has been overwritten again.
When we do, most files could be corrupt and we wouldn't care. I get this regularly with my video games because they take up so much space (many TB) and Steam just redownloads the corrupt files semi-automatically to fix the issue. Only the save file really matters, and those are generally tiny.
Image and video files are what affect normal users the most and because it is normally just seen as an artefact in the video or a smudge on the image people don't really care.
When it is a system file, we can normally repair it as it isn't a unique file. For normal users, it is amazing how little UREs actually matter.
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
For comparison, a four drive RAID 10 array composed of 2TB drives has a roughly 14.8% chance of failure during the rebuild:
(1 - (99,999,999,999,999 / 100,000,000,000,000) ^ 16,000,000,000,000) = 14.77%

This is confusing and wrong. RAID 10 doesn't rebuild. Only the underlying RAID 1 does. And RAID 1 doesn't have parity.
So the risk of hitting the URE is only over a 2TB space, not a 6TB space. The size of the overall RAID 10 isn't relevant like it is with a parity array, so all that info is a red herring. Also, this assumes single-mirror RAID 1, but with more mirrors we protect against UREs. So we have to be specific. Not everyone with RAID 1 does only a single mirror (but most do, sure.)
Then there is the behaviour risk. Parity RAID is assumed to have to drop the array in case of an unprotected URE because it affects an unknown amount of data as the entire array is a "single corrupt file", but in a mirror, it can be sector copied as is and has the option to behave the same as a single drive does when there is a URE.
So nothing about the comparison is useful. The math shows the chances of the array hitting a protected URE during multiple reads, not an unprotected URE during a resilver like the RAID 5 example. Apples and oranges.
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
@irj said in Responding to "This BS called URE" from Synology Forums:
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
So this guy trying to claim he knows something about math and computers makes a thread on the Synology forums seven years ago, and @Dashrender found it today. It seems like it never got any great responses, it is bad to leave this kind of misinformation out there, and doing these analyses is always good, so let's break it down. (Since moving from SW to ML, the amount of "correcting dumbassery and people trying to mislead others" has all but disappeared, so we don't get to do this much.)
Why not reply to the original post on Synology? It doesn't make sense to address it here.
Trying again. I tried to put all this there but the "comment" buttons didn't do anything. Maybe they were having an issue. Let's see...
Ah ok. I see
-
Roadkill401 states that he doesn't even know what the issue is that we are discussing, which exposes why he's giving us this bad advice. What is weird is that he gives the advice to do something crazily reckless, THEN admits he doesn't even know what the risk is that we are discussing.
But here is where I have the problem..
According to you, we are simply doomed by your calculation in reading any disks you are going to get a read error eventually and sooner than you really would hope for. Just hope that the error is in something unimportant like a jpeg file rather than something very important like your tax return software.
Based on your method of calculation, after reading 6tb of data from a hard disk, you have about a 50/50 chance that the data that you have read so far has a bit error inside of it. (1/10^14) × 4.8×10^13
The issue then is not that you could get an error, but how the Synology deals with these bit errors. Regardless of doing any rebuild you are going to see a read error that will not be detected by the RAID as every read of data that you take off the drive is not verified against any CRC calculation to determine if what you are reading is accurate to what the drive thought was written to it.
The premise is that when doing a RAID rebuild, that the process will stop on the occurrence of one of these read errors that WILL happen at some point in time of the first 11.3TB of data read off any of the disks. But why would this happen? Does the disk itself know that the data it just read was faulty and give an error to the Synology? Isn't that really a MTBF ?? Or is it just that when doing the CRC calculation to try and rebuild the missing block, that the calculation will result in a value that just is not possible so it will fail? But that doesn't make any sense either as all you are doing is for example, reading a bit that should say 10110000 and getting 10010000. A single bit error that will give you the wrong result but why would it actually stop anything.

So all you are really assured is that doing a rebuild, you are likely to get a bit error that will have a chance of changing some file at some point on the RAID disk. But the chances are about the same as you reading a file off the disk and getting a bit error and not knowing it, and then saving that now wrong file back to the disk.
I am perplexed then at what the issue really is ?
-
@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:
The premise is that when doing a RAID rebuild, that the process will stop on the occurrence of one of these read errors that WILL happen at some point in time of the first 11.3TB of data read off any of the disks. But why would this happen? Does the disk itself know that the data it just read was faulty and give an error to the Synology? Isn't that really a MTBF ?? Or is it just that when doing the CRC calculation to try and rebuild the missing block, that the calculation will result in a value that just is not possible so it will fail? But that doesn't make any sense either as all you are doing is for example, reading a bit that should say 10110000 and getting 10010000. A single bit error that will give you the wrong result but why would it actually stop anything.
So all you are really assured is that doing a rebuild, you are likely to get a bit error that will have a chance of changing some file at some point on the RAID disk. But the chances are about the same as you reading a file off the disk and getting a bit error and not knowing it, and then saving that now wrong file back to the disk.
I am perplexed then at what the issue really is ?
So there is "premise" and "what the issue really is."
First, it is not a premise, it is how MD RAID, and all enterprise-class RAID, works. In parity RAID we don't know what the impact is, because the RAID system has no knowledge of the data on top of it and the array acts like it is a single file (it is actually a volume, but the distinction doesn't matter here.) When an exposed URE is encountered, whatever unit of the affected layer it falls in is lost. In the case of mirrored RAID or no RAID, that unit is a sector. One bit is bad, the sector is scrapped. In the case of parity RAID, the minimum unit above it is the volume which is mapped to the array. So the entire array is lost, because it is a single unit that cannot be safely recalculated. This is just parity basics, it's not a premise.
So what the issue really is... is what has been stated ad nauseam, and what he is ignoring: that an exposed URE on a parity array being rebuilt causes the array to be in an unsafe state and dropped. It isn't that the array makers want to lose all of that data, it is just the granularity at which it can no longer be trusted.
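The behavioural difference can be sketched with a toy model (the function names and the trivial XOR "parity" here are invented for illustration; real MD RAID is far more involved): a parity rebuild needs every surviving member readable to recompute the lost drive, so a single exposed URE makes the whole recomputation untrustworthy, while a mirror rebuild loses only the one sector.

```python
def rebuild_parity(surviving_drives):
    """Recompute the lost drive from parity; abort on any unreadable sector.

    Each drive is a list of sector values; None models a URE.
    """
    rebuilt = []
    for stripe in zip(*surviving_drives):
        if any(sector is None for sector in stripe):
            return None  # one exposed URE: the whole array is dropped
        xor = 0
        for sector in stripe:
            xor ^= sector  # toy stand-in for the parity calculation
        rebuilt.append(xor)
    return rebuilt

def rebuild_mirror(partner):
    """Copy the surviving mirror partner; a URE costs only that one sector."""
    return list(partner)

healthy = [[1, 2], [3, 4], [5, 6]]
one_ure = [[1, None], [3, 4], [5, 6]]  # a single URE on one member

print(rebuild_parity(healthy))       # rebuild succeeds
print(rebuild_parity(one_ure))       # None: rebuild aborts, array lost
print(rebuild_mirror([7, None, 9]))  # one bad sector, copy keeps going
```

The point the toy model captures is the granularity: the mirror can behave like a lone drive with one bad sector, while the parity array has no unit smaller than the whole volume to give up.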