Is this server strategy reckless and/or insane?
-
@creayt said in Is this server strategy reckless and/or insane?:
As far as Raid 5 instead of 0, I'd thought that the performance of Raid 5 was absolutely terrible and that almost no one used it anymore, is that a wrong memory?
RAID 5 is the standard for SSDs, but you will take performance hits. Whether or not you can tell is the question. On an all-flash array with caching, the hit is pretty small.
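For rough numbers on that hit, here's a minimal sketch of the classic write-penalty arithmetic; the drive count and per-drive IOPS below are made-up figures, not measurements from this setup, and a controller write cache will soften the gap considerably:

```python
# Back-of-envelope write IOPS for RAID 0 vs RAID 5.
# Per-drive IOPS and drive count are illustrative assumptions,
# not measurements from this thread.

def effective_write_iops(drives: int, iops_per_drive: int, penalty: int) -> float:
    # Steady-state estimate: raw IOPS divided by the number of
    # back-end operations each logical write costs.
    return drives * iops_per_drive / penalty

DRIVES = 4
SSD_WRITE_IOPS = 90_000  # hypothetical per-drive random write IOPS

# RAID 0 writes cost one back-end op; RAID 5 costs four (read data,
# read parity, write data, write parity).
print(f"RAID 0: {effective_write_iops(DRIVES, SSD_WRITE_IOPS, 1):,.0f} write IOPS")
print(f"RAID 5: {effective_write_iops(DRIVES, SSD_WRITE_IOPS, 4):,.0f} write IOPS")
```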
-
@creayt said in Is this server strategy reckless and/or insane?:
@dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.
Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd
Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.
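A hedged back-of-envelope endurance calculation shows why the parity overhead is usually trivial; the TBW rating, daily write volume, and amplification factor here are all illustrative assumptions, not specs from this thread:

```python
# Rough endurance estimate under RAID 5 parity write amplification.
# Every figure here is a placeholder assumption.
TBW_RATING_TB = 300             # assumed drive endurance rating (terabytes written)
HOST_WRITES_TB_PER_DAY = 0.05   # assumed ~50 GB/day landing on this drive
PARITY_AMPLIFICATION = 2.0      # crude factor for the extra parity writes

days = TBW_RATING_TB / (HOST_WRITES_TB_PER_DAY * PARITY_AMPLIFICATION)
print(f"~{days / 365:.0f} years to exhaust the rated endurance")
# -> roughly 8 years with these numbers; scales down linearly for heavier workloads
```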
-
@scottalanmiller said in Is this server strategy reckless and/or insane?:
@creayt said in Is this server strategy reckless and/or insane?:
@dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.
Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd
Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.
850 Pros are not enterprise drives.
-
@tim_g said in Is this server strategy reckless and/or insane?:
@scottalanmiller said in Is this server strategy reckless and/or insane?:
@creayt said in Is this server strategy reckless and/or insane?:
@dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.
Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd
Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.
850 Pros are not enterprise drives.
Whoops, missed that.
-
@tim_g said in Is this server strategy reckless and/or insane?:
@scottalanmiller said in Is this server strategy reckless and/or insane?:
@creayt said in Is this server strategy reckless and/or insane?:
@dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.
Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd
Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.
850 Pros are not enterprise drives.
I was too slow to respond... I didn't miss that.
-
Let me ask this.
The only thing that'll be stored on each Raid 0/5 is
The MySQL data files ( not the MySQL installation )
and
The image uploads
So if a drive in the Raid 0 fails, I simply replace the drive, recreate the virtual disk, and then copy the database and images, which I think takes just a few minutes w/ two systems of this caliber 1U away from each other especially w/ so many cores to spare ( won't be competing w/ the load of the live site ).
So, since I have to drive an SSD over to the datacenter 10 minutes away, open the box, and get it in, a few more minutes for the copy feels like it'll be negligibly more time than if it failed w/ a Raid 5, where it would stay online ( though I don't know if my set up lets you do the Raid 5 replacement while the OS is running, maybe it does, or maybe I just hot swap the drive I'm not sure ).
So, because the full penalty for a Raid 0 failing vs. a Raid 5 in my set up is basically a few more minutes to copy the stuff manually, seems like the performance improvements would be worth the gamble. Is that logic sound or do y'all think just keeping the array online is better so 5 is the way to go anyway?
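As a sanity check on the "few more minutes" figure, here is a rough transfer-time sketch; the data size, link speed, and efficiency are all guesses rather than numbers from this setup:

```python
# Sanity check on the "few more minutes to copy" assumption.
# Data size, link speed, and efficiency are guesses, not from this setup.
DATA_GB = 200        # assumed size of MySQL data files + image uploads
LINK_GBPS = 10       # assumed NIC speed between the two 1U neighbors
EFFICIENCY = 0.6     # fraction of line rate actually sustained end to end

seconds = (DATA_GB * 8) / (LINK_GBPS * EFFICIENCY)
print(f"~{seconds / 60:.0f} minutes to copy {DATA_GB} GB")
# -> about 4 minutes at these numbers; on 1 GbE it's closer to 45 minutes
```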
-
Just an FYI:
Posted by DELL-Josh Cr on 16 Mar 2015 15:41: Hi, ...if it is not a Dell drive we won't have put our firmware on it that is designed for our controllers and we will not have validated it.... Thanks, Josh Craig, Dell EMC | Enterprise Support Services
-
@creayt Also forgot to bring up that Raid 0 also gives me way more capacity, right? So it'd give me terabyte(s) more before I had to scale to extra hardware. Can't remember how much Raid 5 subtracts.
-
That's not a horrible recovery strategy. But if the question is performance, how much downtime or effort caused by that offsets the performance difference? That's a real question. Will anyone notice the performance difference day to day? Will they notice five minutes or an hour of downtime? Will you notice having to do all of that work that could have been avoided?
Those are the real questions.
-
@creayt said in Is this server strategy reckless and/or insane?:
Let me ask this.
The only thing that'll be stored on each Raid 0/5 is
The MySQL data files ( not the MySQL installation )
and
The image uploads
So if a drive in the Raid 0 fails, I simply replace the drive, recreate the virtual disk, and then copy the database and images, which I think takes just a few minutes w/ two systems of this caliber 1U away from each other especially w/ so many cores to spare ( won't be competing w/ the load of the live site ).
So, since I have to drive an SSD over to the datacenter 10 minutes away, open the box, and get it in, a few more minutes for the copy feels like it'll be negligibly more time than if it failed w/ a Raid 5, where it would stay online ( though I don't know if my set up lets you do the Raid 5 replacement while the OS is running, maybe it does, or maybe I just hot swap the drive I'm not sure ).
So, because the full penalty for a Raid 0 failing vs. a Raid 5 in my set up is basically a few more minutes to copy the stuff manually, seems like the performance improvements would be worth the gamble. Is that logic sound or do y'all think just keeping the array online is better so 5 is the way to go anyway?
Keeping the OBR5 online and recovering from that would be faster than having to completely rebuild an OBR0.
-
@tim_g What are the implications of this, do you know? For what it's worth none of these drives do the amber light thing in either server, all green and they report as SSDs etc. in the lifecycle tooling.
-
@creayt said in Is this server strategy reckless and/or insane?:
@creayt Also forgot to bring up that Raid 0 also gives me way more capacity, right? So it'd give me terabyte(s) more before I had to scale to extra hardware. Can't remember how much Raid 5 subtracts.
How much storage does this system need?
-
@dustinb3403 It's a community-style site that's some kind of hybrid between Reddit and something like Mango Lassi, so the more users I get, the more content they'll generate ( mostly in the form of MySQL data ) and the more footprint I'll need, eventually having to go cloud probably if it takes off. But there will be a huge volume of small database writes happening pretty much 24/7.
-
@creayt said in Is this server strategy reckless and/or insane?:
Let me ask this.
The only thing that'll be stored on each Raid 0/5 is
The MySQL data files ( not the MySQL installation )
and
The image uploads
So if a drive in the Raid 0 fails, I simply replace the drive, recreate the virtual disk, and then copy the database and images, which I think takes just a few minutes w/ two systems of this caliber 1U away from each other especially w/ so many cores to spare ( won't be competing w/ the load of the live site ).
So, since I have to drive an SSD over to the datacenter 10 minutes away, open the box, and get it in, a few more minutes for the copy feels like it'll be negligibly more time than if it failed w/ a Raid 5, where it would stay online ( though I don't know if my set up lets you do the Raid 5 replacement while the OS is running, maybe it does, or maybe I just hot swap the drive I'm not sure ).
So, because the full penalty for a Raid 0 failing vs. a Raid 5 in my set up is basically a few more minutes to copy the stuff manually, seems like the performance improvements would be worth the gamble. Is that logic sound or do y'all think just keeping the array online is better so 5 is the way to go anyway?
As long as you have good backups, I guess this is doable. The cost of the extra drive over the life of the system seems pretty low. I guess I'd have to see how badly the RAID 5 penalty hits versus RAID 0 to see if that drive performance is worth the risk.
UREs are probably pretty low on these SSDs, but not zero, so something else to consider: what are the chances of a URE killing your RAID 0? (now Scott will educate me that these don't matter - seriously, I don't know if they do or not)
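The URE math is easy to sketch if you assume a published error rate; both numbers below are placeholders, since SSD vendors publish very different (and often much better) rates than spinning disks:

```python
import math

# Chance of at least one URE while reading a whole array back.
# Both the error rate and the array size are assumptions.
URE_PER_BIT = 1e-16   # assumed unrecoverable read error rate
ARRAY_TB = 4          # assumed amount of data to read

bits = ARRAY_TB * 1e12 * 8
# 1 - (1 - p)^n for tiny p, computed in a numerically stable way
p = -math.expm1(-bits * URE_PER_BIT)
print(f"P(at least one URE over {ARRAY_TB} TB) ~ {p:.2%}")
# -> about 0.32% with these assumed numbers
```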
-
@creayt said in Is this server strategy reckless and/or insane?:
@creayt Also forgot to bring up that Raid 0 also gives me way more capacity right so it'd give me terabyte(s) more before I had to scale to extra hardware? Can't remember how much Raid 5 subtracts.
One drive's worth.
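In formula terms, RAID 0 gives n x c of usable space across n drives of capacity c, while RAID 5 gives (n - 1) x c. A minimal sketch with a hypothetical four-drive example:

```python
# Usable capacity: RAID 0 stripes everything, RAID 5 gives up one drive to parity.
def usable_tb(drives: int, drive_tb: float, raid_level: int) -> float:
    return drives * drive_tb if raid_level == 0 else (drives - 1) * drive_tb

# Hypothetical example: four 1 TB SSDs.
print(usable_tb(4, 1.0, raid_level=0))  # 4.0 TB usable
print(usable_tb(4, 1.0, raid_level=5))  # 3.0 TB usable
```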
-
@creayt said in Is this server strategy reckless and/or insane?:
@tim_g What are the implications of this, do you know? For what it's worth none of these drives do the amber light thing in either server, all green and they report as SSDs etc. in the lifecycle tooling.
They may work great for 5 years straight... or they may give errors randomly after 5 months for no apparent reason. Performance may be degraded, or it may not. PERC or other features may be lost without Dell's firmware on the SSDs. Your data may be perfectly safe, or it may not be.
The odds of the above not going in your favor are higher without Dell's firmware on them than with it.
I wouldn't do it on production servers. But it's your call.
-
@dustinb3403 Building the Raid itself takes under 2 minutes, but each server restart seems to take forever ( at least a minute or two or three or four ) because of how slow the RAM configuration and other steps are, so good point.
-
@tim_g "Would not do it" meaning what, you'd buy the Dell certified SSDs? Aren't those like 4-10x the market value/price of similar options?
-
@Dashrender UREs haven't been proven to exist on SSDs, so really it's not even a consideration.
What matters is that if he has an SSD die in an OBR0, he's rebuilding whether he wants to or not, at 1 AM or at 1 PM.
With OBR5 he at least has a buffer: he can say, OK, I need to replace this drive, and do so at a reasonable time. Because if he is down to a single host and that host loses a drive, then he's done for and has to recover everything.
-
@creayt said in Is this server strategy reckless and/or insane?:
@tim_g "Would not do it" meaning what, you'd buy the Dell certified SSDs? Aren't those like 4-10x the market value/price of similar options?
No, because you can only compare the price with other enterprise-class drives that have custom firmware for the vendor in question.