Is this server strategy reckless and/or insane?



  • I have 2 servers. Other than one having 4 more processor cores total, the servers are identical. Specs are:

    R620
    a: 2x octacore xeon, b: 2x decacore xeon
    128GB ram each
    1GB Perc H710P RAID controller
    2x 256GB Samsung 850 Pros ( Os and installs live here, Raid 1 )
    5x 1TB Samsung 850 Pros ( Data and file uploads live here, Raid 0 ) ( can add 1 more on the decacore and up to 3 more on the octacore later if desired, but this RAID controller gets pretty saturated at 4-5 from what I've read )

    My question is, I like the benefits of not having to leave the box to go from app code to database as the goal of this project is for it to be as absolutely instant and fast feeling as possible, so my plan is to basically configure the servers like so
    Full serverware stack on each ( IIS, app server, MySQL )
    One of the two will be the MySQL master and replicate to the other ( all writes will go to this server )
    The other server will be the image upload and processing box, and final images will all be copied to the other server in the background
    The two machines will be clustered for all web traffic save those two uses ( db writes and file uploads )
    Both will be ready, w/ just a few minutes downtime, to take over the full workload should the other fail, which includes a disk in either Raid 0 failing

    The plan would be
    If any piece of the DB master server fails, it drops out of the picture until I can get the failure resolved, and the 2nd server takes over all duties. The only piece I can't automate is the switch from slave to master, which I would handle manually and up to an hour of downtime is acceptable

    If any piece of the image processing server fails, all traffic would automatically go to the other server until I can resolve the failure, and no perceptible downtime would occur

    The only thing running on these servers will be a hobby project I made, and I'm ok w/ a little downtime in the event of a presumably unlikely hardware failure.

    What do you think? Is this a completely unorthodox approach? I like the idea of most web site requests being able to go through either server so I can make use of the horsepower of both of them and my goal is to make the fastest web site I've ever used so keeping the db and app code that touch each other on the same machine is ideal for me, as is using high-performance-within-my-budget techniques like a Raid 0 of SSDs.

    Let me know what you think, I'm a programmer not a server pro so there may be a ton of negatives I haven't thought of in this setup.

    Thanks!



  • I assume you don't care about the data on the RAID 0?



  • You're already saturating your RAID controller, so toss one more drive in each and make them RAID 5 (assuming these are SSDs). That's one less thing to rebuild, restore, or resync if you have a drive failure.


  • Service Provider

    That's pretty common. Master to slave automated, manual fail back. Works fine as long as you are around most of the time.



  • I was wrong, it looks like you can fully automatically fail over to the slave and set it as the new master w/ the latest MySQL set up, so that makes the decision a bit easier.

    As far as Raid 5 instead of 0, I'd thought that the performance of Raid 5 was absolutely terrible and that almost no one used it anymore, is that a wrong memory?



  • @creayt said in Is this server strategy reckless and/or insane?:

    I was wrong, it looks like you can fully automatically fail over to the slave and set it as the new master w/ the latest MySQL set up, so that makes the decision a bit easier.

    As far as Raid 5 instead of 0, I'd thought that the performance of Raid 5 was absolutely terrible and that almost no one used it anymore, is that a wrong memory?

    No one uses RAID5 with spinning rust.

    RAID5 is perfectly acceptable with SSDs



  • @dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.

    Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd


  • Service Provider

    @creayt said in Is this server strategy reckless and/or insane?:

    As far as Raid 5 instead of 0, I'd thought that the performance of Raid 5 was absolutely terrible and that almost no one used it anymore, is that a wrong memory?

    RAID 5 is the standard for SSDs. But you will take performance hits. But whether or not you can tell is the question. On an all flash array with caching, the hit is pretty small.
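
    To put a rough number on that performance hit, here is a minimal sketch using the textbook write-penalty model: a small random write on RAID 5 costs about four back-end I/Os (read data, read parity, write data, write parity), while RAID 0 costs one. The per-drive IOPS figure is a made-up round number, not a measured value for the 850 Pro, and the controller cache is ignored, so treat the output as an order-of-magnitude comparison only.

```python
# Rough random-write estimate using the textbook write-penalty model.
# Controller cache is ignored, so real-world numbers will differ.

WRITE_PENALTY = {"raid0": 1, "raid5": 4}  # back-end I/Os per small host write


def random_write_iops(level: str, drives: int, per_drive_iops: int) -> float:
    """Estimate host-visible random-write IOPS for the whole array."""
    if level not in WRITE_PENALTY:
        raise ValueError(f"unsupported RAID level: {level}")
    if level == "raid5" and drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return drives * per_drive_iops / WRITE_PENALTY[level]


# Hypothetical per-drive figure, chosen only to keep the arithmetic readable.
ASSUMED_DRIVE_IOPS = 10_000

raid0 = random_write_iops("raid0", 5, ASSUMED_DRIVE_IOPS)
raid5 = random_write_iops("raid5", 6, ASSUMED_DRIVE_IOPS)
print(f"RAID 0, 5 drives: {raid0:,.0f} write IOPS")
print(f"RAID 5, 6 drives: {raid5:,.0f} write IOPS")
```

    Reads are not penalized in this model, so the gap only matters for write-heavy workloads, and a write-back cache on the controller can hide part of it.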


  • Service Provider

    @creayt said in Is this server strategy reckless and/or insane?:

    @dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.

    Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd

    Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.



  • @scottalanmiller said in Is this server strategy reckless and/or insane?:

    @creayt said in Is this server strategy reckless and/or insane?:

    @dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.

    Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd

    Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.

    850 pros are not enterprise drives.


  • Service Provider

    @tim_g said in Is this server strategy reckless and/or insane?:

    @scottalanmiller said in Is this server strategy reckless and/or insane?:

    @creayt said in Is this server strategy reckless and/or insane?:

    @dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.

    Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd

    Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.

    850 pros are not enterprise drives.

    Whoops, missed that.



  • @tim_g said in Is this server strategy reckless and/or insane?:

    @scottalanmiller said in Is this server strategy reckless and/or insane?:

    @creayt said in Is this server strategy reckless and/or insane?:

    @dashrender I care about it, but because it's automatically replicated after each write there's a fully up-to-date, ready-to-go backup of it the next U down at all times. Could/would also push nightly backups offsite somewhere I suppose.

    Looks like Raid 5 for SSDs can also, possibly, shorten their lifespan because of the parity writes: https://serverfault.com/questions/513909/what-are-the-main-points-to-avoid-raid5-with-ssd

    Yes, but with enterprise drives and cache buffering, that's trivial. You are typically looking at decades before failure.

    850 pros are not enterprise drives.

    I was too slow to respond. I didn't miss that. 😉



  • Let me ask this.

    The only thing that'll be stored on each Raid 0/5 is

    The MySQL data files ( not the MySQL installation )
    and
    The image uploads

    So if a drive in the Raid 0 fails, I simply replace the drive, recreate the virtual disk, and then copy the database and images, which I think takes just a few minutes w/ two systems of this caliber 1U away from each other especially w/ so many cores to spare ( won't be competing w/ the load of the live site ).

    So, since I have to drive an SSD over to the datacenter 10 minutes away, open the box, and get it in, a few more minutes for the copy feels like it'll be negligibly more time than if it failed w/ a Raid 5, where it would stay online ( though I don't know if my set up lets you do the Raid 5 replacement while the OS is running, maybe it does, or maybe I just hot swap the drive I'm not sure ).

    So, because the full penalty for a Raid 0 failing vs. a Raid 5 in my set up is basically a few more minutes to copy the stuff manually, seems like the performance improvements would be worth the gamble. Is that logic sound or do y'all think just keeping the array online is better so 5 is the way to go anyway?



  • Just an FYI:

     
    Posted by
    DELL-Josh Cr 
    on 16 Mar 2015 15:41 
    
    Hi,
    ...if it is not a Dell drive we won’t have put our firmware on it that is designed for our controllers and we will not have validated it....
    
    Thanks,
    Josh Craig
    Dell EMC | Enterprise Support Services
    


  • @creayt Also forgot to bring up that Raid 0 also gives me way more capacity right so it'd give me terabyte(s) more before I had to scale to extra hardware? Can't remember how much Raid 5 subtracts.


  • Service Provider

    That's not a horrible recovery strategy. But if the question is performance, how much downtime or effort caused by that offsets the performance difference? That's a real question. Will anyone notice the performance difference day to day? Will they notice five minutes or an hour of downtime? Will you notice having to do all of that work that could have been avoided?

    Those are the real questions.



  • @creayt said in Is this server strategy reckless and/or insane?:

    Let me ask this.

    The only thing that'll be stored on each Raid 0/5 is

    The MySQL data files ( not the MySQL installation )
    and
    The image uploads

    So if a drive in the Raid 0 fails, I simply replace the drive, recreate the virtual disk, and then copy the database and images, which I think takes just a few minutes w/ two systems of this caliber 1U away from each other especially w/ so many cores to spare ( won't be competing w/ the load of the live site ).

    So, since I have to drive an SSD over to the datacenter 10 minutes away, open the box, and get it in, a few more minutes for the copy feels like it'll be negligibly more time than if it failed w/ a Raid 5, where it would stay online ( though I don't know if my set up lets you do the Raid 5 replacement while the OS is running, maybe it does, or maybe I just hot swap the drive I'm not sure ).

    So, because the full penalty for a Raid 0 failing vs. a Raid 5 in my set up is basically a few more minutes to copy the stuff manually, seems like the performance improvements would be worth the gamble. Is that logic sound or do y'all think just keeping the array online is better so 5 is the way to go anyway?

    Keeping the OBR5 online and recovering from that would be faster than having to completely rebuild an OBR0.



  • @tim_g What are the implications of this, do you know? For what it's worth none of these drives do the amber light thing in either server, all green and they report as SSDs etc. in the lifecycle tooling.



  • @creayt said in Is this server strategy reckless and/or insane?:

    @creayt Also forgot to bring up that Raid 0 also gives me way more capacity right so it'd give me terabyte(s) more before I had to scale to extra hardware? Can't remember how much Raid 5 subtracts.

    How much storage does this system need?



  • @dustinb3403 It's a community style site that's some kind of hybrid between Reddit and something like Mango Lassi, so the more users I get, the more content they'll generate ( mostly in the form of MySQL data ) and the more footprint I'll need, eventually having to go cloud probably if it takes off. But will be a huge volume of small database writes happening pretty much 24/7.



  • @creayt said in Is this server strategy reckless and/or insane?:

    Let me ask this.

    The only thing that'll be stored on each Raid 0/5 is

    The MySQL data files ( not the MySQL installation )
    and
    The image uploads

    So if a drive in the Raid 0 fails, I simply replace the drive, recreate the virtual disk, and then copy the database and images, which I think takes just a few minutes w/ two systems of this caliber 1U away from each other especially w/ so many cores to spare ( won't be competing w/ the load of the live site ).

    So, since I have to drive an SSD over to the datacenter 10 minutes away, open the box, and get it in, a few more minutes for the copy feels like it'll be negligibly more time than if it failed w/ a Raid 5, where it would stay online ( though I don't know if my set up lets you do the Raid 5 replacement while the OS is running, maybe it does, or maybe I just hot swap the drive I'm not sure ).

    So, because the full penalty for a Raid 0 failing vs. a Raid 5 in my set up is basically a few more minutes to copy the stuff manually, seems like the performance improvements would be worth the gamble. Is that logic sound or do y'all think just keeping the array online is better so 5 is the way to go anyway?

    As long as you have good backups, I guess this is doable. The cost of the extra drive over the life of the system seems pretty low. I guess I'd have to see how badly the RAID 5 penalty hits versus RAID 0 to see if that drive performance is worth the risk.

    UREs are probably pretty low on these SSDs, but not zero, so something else to consider: what are the chances of a URE killing your RAID 0? (now Scott will educate me that these don't matter 😛 - seriously don't know if they do or not)



  • @creayt said in Is this server strategy reckless and/or insane?:

    @creayt Also forgot to bring up that Raid 0 also gives me way more capacity right so it'd give me terabyte(s) more before I had to scale to extra hardware? Can't remember how much Raid 5 subtracts.

    One drive worth.
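
    As a quick sketch of that arithmetic for the drives described in this thread (1 TB each, 5 installed, room for a 6th): RAID 0 exposes every drive's capacity, while RAID 5 gives up one drive's worth to parity. The function name is just for illustration, and formatting and filesystem overhead are ignored.

```python
def usable_tb(level: str, drives: int, drive_tb: float = 1.0) -> float:
    """Usable capacity in TB, ignoring formatting and filesystem overhead."""
    if level == "raid0":
        return drives * drive_tb
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        # One drive's worth of space is consumed by parity.
        return (drives - 1) * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


for level, drives in [("raid0", 5), ("raid5", 5), ("raid5", 6)]:
    print(f"{level} with {drives} x 1 TB: {usable_tb(level, drives)} TB usable")
```

    So adding the 6th drive and going RAID 5 ends up at the same usable space as the current 5-drive RAID 0.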



  • @creayt said in Is this server strategy reckless and/or insane?:

    @tim_g What are the implications of this, do you know? For what it's worth none of these drives do the amber light thing in either server, all green and they report as SSDs etc. in the lifecycle tooling.

    They may work great for 5 years straight... or they may give errors randomly after 5 months for no apparent reason. Performance may be degraded, or it may not. PERC or other features may be lost without Dell's firmware on the SSDs. Your data may be perfectly safe, or it may not be.

    The odds of the above not going in your favor are higher than they would be with Dell's firmware on them.

    I wouldn't do it on production servers. But it's your call.



  • @dustinb3403 Building the RAID itself takes under 2 minutes, but each server restart seems to take forever ( at least a minute or two, or three, or four ) because of how slow the RAM configuration step etc. is, so good point.



  • @tim_g "Would not do it" meaning what, you'd buy the Dell certified SSDs? Aren't those like 4-10x the market value/price of similar options?



  • @Dashrender UREs haven't been proven to exist on SSDs, so really it's not even a consideration.

    What matters is that if he has an SSD die in OBR0, he's rebuilding whether he wants to or not. At 1 AM or at 1 PM.

    With OBR5 he at least has a buffer to be able to say, OK, I need to replace this drive, and do so at a reasonable time. Because if he is down to a single host and that host loses a drive, then he's done for and has to recover everything.
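
    A back-of-the-envelope sketch of that difference, assuming drives fail independently with the same yearly probability. The 2% figure below is purely hypothetical, not a spec for any drive in this thread. RAID 0 is lost on the first failure; RAID 5 is only lost when two or more drives fail. The RAID 5 number is deliberately pessimistic because it pretends a failed drive is never replaced during the year.

```python
def p_raid0_loss(drives: int, p_fail: float) -> float:
    """Probability that at least one of the drives fails (array is lost)."""
    return 1 - (1 - p_fail) ** drives


def p_raid5_loss(drives: int, p_fail: float) -> float:
    """Probability that two or more drives fail, with no replacement.

    Computed as 1 - P(no failures) - P(exactly one failure).
    """
    p_none = (1 - p_fail) ** drives
    p_one = drives * p_fail * (1 - p_fail) ** (drives - 1)
    return 1 - p_none - p_one


ASSUMED_ANNUAL_FAILURE = 0.02  # hypothetical per-drive yearly failure probability

raid0 = p_raid0_loss(5, ASSUMED_ANNUAL_FAILURE)
raid5 = p_raid5_loss(6, ASSUMED_ANNUAL_FAILURE)
print(f"RAID 0 (5 drives) chance of array loss in a year: {raid0:.2%}")
print(f"RAID 5 (6 drives) chance of array loss in a year: {raid5:.2%}")
```

    Whatever failure rate you plug in, the point stands: with RAID 0 every single-drive failure is an outage on that host, while RAID 5 turns it into a scheduled drive swap.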



  • @creayt said in Is this server strategy reckless and/or insane?:

    @tim_g "Would not do it" meaning what, you'd buy the Dell certified SSDs? Aren't those like 4-10x the market value/price of similar options?

    No, because you can only compare the price with other Enterprise class drives with custom firmware for the vendor in question.



  • I wouldn't do RAID 0. Is the DATA copied synchronously or async?



  • @tim_g It'll be using the standard MySQL replication so I believe asynchronously but I'm not positive.



  • You can remove the problem of non-vendor drives by using a generic RAID controller instead of a branded one from Dell.

