Application clustering VS RAID with modern SSD
-
There are multiple layers that can address these needs and remove the need for RAID. Essentially, all of them are things that are "more than" RAID.
-
In many cases, such as a two node application cluster, you can certainly do without RAID, but you typically do so by effectively replicating RAID in a different way.
So let's take a MySQL two node cluster as an example. Doing MySQL clustering is fine, but for all intents and purposes it requires a combination of manual mirroring and the application clustering acting like Network RAID 1. It's not actually RAID, but it is acting like it.
The reason you normally use RAID, local RAID that is, is to avoid node rebuilds. Whether it's Network RAID, application mirroring, or whatever, a node rebuild has an impact. If you have no local RAID, those rebuilds become incredibly frequent rather than rare to non-existent.
-
For some things, like stateless app servers, rebuilds might not affect the app at all, so skipping RAID might be totally feasible with little to no impact. But for something like a database, where other nodes have to recreate data over the network, the impact might be pretty negative.
-
Thanks @scottalanmiller. I think I’ll start with DRBD over two single PCIe NVMe cards (one in each node) synced through an InfiniBand link (InfiniBand is cheap today!), and I will slowly move every capable workload to application clustering.
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
Thanks @scottalanmiller. I think I’ll start with DRBD over two single PCIe NVMe cards (one in each node) synced through an InfiniBand link (InfiniBand is cheap today!), and I will slowly move every capable workload to application clustering.
Just remember, application clustering at that level is normally a large cost, whereas RAID is a low cost. And application clustering is slow, while RAID is fast.
But DRBD cannot be part of application clustering. DRBD is a RAID system. So you can, of course, use DRBD and application clustering, but they are two very different things.
Local RAID is about speed and low cost. DRBD is Network RAID, so high cost and introduces latency. Application clustering doesn't need RAID of any sort, as it is already clustered. You would use totally independent storage for application clustering.
-
But consider cost and risk of traditional RAID 1 vs. mirrored application clustering for a workload like MariaDB (just as a sample.)
Base Server: $10K
Application Clustering: You need two servers, so your cost is $20,000. And that's assuming application clustering is available for the workload, and free. It is free with MariaDB, so this is a good use case.
Traditional RAID: You need an extra SSD for your one server, so say add $500 onto your base cost for a total of $10,500. That's a fraction of the cost of the application clustering.
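As a rough sanity check, here is a minimal Python sketch of that arithmetic. It uses only the two figures above ($10K base server, $500 extra SSD); the function and constant names are made up for illustration, and it assumes the clustering software itself is free, as it is with MariaDB.

```python
# Figures come from the comparison above; all names are illustrative only.
BASE_SERVER = 10_000   # one base server, USD
EXTRA_SSD = 500        # one additional SSD to form a RAID 1 mirror, USD


def clustering_cost(nodes: int = 2, license_per_node: int = 0) -> int:
    """Cost of an application cluster; licensing is assumed free by default."""
    return nodes * (BASE_SERVER + license_per_node)


def raid1_cost() -> int:
    """Cost of a single server with one extra mirrored SSD."""
    return BASE_SERVER + EXTRA_SSD


cluster = clustering_cost()
raid = raid1_cost()
print(f"Application clustering: ${cluster:,}")
print(f"Traditional RAID 1:     ${raid:,}")
print(f"Difference:             ${cluster - raid:,}")
```

Passing a non-zero `license_per_node` is one way to see how quickly the gap widens for workloads where clustering is a paid add-on.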
-
Performance:
Application Clustering: Because data has to be synced over the network, there is a performance hit from application clustering. For enough money, you can minimize this greatly, but it just costs more and more to do so.
Traditional RAID: RAID 1 is faster than no-RAID. And moving to things like RAID 10 can speed you up even more. So rather than taking a performance hit, RAID for protection of this nature will speed you up.
-
Reliability:
Application Clustering requires everything be duplicated, even CPUs and RAM, so there are some reliability benefits from the high cost of redundancy. But typically these are minor, as the extra redundancy is typically around pieces that rarely fail. It's a brute force redundancy, rather than a finesse redundancy.
RAID targets the pieces of the system that are most fragile and critical - the storage. It is the drives failing alone that causes full data loss, and drives represent the majority of hardware failures. So you get 99% of the protection, at a fraction of the price.
Because RAID is so mature and reliable, there is an argument that, combined with its insane speed, cache protection options, and such, it will actually be safer than comparable application layer protections.
-
Effort:
Application Clustering: Requires a lot of expertise, unique to each and every workload, and the cluster must then be monitored, maintained, and updated to keep working. This often triples or quadruples the effort to build and maintain a workload, and in extreme cases can be far worse. It is an ongoing effort requiring expertise around maintaining clustering and dealing with edge situations, software changes, and so forth. This is generally outside of the skill set of many IT shops, depending on the workloads. Some clustering, like Windows AD, is really simple; some, like many databases, is very hard.
RAID: Zero effort. Tell it to turn on, then ignore it. There is nothing to know or do, and the system can be safely turned over even to non-technical staff to maintain.
-
Cost of Licensing:
Application Clustering: This is often a costly add-on to many software products (and is not always available), and often requires extra software purchases. For example, with MS SQL Server it generally requires more Windows Server and SQL Server licenses, plus additional cost for the application clustering layer. So for many workloads, and any on Windows, the licensing cost soars rapidly.
RAID: No known products have any licensing costs tied to block storage redundancy, so there is no cost for this in the real world.
-
@scottalanmiller I see your points, but let me just add some additional information about the current configuration:
- we already have three identical servers and one NVMe PCIe card;
- I want to use DRBD replication only for stuff that cannot be made highly available without upgrades like SQL Server Standard, etc. I'm thinking of using Syncthing for file replication.
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
But consider cost and risk of traditional RAID 1 vs. mirrored application clustering for a workload like MariaDB (just as a sample.)
Base Server: $10K
Application Clustering: You need two servers, so your cost is $20,000. And that's assuming application clustering is available for the workload, and free. It is free with MariaDB, so this is a good use case.
Traditional RAID: You need an extra SSD for your one server, so say add $500 onto your base cost for a total of $10,500. That's a fraction of the cost of the application clustering.
I already have servers, they are the same spec and out of vendor support.
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Performance:
Application Clustering: Because data has to be synced over the network, there is a performance hit from application clustering. For enough money, you can minimize this greatly, but it just costs more and more to do so.
Traditional RAID: RAID 1 is faster than no-RAID. And moving to things like RAID 10 can speed you up even more. So rather than taking a performance hit, RAID for protection of this nature will speed you up.
Async replication has almost NO performance hit on the master.
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Reliability:
Application Clustering requires everything be duplicated, even CPUs and RAM, so there are some reliability benefits from the high cost of redundancy. But typically these are minor, as the extra redundancy is typically around pieces that rarely fail. It's a brute force redundancy, rather than a finesse redundancy.
RAID targets the pieces of the system that are most fragile and critical - the storage. It is the drives failing alone that causes full data loss, and drives represent the majority of hardware failures. So you get 99% of the protection, at a fraction of the price.
Because RAID is so mature and reliable, there is an argument that, combined with its insane speed, cache protection options, and such, it will actually be safer than comparable application layer protections.
This is unfair, you are really comparing apples to oranges: in one case you have a completely shared-nothing cluster, in the other you are just protected from storage disk failure. What if the CPU/mobo/controller/PSU/etc. fails?
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Effort:
Application Clustering: Requires a lot of expertise, unique to each and every workload, and the cluster must then be monitored, maintained, and updated to keep working. This often triples or quadruples the effort to build and maintain a workload, and in extreme cases can be far worse. It is an ongoing effort requiring expertise around maintaining clustering and dealing with edge situations, software changes, and so forth. This is generally outside of the skill set of many IT shops, depending on the workloads. Some clustering, like Windows AD, is really simple; some, like many databases, is very hard.
RAID: Zero effort. Tell it to turn on, then ignore it. There is nothing to know or do, and the system can be safely turned over even to non-technical staff to maintain.
Mostly true, but very different stuff.
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Cost of Licensing:
Application Clustering: This is often a costly add-on to many software products (and is not always available), and often requires extra software purchases. For example, with MS SQL Server it generally requires more Windows Server and SQL Server licenses, plus additional cost for the application clustering layer. So for many workloads, and any on Windows, the licensing cost soars rapidly.
RAID: No known products have any licensing costs tied to block storage redundancy, so there is no cost for this in the real world.
This is true, and I’m trying to avoid that cost via DRBD replication.
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
- I want to use DRBD replication only for stuff that cannot be made highly available without upgrades like SQL Server Standard, etc. I'm thinking of using Syncthing for file replication.
That can work, but things like RSYNC are often better for that. Less latency.
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
@scottalanmiller said in Application clustering VS RAID with modern SSD:
But consider cost and risk of traditional RAID 1 vs. mirrored application clustering for a workload like MariaDB (just as a sample.)
Base Server: $10K
Application Clustering: You need two servers, so your cost is $20,000. And that's assuming application clustering is available for the workload, and free. It is free with MariaDB, so this is a good use case.
Traditional RAID: You need an extra SSD for your one server, so say add $500 onto your base cost for a total of $10,500. That's a fraction of the cost of the application clustering.
I already have servers, they are the same spec and out of vendor support.
That's very different. If the goal is to use "whatever hardware is lying around" rather than designing for a specific use case, then anything that fits the needs of the hardware might make sense.
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Performance:
Application Clustering: Because data has to be synced over the network, there is a performance hit from application clustering. For enough money, you can minimize this greatly, but it just costs more and more to do so.
Traditional RAID: RAID 1 is faster than no-RAID. And moving to things like RAID 10 can speed you up even more. So rather than taking a performance hit, RAID for protection of this nature will speed you up.
Async replication has almost NO performance hit on the master.
HA Application Clustering is always sync, though, not async. The application needs to wait for confirmation from its peers before unlocking, or else it is not competing with RAID for data protection.
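To illustrate why async does not count here, below is a toy Python model, not a real replication implementation. The `Cluster` and `Node` classes and their methods are hypothetical names invented for this sketch. It only tracks which acknowledged writes exist on the primary alone at a given moment, which is the data you would lose if the primary died right then.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    data: list[str] = field(default_factory=list)


@dataclass
class Cluster:
    """Toy two-node cluster; 'synchronous' selects the replication mode."""
    synchronous: bool
    primary: Node = field(default_factory=Node)
    replica: Node = field(default_factory=Node)
    pending: list[str] = field(default_factory=list)  # acked but not yet shipped

    def write(self, record: str) -> str:
        """Commit a record and return the acknowledgement the client sees."""
        self.primary.data.append(record)
        if self.synchronous:
            # Sync: the peer has the record before the client gets its ack.
            self.replica.data.append(record)
        else:
            # Async: ack immediately, ship to the peer some time later.
            self.pending.append(record)
        return "ack"

    def flush(self) -> None:
        """Ship queued writes to the replica (only async mode queues any)."""
        self.replica.data.extend(self.pending)
        self.pending.clear()

    def lost_if_primary_dies(self) -> list[str]:
        """Acknowledged records that exist only on the primary right now."""
        return list(self.pending)


for mode in (True, False):
    cluster = Cluster(synchronous=mode)
    for i in range(3):
        cluster.write(f"row-{i}")
    label = "sync" if mode else "async"
    print(label, "-> at risk:", cluster.lost_if_primary_dies())
```

In this model the sync cluster never has acknowledged data at risk, while the async cluster does until `flush()` runs. That window is the price of avoiding the wait on the master.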
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Reliability:
Application Clustering requires everything be duplicated, even CPUs and RAM, so there are some reliability benefits from the high cost of redundancy. But typically these are minor, as the extra redundancy is typically around pieces that rarely fail. It's a brute force redundancy, rather than a finesse redundancy.
RAID targets the pieces of the system that are most fragile and critical - the storage. It is the drives failing alone that causes full data loss, and drives represent the majority of hardware failures. So you get 99% of the protection, at a fraction of the price.
Because RAID is so mature and reliable, there is an argument that, combined with its insane speed, cache protection options, and such, it will actually be safer than comparable application layer protections.
This is unfair, you are really comparing apples to oranges: in one case you have a completely shared-nothing cluster, in the other you are just protected from storage disk failure. What if the CPU/mobo/controller/PSU/etc. fails?
It may be apples and oranges, but that's where it starts. It's two very different things and under normal circumstances you never consider application replication unless you have RAID. RAID is cheap and really effective.
Although it might seem like apples and oranges, it's like turbocharging versus getting a bigger engine: very different techniques, same goal. Here it is two reliability techniques, one goal. The point is, of the two, 90% of the time RAID is actually more effective. It might SEEM like having all those other parts with extra redundancy would do a lot for you, but in the real world it doesn't do all that much. And the risks you take on by avoiding RAID will rarely be offset by all that extra redundancy.
Think of it like an airplane: the thing you want redundant is the engine. Sure, extra seats, steering wheels, wings, wheels, etc. all sound great, and if they are free then fine, but 95% of the time it is the engine that fails, not any of those things. So you can't get distracted by the "what if X happens" questions. You have to remain focused on the resultant reliability, and I think you will find that RAID is either around break even or even safer than a non-RAID general redundancy approach at the same redundancy level (single mirror, double mirror, etc.).