Windows Failover Clustering... what are your views and why?

scottalanmiller

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

How much data changes every day? Do you have 100 GB of changes per day? 1 TB?

Last time I ran Live Optics (a week or two ago), we were at around 6 TB of changes per day.

What are your drives warrantied at? What's the DWPD or whatever? The idea is they only need to last 5 years / X DWPD or whatever the period is they are rated for anyways.

1 DWPD. I'm not so worried about the writes. I just would like to avoid additional writes where not really needed.

What drive model are they? How many drives per server? What RAID level is being used? Is it hw/sw RAID? If hw, which card?

Starwind is software network RAID.

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

So, 100 GB is actually 300 GB.

Not for the drives themselves. I'm assuming some kind of hardware/software raid, so that 100GB gets split among the drives the data goes to according to RAID level.

Blocks that are accessed more frequently (read data) don't really count as much, as I'm sure there is caching in multiple places.

If I have a VM on the CSV using 100 GB, the whole point of having the vSAN is that every byte exists on the vSAN partners to avoid any downtime of failure. So, it really is copied entirely three times.

Damn, okay, so definitely RAID 1 Triple Mirror.

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

How much data changes every day? Do you have 100 GB of changes per day? 1 TB?

Last time I ran Live Optics (a week or two ago), we were at around 6 TB of changes per day.

What are your drives warrantied at? What's the DWPD or whatever? The idea is they only need to last 5 years / X DWPD or whatever the period is they are rated for anyways.

1 DWPD. I'm not so worried about the writes. I just would like to avoid additional writes where not really needed.

What drive model are they? How many drives per server? What RAID level is being used? Is it hw/sw RAID? If hw, which card?

PERC H740P with 8 GB cache. Drive: MTFDDAK1T9TDN.
14 per server, RAID 6.

So the result is RAID 16. So three copies of the data, each copy on RAID 6 SSD. That's a LOT of overhead. It's a bit more than the 3x you were stating. You get the capacity of 12 drives out of a pool of 42. That's about 71.4% overhead.
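As a rough check on that overhead figure, here is a minimal Python sketch. The function and parameter names are illustrative and not from the thread; it only assumes what was stated above: 14 drives per node, RAID 6 (two drives' worth of parity) on each node, and three mirrored copies.

```python
def usable_fraction(drives_per_node: int, parity_drives: int, mirror_copies: int) -> float:
    """Fraction of raw drive capacity left after per-node parity and node-level mirroring."""
    data_drives = drives_per_node - parity_drives   # drives' worth of unique data per node
    raw_drives = drives_per_node * mirror_copies    # every drive in the whole pool
    return data_drives / raw_drives


# Numbers from the thread: 14 drives per server, RAID 6, three-way replication.
frac = usable_fraction(drives_per_node=14, parity_drives=2, mirror_copies=3)
overhead_pct = round((1 - frac) * 100, 1)

print(f"usable: {frac:.1%}, overhead: {overhead_pct}%")
```

Twelve drives of unique data out of 42 total works out to roughly 28.6% usable, so a little over 71% of the raw capacity goes to protection.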

Jimmy9008

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

Another thought. CSV data is replicated to all three hosts. So, 100 GB is actually 300 GB. 1 TB is actually 3 TB. Why would it make sense to add VMs (applications) the company can sustain long downtime with to an area where it takes up 3x the space, on expensive SSDs? Why not put that application you don't care about on one host, where it takes up one lot of space, leaving the other space for things the company does care about...

Well, the logic would be... if the space is available, why waste it. If the space isn't available, seems like your choice is made for you by design.

The available space has been made for growth expectation over the next few years. If I make all of the local storage CSV and replicated, and copy all VMs (including the ones that aren't important) three times, we won't have that room to grow into... As it's used for storing the mirror of VMs that don't need HA. Just seems like a waste.

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

Data in the CSV will replicate over all three nodes because of Starwind. So, each write to a CSV is actually three fold. If all VMs are in CSV storage and writing to all three hosts, we could considerably lower the life of our SSDs.

Are you sure that this is the setup? If so, this is non-standard for Starwind (or anyone.) That's a triple mirror RAID 1 design. Really fast and really safe, but generally seen as overkill. Standard on Starwind would be three standard double mirror RAID 1 arrays, one between Node 1 and Node 2, one between Node 2 and Node 3, and one between Node 3 and Node 1. So any data would be replicated, but only once, not twice.
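To make the difference between the two layouts concrete, here is a small Python sketch. The node names and helper function are hypothetical; it only models how many copies of each volume exist, not how StarWind actually configures devices.

```python
from itertools import combinations

nodes = ["node1", "node2", "node3"]  # hypothetical node names

# Pairwise layout described above: one two-way mirror per pair of nodes
# (1-2, 1-3, 2-3), so each volume lives on exactly two nodes.
pairwise_mirrors = list(combinations(nodes, 2))

# Triple mirror: a single volume replicated to every node.
triple_mirror = [tuple(nodes)]


def usable_share(layout):
    """Share of mirrored capacity holding unique data (1 / copies per volume)."""
    copy_counts = {len(members) for members in layout}
    (copies,) = copy_counts  # every volume in a layout has the same copy count
    return 1 / copies


print(pairwise_mirrors)
print(f"pairwise: {usable_share(pairwise_mirrors):.0%} usable")
print(f"triple:   {usable_share(triple_mirror):.0%} usable")
```

In the pairwise layout every write lands on two nodes and half of the mirrored space is unique data; in the triple mirror every write lands on all three nodes and a third of it is.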

Yeah, we had that originally but moved to three node replication. Some local storage for vsan, rest for non vsan. The original plan (still is other than this internal argument) was to use the CSV for machines that need HA, and the rest of the local for machines that don't. Just some folk are pushing for making the whole lot of the local CSV storage and making all VM HA in the cluster...

Just trying to see if any reason why that's a good idea. Seems to be bad from all thought processes I have.

So really how I see it is...

The benefit is a simple, uniform process that wastes gobs and gobs of resources. But wasting resources is kind of how this is designed. This is way beyond nuclear device level protection. Triple Mirror RAID 16 is higher protection than anything I've ever seen, literally ever, including spinners. With enterprise SSD, this is through the roof. You should see data loss from RAID failure, and I'm just guessing, with millions of years between data loss events. Easily tens of millions of years. RAID 1 with spinners is immeasurably high, over 80,000 years. So you can imagine the "adding zeros" math.
This saves on resources and might improve performance. But that doesn't seem like a goal too much. Not that good business isn't always a goal, but it seems like someone has a bee in their bonnet about something that it might not be worth rocking the boat on. Keeping everything on the CSV with loads of overhead gives you the least to come back and bite you while all of the money lost from doing this is someone else's problem already.

Personally, if it was my company, I'd do #1. If I were an employee of a company doing this, I'd do #2.

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

Another thought. CSV data is replicated to all three hosts. So, 100 GB is actually 300 GB. 1 TB is actually 3 TB. Why would it make sense to add VMs (applications) the company can sustain long downtime with to an area where it takes up 3x the space, on expensive SSDs? Why not put that application you don't care about on one host, where it takes up one lot of space, leaving the other space for things the company does care about...

Well, the logic would be... if the space is available, why waste it. If the space isn't available, seems like your choice is made for you by design.

The available space has been made for growth expectation over the next few years. If I make all of the local storage CSV and replicated, and copy all VMs (including the ones that aren't important) three times, we won't have that room to grow into... As it's used for storing the mirror of VMs that don't need HA. Just seems like a waste.

Feels like a waste, but it easily might not be. There has been waste, that is certain. But we don't know if using the already wasted space is a problem or not.

If you feel that you can bypass the CSV for this, it seems like you could move to double mirroring and RAID 0, too. If you don't feel that you can change the RAID setup, it seems unlikely that you should opt out of it entirely.

Dashrender

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

d business isn't always a goal, but it seems like someone has a bee in their bonnet about somethings that it might no

Would you? or would you move it down to a 2 node setup (or the 3 node setup you mentioned above) and use the extra host for something else?

Jimmy9008

@Dashrender said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

d business isn't always a goal, but it seems like someone has a bee in their bonnet about somethings that it might no

Would you? or would you move it down to a 2 node setup (or the 3 node setup you mentioned above) and use the extra host for something else?

We do have some VMs that do require HA. That's why we built this system. The cost is totally within the budget and the system meets the needs.

The problem I have is that some people are pushing to just put all VMs on the clustered storage. Which is against the design. It wasn't designed to make everything HA, as everything doesn't need HA.

It was designed to make everything that needs HA get HA, with room for ability to run VMs, lots of them, that need no HA at all.

scottalanmiller

@Dashrender said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

d business isn't always a goal, but it seems like someone has a bee in their bonnet about somethings that it might no

Would you? or would you move it down to a 2 node setup (or the 3 node setup you mentioned above) and use the extra host for something else?

I don't know about the compute needs. I'm giving the benefit of the doubt that the CPU and RAM were properly sized.

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

We do have some VMs that do require HA. That's why we built this system. The cost is totally within the budget and the system meets the needs.

Two node is HA. Three node is the overkill

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

The problem I have is that some people are pushing to just put all VMs on the clustered storage. Which is against the design. It wasn't designed to make everything HA, as everything doesn't need HA.
It was designed to make everything that needs HA get HA, with room for ability to run VMs, lots of them, that need no HA at all.

The problem there, is that one group designed it for one purpose. Now the group deploying it is requesting that the design be used for another purpose.

Jimmy9008

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Dashrender said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

d business isn't always a goal, but it seems like someone has a bee in their bonnet about somethings that it might no

Would you? or would you move it down to a 2 node setup (or the 3 node setup you mentioned above) and use the extra host for something else?

I don't know about the compute needs. I'm giving the benefit of the doubt that the CPU and RAM were properly sized.

We can lose 1/3 of the hosts and the remaining VMs have plenty of room to run on the other 2/3 hosts. Tested.

Jimmy9008

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

The problem I have is that some people are pushing to just put all VMs on the clustered storage. Which is against the design. It wasn't designed to make everything HA, as everything doesn't need HA.
It was designed to make everything that needs HA get HA, with room for ability to run VMs, lots of them, that need no HA at all.

The problem there, is that one group designed it for one purpose. Now the group deploying it is requesting that the design be used for another purpose.

Indeed. It's been deployed and in use for months now. Just folk don't want to create VMs outside of the CSVs. Whereas I'm trying to get across that when deploying a VM, you need to do an analysis to decide if it needs HA. And if, and only if, it does, then put it on the CSV. Otherwise, pick one of the three and create the VM on local only. Hell, add the VM to the cluster for management, but keep it local. Sure, it won't failover, but we don't care if it can go down.

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Dashrender said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

d business isn't always a goal, but it seems like someone has a bee in their bonnet about somethings that it might no

Would you? or would you move it down to a 2 node setup (or the 3 node setup you mentioned above) and use the extra host for something else?

I don't know about the compute needs. I'm giving the benefit of the doubt that the CPU and RAM were properly sized.

We can lose 1/3 of the hosts and the remaining VMs have plenty of room to run on the other 2/3 hosts. Tested.

But could you lose 1/2 and keep running, that's the question.

Jimmy9008

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

@Dashrender said in Windows Failover Clustering... what are your views and why?:

@scottalanmiller said in Windows Failover Clustering... what are your views and why?:

d business isn't always a goal, but it seems like someone has a bee in their bonnet about somethings that it might no

Would you? or would you move it down to a 2 node setup (or the 3 node setup you mentioned above) and use the extra host for something else?

I don't know about the compute needs. I'm giving the benefit of the doubt that the CPU and RAM were properly sized.

We can lose 1/3 of the hosts and the remaining VMs have plenty of room to run on the other 2/3 hosts. Tested.

But could you lose 1/2 and keep running, that's the question.

Probably, actually. The HA VMs total around 600 GB of RAM used. Each server has 768 GB RAM. So, in theory, yes... We could go down to 1/2 hosts and be up. Not sure how the CPU would cope though, and there's not much room for growth of the HA VMs over the next 3 years or so.
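The RAM side of that can be sanity-checked with a few lines of Python. This is only a sketch using the two figures quoted above (600 GB of HA VM RAM, 768 GB per host); the function name and the optional host overhead parameter are assumptions, and CPU is not modelled at all.

```python
def ram_headroom_gb(ha_vm_ram_gb: float, ram_per_host_gb: float,
                    hosts_up: int, host_overhead_gb: float = 0.0) -> float:
    """RAM left over (GB) after placing all HA VMs on the hosts still running.

    A negative result means the HA VMs do not fit.
    host_overhead_gb is a hypothetical per-host reservation for the hypervisor.
    """
    available = hosts_up * (ram_per_host_gb - host_overhead_gb)
    return available - ha_vm_ram_gb


HA_VM_RAM_GB = 600      # from the thread
RAM_PER_HOST_GB = 768   # from the thread

for hosts_up in (1, 2):
    headroom = ram_headroom_gb(HA_VM_RAM_GB, RAM_PER_HOST_GB, hosts_up)
    print(f"{hosts_up} host(s) up: {headroom:.0f} GB headroom")
```

On a single surviving host the HA VMs fit with 168 GB to spare before any hypervisor overhead, which matches the "in theory, yes, but little room for growth" conclusion.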

Those VM spread over 2/3 have plenty of RAM available, and CPU, plenty of room for growth which is forecast.

If we lost those non-important VMs on the 3rd host, and Dell couldn't get the part to fix the 3rd for say a week or two, we would even have room on the 2/3 that are up to restore the non-critical VMs from Veeam.

We do back them all up. It's just my position is we lose a lot adding them to the cluster. Technically, we could start the failed VM from Veeam directly using that instant recovery feature.

It's just hard to justify not using all the CSV space 'as it's there'.... When the reason the space is there... Is for expected growth. If we use for these VM we don't really care about... We lose the ability to grow that actually needs HA.

I think it'll be fine. Just trying to get more logical reasoning why adding to the CSV is silly where HA is not a requirement.

scottalanmiller

@Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

It's just hard to justify not using all the CSV space 'as it's there'.... When the reason the space is there... Is for expected growth.

Just ask that people sign off on removing the investment in expected growth. Clearly someone approved a budget for that. Now something else is deemed more important. So get whoever signed off on investing in growth to approve their budget being taken by someone else.

Obsolesce

The thing is, they are VMs, you can move shit around whenever, to wherever, should the need arise.... and without any downtime if needed. I think each server/service/system should have its own "SLA" (it's almost midnight, can't think of the word now) and should be placed appropriately. Only you can answer whether or not it needs HA. You can do the math to figure out exactly what each GB of SSD capacity, vCPU, memory, etc. costs for HA placement versus non-HA and decide appropriately where to put the VM. I really don't think drive life is a concern here; you'll pull through the lifespan of the drive easily, or you won't because it's defective, which in that case doesn't matter anyways for your decision. So I don't think that's a factor here. It all comes down to math regarding costs vs what is being considered for placement.
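The drive-life point can be roughed out in Python. This is a sketch under loud assumptions: the whole 6 TB/day of changes lands on the triple-mirrored storage (so each node writes all of it), each drive is 1.92 TB (inferred from the "1T9" in the model number, not stated in the thread), and the RAID 6 write amplification lies somewhere between a full-stripe write (14/12) and a small write that rewrites one data block plus both parity blocks (3x). Controller caching is ignored.

```python
DAILY_CHANGE_TB = 6.0    # from the thread (Live Optics figure)
DRIVES_PER_NODE = 14     # from the thread
DRIVE_TB = 1.92          # ASSUMPTION: capacity inferred from the model number
RATED_DWPD = 1.0         # from the thread


def drive_writes_per_day(daily_tb: float, write_amplification: float,
                         drives: int, drive_tb: float) -> float:
    """Average full-drive writes per day for each drive in one node's array."""
    per_drive_tb = daily_tb * write_amplification / drives
    return per_drive_tb / drive_tb


# ASSUMPTION: bounds on RAID 6 write amplification in bytes written.
best = drive_writes_per_day(DAILY_CHANGE_TB, 14 / 12, DRIVES_PER_NODE, DRIVE_TB)
worst = drive_writes_per_day(DAILY_CHANGE_TB, 3.0, DRIVES_PER_NODE, DRIVE_TB)

print(f"best case:  {best:.2f} DWPD")
print(f"worst case: {worst:.2f} DWPD")
print(f"within rating: {worst < RATED_DWPD}")
```

Under those assumptions even the pessimistic case stays below the 1 DWPD rating, which is consistent with drive wear not being the deciding factor.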

Obsolesce

@Obsolesce said in Windows Failover Clustering... what are your views and why?:

You can do the math to figure out exactly what each GB of SSD capacity, vCPU, memory, etc. costs for HA placement versus non-HA and decide appropriately where to put the VM.

I did this at my last employer and you'd be surprised what "the business" deems worthy to put on an expensive setup and what they don't when you give them the numbers. My guess is that a lot of things won't end up having HA.
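A minimal sketch of what "giving them the numbers" might look like for storage alone, in Python. The price per raw GB is a made-up placeholder, the function name is illustrative, and vCPU and memory are left out; only the 14-drive RAID 6 layout and the three-way replication come from the thread.

```python
def cost_per_usable_gb(price_per_raw_gb: float, drives_per_node: int = 14,
                       parity_drives: int = 2, copies: int = 1) -> float:
    """Cost of one usable GB after RAID 6 parity and `copies` node-level replicas."""
    usable_fraction = (drives_per_node - parity_drives) / (drives_per_node * copies)
    return price_per_raw_gb / usable_fraction


PRICE_PER_RAW_GB = 0.50  # HYPOTHETICAL placeholder; substitute the real purchase cost
VM_SIZE_GB = 100         # example VM size used earlier in the thread

local_cost = cost_per_usable_gb(PRICE_PER_RAW_GB, copies=1)  # local RAID 6 only
ha_cost = cost_per_usable_gb(PRICE_PER_RAW_GB, copies=3)     # replicated to all three nodes

print(f"non-HA: {local_cost * VM_SIZE_GB:.2f} per {VM_SIZE_GB} GB VM")
print(f"HA:     {ha_cost * VM_SIZE_GB:.2f} per {VM_SIZE_GB} GB VM")
print(f"HA / non-HA ratio: {ha_cost / local_cost:.1f}x")
```

Whatever the real price is, the ratio stays at 3x for this layout, so the placeholder only affects the absolute figures shown to the business.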

dbeato

The way we have used StarWind VSAN is that we set it up on a Windows Failover Cluster with the VSAN storage as a CSV, synced across a minimum of two nodes. Then all the VMs are placed on the CSV unless you want a VM outside that CSV that is not important at all. With the VMs on the CSV you can have one or more servers down and your environment will continue to work without interrupting the VMs, as long as you have enough memory and CPU power to have all the VMs on one host or more.

Jimmy9008

@dbeato said in Windows Failover Clustering... what are your views and why?:

The way we have used StarWind VSAN is that we set it up on a Windows Failover Cluster with the VSAN storage as a CSV, synced across a minimum of two nodes. Then all the VMs are placed on the CSV unless you want a VM outside that CSV that is not important at all. With the VMs on the CSV you can have one or more servers down and your environment will continue to work without interrupting the VMs, as long as you have enough memory and CPU power to have all the VMs on one host or more.

That's what we pretty much have here. What's happening though, is some want to add ALL virtual machines to the CSV, even when they don't need HA. Like you, I want those to not be on CSV.