StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far)

DustinB3403

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

mixing everything on a single host is a bad idea.

What do you mean, mixing everything? The magic sauce is what makes tools like StarWind vSAN an amazing tool. It works with the hypervisor to manage all of your hosts from a single interface. Should any host go down, those resources are offline, but the VMs that may have been on there are moved to the remaining members of the HCI environment (of multiple physical hosts).

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

mixing everything on a single host is a bad idea.

No, it's separating it that is the bad idea. Separate means less performance and more points of failure. It's just like hardware and software RAID... when tech is new you need unique hardware to offload it; over time, that goes away. This has happened, at this point, with the whole stack. And it did long ago; there was just so much money in gouging people with SANs that every vendor clung to that as long as they could.

But putting those workloads outside of the server makes it slower, costlier, and riskier. There are really no benefits.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Also note, I always use the term HCI, not just HC, and I always mean it to be exactly what it is being sold as - a way of building virtualized infrastructure so that the shared storage in use, is provided by the same machines that host the workloads, off of their internal drives.

That's fine, but that's not HC or HCI. That's one vendor's product of it (or several). HC is not the property of a vendor, it's an architecture, and an old one that has been battle tested and logically is the only primary way to build systems.

DustinB3403

The easiest way I can think of to explain your rationale, @dyasny, is to pretend I'm building a server, but because I don't trust the RAID controller that I can purchase for my MB, I purchase a bunch of external disks, plug those into another MB and then attach that storage back to my server via iSCSI over the network.

How is this safer, more reliable and cheaper than just adding all of the physical resources into a single server? Then combining 2, 3 or however many of the identical servers together with some magic sauce and managing it from a single interface?

dyasny

@DustinB3403 said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

HCI isn't just shared storage. It's shared everything.

Great, so we are also running the SDN controllers on all the hosts. Even an OVN controller is a huge resource hog. A Neutron controller in Openstack is even worse. And then the big boys come in, have you tried to build an Arista setup?

I am not talking theory here, I'm talking implementation, as someone who built datacenters and both public and private clouds at scale. Running the entire stack on each host, along with the actual workload, is a horrible idea.

What do you mean, mixing everything? The magic sauce is what makes tools like StarWind vSAN an amazing tool.

Sounds like marketing bs to me, sorry. Magic sauce? Really?

It works with the hypervisor to manage all of your hosts from a single interface. Should any host go down, those resources are offline, but the VMs that may have been on there are moved to the remaining members of the HCI environment (of multiple physical hosts).

Sounds like any decently built virtualized DC solution, from proxmox to ovirt to vcenter and xenserver. How is it "magic" exactly?

The easiest way I can think of to explain your rationale, @dyasny, is to pretend I'm building a server, but because I don't trust the RAID controller that I can purchase for my MB, I purchase a bunch of external disks, plug those into another MB and then attach that storage back to my server via iSCSI over the network.

This is a ridiculous example. What you describe is that, instead of having a server with a disk controller, disks, GPU and NICs, I'd install a single card that is a NIC, a GPU and can store data. So instead of the PCI bus accessing each controller separately with better bandwidth, all the IO and different workloads are driven through a single PCI bus channel. And then I'd use "magic" to install several of those hybrid monster cards in the hopes of making them work better.

How is this safer, more reliable and cheaper than just adding all of the physical resources into a single server? Then combining 2, 3 or however many of the identical servers together with some magic sauce and managing it from a single interface?

There you go with the magic sauce koolaid again.

dyasny

@scottalanmiller said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

HC was always a thing, though, that's the thing. That it got buzz is different. We've had HC all along, just people didn't call it anything.

OK, just so we're on the same page here, are you saying we should simply install a bunch of localhosts and be done, for all the types of workloads out there?

No, it's separating it that is the bad idea.

No, it's mixing it that is the bad idea. See, I can also do this.

Separate means less performance and more points of failure.

It would seem so, but in fact, you already have to run those services (storage, networking, control plane) anyway, and they all consume resources, and a lot of them. And then you dump the actual workload on the same hosts as well, so either you simply have much less to assign to the workload and the services, or they have to compete for those resources. Either is bad, and when one host fails, EVERYTHING on it fails. So you have to not just deal with a storage node outage or a controller outage, or a hypervisor outage, but with all of them at the same time. How exactly is that better for performance and MTBF?

It's just like hardware and software RAID... when tech is new you need unique hardware to offload it; over time, that goes away. This has happened, at this point, with the whole stack. And it did long ago; there was just so much money in gouging people with SANs that every vendor clung to that as long as they could.

I'm not saying SANs are the answer to everything, I'm saying loading all the infrastructure services plus the actual workload on a host is insane. If you have a cluster of hosts providing FT SDN, another cluster providing FT SDS, and a cluster of hypervisors using those services to run workloads on the networking and storage provided, I'm all for it. This system can easily deal with an outage of any physical component, without triggering chain reactions across the stack. But this is just software defined infrastructure, not HCI.

But putting those workloads outside of the server makes it slower, costlier, and riskier. There are really no benefits.

Again, I don't care much for appliance-like solutions. A SAN or a Ceph cluster, I can use either, hook it up to my hypervisors and use the provided block devices. But if you want me to run the (just for example here) Ceph RBD as well as the VMs and the SDN controller service on the same host - I will not take responsibility for such a setup.

Dashrender

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

It would seem so, but in fact, you already have to run those services (storage, networking, control plane) anyway, and they all consume resources, and a lot of them. And then you dump the actual workload on the same hosts as well, so either you simply have much less to assign to the workload and the services, or they have to compete for those resources. Either is bad, and when one host fails, EVERYTHING on it fails. So you have to not just deal with a storage node outage or a controller outage, or a hypervisor outage, but with all of them at the same time. How exactly is that better for performance and MTBF?

@scottalanmiller - where is your - hypervisors are not basket and eggs - post?

scottalanmiller

@Dashrender said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

@scottalanmiller - where is your - hypervisors are not basket and eggs - post?

https://smbitjournal.com/2012/11/virtual-eggs-and-baskets/

dyasny

@Dashrender this is not about the basket/eggs thing. Consolidation is well and good, but HCI adds a massive load on each host, and the resources for that load have to come from somewhere. SDS is not easy and it does demand CPU, RAM and network resources. SDN is just as bad. Lump it all into the same host, and you've got nowhere to run VMs adequately, that's my point.

There's a very old joke - a man is pulled over by a policeman, as he was driving with one hand and hugging his girlfriend with the other. The policeman says "Sir, you are doing two things and both of them badly". This is exactly why HCI is wrong. Yes, if all you have is a single machine, you'll be lumping all your workloads on it, but if you are building a real datacenter, you'd better do the networking stack properly, using the right hardware: even if it's going to be some opensource SDN like Calico and not a suitcase of money sent to cisco, you should dedicate the correctly spec'd hardware to that, the same goes for the storage stack - you want to run on commodity hardware using opensource SDS software - be my guest, but dedicate those hosts to SDS and spec them out to fit the task, and the same goes for the workload-bearing machines, whether they will be KVM hypervisors or a docker swarm or an overpriced vmware cluster - that's immaterial. If you do the HCI thing, you cannot spec the hardware to the task, you end up running all of those services and workloads on the same set of hosts, and all those tasks will be sharing that hardware, either competing for resources, or cutting available un-utilized resources away from where they could be needed.

Yes, the nicer HCI systems can try to keep the data they serve balanced so that it is at least partially local to the workload, but in a properly built virtual DC this is not a problem. Infiniband, FC and even FCoE make latency moot, and throughputs can be much higher than over local SAS or even NVMe channels.
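
For anyone who wants to put numbers on the resource argument above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Host` sizes, the per-host service overheads, and the number of dedicated service hosts are hypothetical inputs, not measurements from any product. It only shows that the comparison hinges on how much overhead the storage and network services really need, which is exactly the point in dispute.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cores: int
    ram_gb: int


def converged_capacity(hosts, service_cores_per_host, service_ram_gb_per_host):
    """Cores and RAM left for VMs when every host also runs the
    storage/network services (the converged layout)."""
    cores = sum(max(h.cores - service_cores_per_host, 0) for h in hosts)
    ram = sum(max(h.ram_gb - service_ram_gb_per_host, 0) for h in hosts)
    return cores, ram


def dedicated_capacity(hosts, service_host_count):
    """Cores and RAM left for VMs when whole hosts are set aside
    for storage/network services (the separated layout)."""
    workers = hosts[service_host_count:]
    return sum(h.cores for h in workers), sum(h.ram_gb for h in workers)


# Hypothetical fleet: six identical hosts with 32 cores and 256 GB RAM each.
hosts = [Host(cores=32, ram_gb=256) for _ in range(6)]

# Assumed light overhead: 2 cores and 16 GB per host for services.
low_overhead = converged_capacity(hosts, 2, 16)

# Assumed heavy overhead: 12 cores and 96 GB per host for services.
high_overhead = converged_capacity(hosts, 12, 96)

# Separated layout: two of the six hosts run only services.
separated = dedicated_capacity(hosts, 2)

print("converged, light services:", low_overhead)
print("converged, heavy services:", high_overhead)
print("separated, 2 service hosts:", separated)
```

With the light assumption the converged layout leaves more room for VMs than the separated one; with the heavy assumption it leaves less. Measuring the real overhead on the stack in question is what settles it.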

DustinB3403

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

a man is pulled over by a policeman, as he was driving with one hand and hugging his girlfriend with the other. The policeman says "Sir, you are doing two things and both of them badly"

That's a joke?

scottalanmiller

@DustinB3403 said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

a man is pulled over by a policeman, as he was driving with one hand and hugging his girlfriend with the other. The policeman says "Sir, you are doing two things and both of them badly"

That's a joke?

LOL, yes.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

but HCI adds a massive load on each host,

This is the myth. In most HCI it adds no appreciable load. As long as you believe that things like storage and networking are going to create a lot of load, yes, this is going to seem like a point of risk, although even then things like RAID cards fixed that in the era where that was true.

But since it doesn't add load, and actually adds less load than splitting it out, this logic is backwards.

dyasny

@DustinB3403 said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

That's a joke?

You should have seen my British friend tell it.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

SDS is not easy and it does demand CPU, RAM and network resources. SDN is just as bad. Lump it all into the same host, and you've got nowhere to run VMs adequately, that's my point.

SDS isn't part of HC. This might be a root of your confusion. This is why some HC, like the one this thread is about, doesn't do this and just does RAID. Overhead is ridiculously low.

dyasny

@scottalanmiller said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

This is the myth. In most HCI it adds no appreciable load. As long as you believe that things like storage and networking are going to create a lot of load, yes, this is going to seem like a point of risk, although even then things like RAID cards fixed that in the era where that was true.

But since it doesn't add load, and actually adds less load than splitting it out, this logic is backwards.

I already answered that above. Just because you say it doesn't add any load, doesn't mean it doesn't.

SDS isn't part of HC. This might be a root of your confusion. This is why some HC, like the one this thread is about, doesn't do this and just does RAID. Overhead is ridiculously low.

How exactly do they deal with the HA side of things? With RAID, and a host going down, all the VMs using that host go down, RAID or not.

DustinB3403

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

@DustinB3403 said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

That's a joke?

You should have seen my British friend tell it.

I don't really get it, but okay.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

even if it's going to be some opensource SDN like Calico and not a suitcase of money sent to cisco, you should dedicate the correctly spec'd hardware to that, the same goes for the storage stack - you want to run on commodity hardware using opensource SDS software - be my guest, but dedicate those hosts to SDS and spec them out to fit the task, and the same goes for the workload-bearing machines,

This just doesn't hold up in the real world. Are there cases where you'd want this? Absolutely. But in general? No. Most companies and workloads are not trying to do things where this makes sense at all. Giant storage pools, on-host SDS, etc. These are the things that, while they have their place, are almost exclusively just vendors gouging customers who don't realize that this stuff isn't helping them.

All that SDS like this is doing is making new SAN pools, right back to the complexity, costs, and risks that we had before.

If your design creates all this overhead, whether you address it in one box or many, chances are that design itself is the flaw. Not always, but generally. But whether you have RAID or RAIN, the overhead to do this stuff just isn't there when implemented well. Now sure, if we only look at totally garbage solutions, we can make any design seem like a problem. But you have to separate good design from good products. Bad products exist even in well designed solutions.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

I already answered that above. Just because you say it doesn't add any load, doesn't mean it doesn't.

Right, but the fact that we measure it and the load isn't there means that it isn't there. Just because you claim a load that no one has, which is all you are doing, doesn't make it real. You have created problems that no one faces and are acting like we are all impacted by them.

You are literally claiming, contrary to all evidence, common sense, and industry knowledge, that software RAID is not just a huge load, but so large that we now need not only hardware RAID cards to do it, but entire hardware RAID servers!

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

How exactly do they deal with the HA side of things? With RAID, and a host going down, all the VMs using that host go down, RAID or not.

Um, RAID makes a copy. So both machines have identical copies; RAID 1 is just mirroring - they even mirror the RAM cache. There's no magic. I think you are not aware of what HC products are like in the real world and are making loads of assumptions. My guess is you've seen Nutanix, the one god awful total failure of an HC product (that didn't even exist until I had been consulting on HC for many years), and are assuming that what they do badly or wrong is part of HC when it is just a Nutanix thing.

Many HC products use network RAID, not RAIN, and have none of the issues you are picturing. And many RAIN implementations, like SCRIBE, don't have them either. No one is disputing that Nutanix is bad or has these problems; we are explaining to you that your limited interpretation of HC, which is different from what it means to anyone else, is making it seem like these problems are endemic when, in fact, they aren't even likely. You are basically defining HC to mean Nutanix, when to everyone else, Nutanix isn't even a player, just a complete joke that exists only for marketers to screw the unprepared.

You should really research the field and products before making these wild claims. It's trivial to show that your assumptions can't be true, because we can demonstrate it. No one is denying that bad products are bad, but you are claiming that good products can't be good because you once saw a bad one. That's like thinking all hamburgers are bad because you once ate at a McDonald's, but that's hardly the only hamburger, let alone the reference example.
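
As a rough illustration of the mirroring point above, here is a minimal Python sketch of a two-node synchronous mirror (network RAID 1). It is a toy model under stated assumptions, not how StarWind or any other product is implemented: `Node`, `MirroredVolume`, and the host names are hypothetical, and cache mirroring, resync after a node returns, and split-brain handling are all left out.

```python
class Node:
    """One host holding a full replica of the volume (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}  # logical block address -> data


class MirroredVolume:
    """Two-node synchronous mirror: a write completes only after every
    online replica has stored the block."""

    def __init__(self, first, second):
        self.nodes = [first, second]

    def write(self, lba, data):
        targets = [n for n in self.nodes if n.online]
        if not targets:
            raise OSError("no online replica")
        for node in targets:
            node.blocks[lba] = data
        return len(targets)  # number of replicas that hold the write

    def read(self, lba):
        for node in self.nodes:
            if node.online:
                return node.blocks[lba]
        raise OSError("no online replica")


host_a, host_b = Node("host-a"), Node("host-b")
volume = MirroredVolume(host_a, host_b)

replicas_written = volume.write(0, b"vm-disk-block")

# Simulate host-a failing: the surviving host already has an identical copy,
# so a VM restarted there can read the same block.
host_a.online = False
survivor_data = volume.read(0)
print(replicas_written, survivor_data)
```

The point of the sketch is only that, with a synchronous mirror, the surviving host already holds the data when its partner goes down, so the hypervisor can restart the affected VMs there.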

DustinB3403

@scottalanmiller said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

but entire hardware RAID servers!

Not only Hardware RAID Servers, but separate dedicated network stacks, and compute pools as well.