ESXi cluster, advice needed

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

we have invested time and money on vmware hypervisor and our poor IT would not like to throw this away.

This is a business fallacy called the "sunk cost fallacy." Learning a hypervisor is a trivial matter. And moving to simple solutions is a long term investment. IT shouldn't "want" to do anything, they should simply be focused on what's best for the business. That, alone, defines what IT's job is.

https://en.wikipedia.org/wiki/Sunk_cost

Learning another product is just a few hours for those that don't know virtualization, and often "zero" time because it is so easy.

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

i understand that all above mentioned arguments are usually trivial in competitive environments, but unfortunately this is not our case.

It really is. You can't be stuck in a case where good options aren't available. They might be refused because of politics, but there is a difference between choosing one thing, and being stuck with only one thing as an option.

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

BTW what happens if a single node, or a node with local storage is lost?

This is a bad way to think about risk. You need to look at the whole, not "what if" scenarios to understand risk. Looking at a "what if" makes you do really bad things. "What if a meteor hits?" would make you put backups on Mars, for example. That's not realistic.

This is what backups are for. Having a standalone node doesn't mean you have no backups nor that you don't have something to restore to. So what happens if a single node is lost? You keep running on another node, restored from backup. This is how most companies in the world handle it, and they do it because it's an extremely cost effective, and safe pattern. It requires the least investment, and the least IT knowledge, and has the least chance of failing due to complexity.

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

isn't that a potential cause for filesystem corruption?

Any filesystem can get corruption. But these days, that's rare. That's mostly a 1990s and 2000s problem. By 2005, production filesystems, even on Windows, are so stable that we don't really see this. Not that it can't happen. But it used to be common, now it's something most pros won't see in a lifetime.

That said, having combined storage like VSAN, CEPH, etc. makes this far more likely because there is so much more complexity in the storage layer. Standalone, again, protects you (just a tiny bit) here by lowering the complexity and making the basics more reliable.

Remember, a brick is simple and almost never fails. It's hard to engineer any structure that, through redundancy or complexity, is more reliable than a brick because even though a brick is simple and singular, it's just insanely reliable. Standalone systems are more like a brick than anything. Also, like bricks, stand alone approaches are cheap.

Bricks with backups are hard to beat.

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

Moreover, how do i put the host in maintenance mode (Hmmm, and why should i do that if i only have one host, especially with let's say free esxi?)?

Simply put, you don't. Let's use KVM as an example. And a stand alone setup, as well. Very few businesses, essentially none, have zero potential downtime options. Downtime is actually very cheap and an important part of protecting against the real problem - unplanned downtime. Regular, small planned downtime is generally free or essentially free, heck even Wall St. does it this way to save money, and you don't need maintenance modes. You just reboot (or whatever.)

If you truly need zero downtime, then that cannot be addressed by the virtualization layer and this conversation is moot, because that means you need full redundancy at the application layer (otherwise you can't patch your operating systems, databases, applications, etc.). At that point you can do maintenance modes the only viable way, at the application layer, and stand alone hypervisors make no difference.

So if you need non-stop 24x7 operations, or you don't, in both cases you can do it effectively with stand alone nodes.

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

we already have these VMs hosted in a 4-node flexpod environment (if vmware enterprise plus is an overkill for us, then how would you judge flexpod???).

This is unfortunate. Kind of the "worst of the worst". There is no way to really sugar coat this. It's the worst hardware on the market (Cisco), with a rather poor storage layer (NetApp), with an unnecessarily expensive hypervisor (VMware) where you end up with really, really high cost and hardware/setup that many of us would want to just throw in the trash.

Does it work? Well, kinda. Chances are the cost of this one purchase alone would have paid to hire an IT department to solve the bigger problems and implement a simpler, more efficient, more reliable solution that addresses your needs, rather than just empties your coffers.

This is, unfortunately, a setup designed specifically to prey on companies that think that they have to buy "products" instead of expertise and that they can skip IT. But it doesn't work that way. To quote VMware themselves "high availability is something you do, not something that you buy." Even the companies that sell this product don't believe that this is in any way a substitute for getting access to IT resources that are going to look at your needs and engineer a solution based around them.

We all understand that this means that you have now already purchased all of this and that there is no way to fix that. The only thing you can do now is look at the scale of this mistake and use it as a learning exercise to go back to your company and try to address the broken thought processes that brought them to what should have been an obvious "never do this" scenario. This suggests that they likely do the text book "never do this in business" things of engaging sales people, resellers and the vendors asking how they can spend money rather than getting business experts to actually figure out what the needs are, and what would address them. The goal had to be "how do we spend money", not "how do we solve a business need." There is a massive opportunity for improvement here, but it won't help until "next time." But there will be a next time, so this lesson is insanely important to learn.

That said, though, you are asking how to fix this. You've figured out that this setup isn't good. That's the first step. You know that you need reliable storage instead of the RAID 4 NAS device single point of failure, that's good. I would step all the way back and consider all of it a waste and look at "how best to move forward" based on what you own, and try to remove IT's emotions from it because it is what it is, and those emotions will only hurt the company (and IT itself) long term.

scottalanmiller

@IRJ said in ESXi cluster, advice needed:

First of all, you need to move every service you can to SaaS and PaaS solutions. Get rid of your database servers on prem and put them on a PaaS solution.

These are good things to consider. But we have no way to know if they have any option to do these things. There is never a possibility of "never have on prem databases." Prem vs. hosted is always based on business needs, and neither "always on prem" nor "always hosted" will ever be as simple as one way being the right way. This is always, no exceptions, something that has to be analyzed and determined.

Sure, there is excellent potential for hosted to be a good option, but there is every possibility that it's not even viable. Speaking as an MSP with hundreds of customers, the average customer cannot use a hosted database or any hosted data store; it's not even a possible option, let alone a viable one.

scottalanmiller

@IRJ said in ESXi cluster, advice needed:

Why do you want to manage infrastructure especially if you are short staffed in IT?

The real question is, why is IT short staffed? It's impossible (literally, impossible) that staffing IT can't be done since we know that they have money. A lot of money. There is nothing safer or more cost effective than properly staffed IT. In fact, IT's proper staffing level is actually defined by the level that makes the most money.

So the fundamental question is... why is the business being put at risk and money being thrown away while avoiding staffing IT?

(Staffing IT can mean hiring internally or outsourcing, however it is handled.)

scottalanmiller

@stacksofplates said in ESXi cluster, advice needed:

My point is, you can't just say "this is available through open source tools" and expect people to be able to do a setup like that from scratch with no experience.

This point is correct. What we know is that there are tools (open, free, low cost, closed, high cost, etc.) that can do this and plenty of money to hire the IT expertise to do it. There are alternative approaches that make everything easier (going with Scale, for example) and ones that are cheaper (hiring IT experts to dig into needs and do what needs to be done), etc.

The real underlying point that I think is important is that getting good IT is cheap. Nothing is cheaper. Anything that is done that deviates from having the right IT resources is going to be more costly in the long run. Maybe more costly because things take longer to get done, maybe because money is spent where it doesn't need to be, maybe by there being risks that didn't need to be there. IT's one purpose is to make the company money. Intentionally "not making money" is a crazy business approach.

rtfm

@IRJ said in ESXi cluster, advice needed:

First of all, you need to move every service you can to SaaS and PaaS solutions. Get rid of your database servers on prem and put them on a PaaS solution. Why do you want to manage infrastructure especially if you are short staffed in IT?

this is public sector and it is not allowed yet.

rtfm

@travisdh1 said in ESXi cluster, advice needed:

@rtfm said in ESXi cluster, advice needed:

hi everybody!
first of all thank you for your contribution. Keep it simple...
However, i did not mention (intentionally, no offence, i will explain myself) the following facts:

we already have these VMs hosted in a 4-node flexpod environment (if vmware enterprise plus is an overkill for us, then how would you judge flexpod???).

our organization is rich in terms of money but poor in terms of IT intellectual capital. Therefore we need outsourced support. in our place it is hard to find that, so we usually address ourselves to certified solutions.

we have invested time and money on vmware hypervisor and our poor IT would not like to throw this away.

the initial question was supposed to refer to a DRS solution based on vmware SRM (VM based replication, not array based), however years after studying your recommendations i would like to try something more simple.

sorry for wasting your time. i am thankful to you for your recommendations. BTW what happens if a single node, or a node with local storage is lost? isn't that a potential cause for filesystem corruption?

Just a waste of money

What region of the world are you located in?

This is just plain bad thinking. IT by its nature is always changing. Learning something different should be very quick; resisting change just because you already know something is the opposite of what IT should be doing.

Also just a waste of money

Your statement about competitive environments doesn't make any sense. Many of the solutions mentioned are open source and available to anyone with an internet connection.

Assuming you are running with at least the 3-node minimum and a single node is lost, nothing happens. Once the node is put back online, everything is automatically handled in the background for you (Starwind, Gluster, Ceph, Scale).

No need for a "maintenance mode". Updates are handled without the need of a reboot, but we still recommend power cycling everything on a regular basis.

hello,

agree. but have you ever thought that money might not be a problem at all in some cases?
balkan peninsula...europe.
what if you are responsible for the services and the infrastructure and you don't have other IT staff and you are not allowed to hire?
see "1"

As far as "competitive" is concerned, let's just say that IT integration without vendor support in greece is not much different than gambling in terms of reliability.
About the technical part solely: i thought you had proposed one server, now they have become three?

rtfm

@Dashrender said in ESXi cluster, advice needed:

@rtfm said in ESXi cluster, advice needed:

hi everybody!
first of all thank you for your contribution. Keep it simple...
However, i did not mention (intentionally, no offence, i will explain myself) the following facts:

we already have these VMs hosted in a 4-node flexpod environment (if vmware enterprise plus is an overkill for us, then how would you judge flexpod???).

our organization is rich in terms of money but poor in terms of IT intellectual capital. Therefore we need outsourced support. in our place it is hard to find that, so we usually address ourselves to certified solutions.

we have invested time and money on vmware hypervisor and our poor IT would not like to throw this away.

the initial question was supposed to refer to a DRS solution based on vmware SRM (VM based replication, not array based), however years after studying your recommendations i would like to try something more simple.

i understand that all above mentioned arguments are usually trivial in competitive environments, but unfortunately this is not our case.

sorry for wasting your time. i am thankful to you for your recommendations.
BTW what happens if a single node, or a node with local storage is lost? isn't that a potential cause for filesystem corruption?
Moreover, how do i put the host in maintenance mode (Hmmm, and why should i do that if i only have one host, especially with let's say free esxi?)?

?

Why is your organization poor on IT capital? Why not hire consultants to do it? No reason to have them be on staff, is there?

This is the sunk cost fallacy. That money is already spent, consider it gone and move forward with the most cost effective solution - that said, sometimes where you already are is the most cost effective when all aspects are considered - at least until a full overhaul is required.

?

it's plain old fashioned public sector, the last IT hired was 15 years ago...

rtfm

@scottalanmiller said in ESXi cluster, advice needed:

@rtfm said in ESXi cluster, advice needed:

our organization is rich in terms of money but poor in terms of IT intellectual capital. Therefore we need outsourced support. in our place it is hard to find that, so we usually address ourselves to certified solutions.

This is a misunderstanding of markets. There is no such thing as a place with hard to get IT. IT has no location and there are essentially unlimited numbers of available excellent resources ready to assist any business. Businesses simply choose not to look for or hire them and instead hire sales people who screw them and hide their costs in "products" rather than honest or qualified advice. Every business should have outsourced support; almost no company is big enough to have all the right people internally. But no business is in a location or situation where it can't get good people.

Certified solutions is really just a way to say "expensive products that are focused on resellers" or, another way, bad solutions that cost you far more to operate. They are channel products designed beginning to end to take advantage of this mindset and to get as much money out of companies that believe this as possible. It's an extremely common and effective game that they play.

If your company doesn't know how to achieve this, then the first thing that they need is a real outsourced CIO. A good CIO will save you a fortune in hours. Running without one is financially reckless.

i agree 100% and i am looking forward to convert your theory into practice.

rtfm

@scottalanmiller said in ESXi cluster, advice needed:

@rtfm said in ESXi cluster, advice needed:

we already have these VMs hosted in a 4-node flexpod environment (if vmware enterprise plus is an overkill for us, then how would you judge flexpod???).

This is unfortunate. Kind of the "worst of the worst". There is no way to really sugar coat this. It's the worst hardware on the market (Cisco), with a rather poor storage layer (NetApp), with an unnecessarily expensive hypervisor (VMware) where you end up with really, really high cost and hardware/setup that many of us would want to just throw in the trash.

Does it work? Well, kinda. Chances are the cost of this one purchase alone would have paid to hire an IT department to solve the bigger problems and implement a simpler, more efficient, more reliable solution that addresses your needs, rather than just empties your coffers.

This is, unfortunately, a setup designed specifically to prey on companies that think that they have to buy "products" instead of expertise and that they can skip IT. But it doesn't work that way. To quote VMware themselves "high availability is something you do, not something that you buy." Even the companies that sell this product don't believe that this is in any way a substitute for getting access to IT resources that are going to look at your needs and engineer a solution based around them.

We all understand that this means that you have now already purchased all of this and that there is no way to fix that. The only thing you can do now is look at the scale of this mistake and use it as a learning exercise to go back to your company and try to address the broken thought processes that brought them to what should have been an obvious "never do this" scenario. This suggests that they likely do the text book "never do this in business" things of engaging sales people, resellers and the vendors asking how they can spend money rather than getting business experts to actually figure out what the needs are, and what would address them. The goal had to be "how do we spend money", not "how do we solve a business need." There is a massive opportunity for improvement here, but it won't help until "next time." But there will be a next time, so this lesson is insanely important to learn.

That said, though, you are asking how to fix this. You've figured out that this setup isn't good. That's the first step. You know that you need reliable storage instead of the RAID 4 NAS device single point of failure, that's good. I would step all the way back and consider all of it a waste and look at "how best to move forward" based on what you own, and try to remove IT's emotions from it because it is what it is, and those emotions will only hurt the company (and IT itself) long term.

hi,
apart from the tremendous contribution from a generic point of view, would you suggest a 2-node setup with local datastores and a fast network and that's it, given that we take as proper backups as possible? BTW we also have veeam B & R standard edition.

what about the vmware SRM? what do you think of it?

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

@Dashrender said in ESXi cluster, advice needed:

@rtfm said in ESXi cluster, advice needed:

hi everybody!
first of all thank you for your contribution. Keep it simple...
However, i did not mention (intentionally, no offence, i will explain myself) the following facts:

we already have these VMs hosted in a 4-node flexpod environment (if vmware enterprise plus is an overkill for us, then how would you judge flexpod???).

our organization is rich in terms of money but poor in terms of IT intellectual capital. Therefore we need outsourced support. in our place it is hard to find that, so we usually address ourselves to certified solutions.

we have invested time and money on vmware hypervisor and our poor IT would not like to throw this away.

the initial question was supposed to refer to a DRS solution based on vmware SRM (VM based replication, not array based), however years after studying your recommendations i would like to try something more simple.

i understand that all above mentioned arguments are usually trivial in competitive environments, but unfortunately this is not our case.

sorry for wasting your time. i am thankful to you for your recommendations.
BTW what happens if a single node, or a node with local storage is lost? isn't that a potential cause for filesystem corruption?
Moreover, how do i put the host in maintenance mode (Hmmm, and why should i do that if i only have one host, especially with let's say free esxi?)?

?

Why is your organization poor on IT capital? Why not hire consultants to do it? No reason to have them be on staff, is there?

This is the sunk cost fallacy. That money is already spent, consider it gone and move forward with the most cost effective solution - that said, sometimes where you already are is the most cost effective when all aspects are considered - at least until a full overhaul is required.

?

it's plain old fashioned public sector, the last IT hired was 15 years ago...

Ah, that puts it in perspective.

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

would you suggest a 2-node setup with local datastores and a fast network and that's it, given that we take as proper backups as possible

Under most conditions, yes. Fast, easy, effective. If you need absolute uptime, do so higher in the stack at the application level. If you just need really good uptime, take excellent backups, be able to restore really quickly, and probably keep a recent (like hours old) "copy" on the second host so that you can spin up in minutes.

Two stand alone hosts with good backups is often a "couple minutes" of downtime during a crisis solution, instead of a few milliseconds. Sure, a few minutes is relatively long compared to a few milliseconds, but to most businesses (and certainly most governments) a few minutes of downtime every few years doesn't matter at all.
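To put rough numbers on "a few minutes every few years", here is a small sketch of the arithmetic. The figures (5 minutes of restore time, 50 milliseconds of failover, one incident in 3 years) are made-up assumptions for illustration, not measurements from any real setup, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years for simplicity


def availability(downtime_minutes: float, period_years: float) -> float:
    """Return the fraction of the period during which the service was up."""
    total_minutes = period_years * MINUTES_PER_YEAR
    return 1.0 - downtime_minutes / total_minutes


# Assumed scenario: one incident in 3 years.
# Stand alone hosts restored from backup: about 5 minutes down.
standalone = availability(5, 3)
# Clustered failover: about 50 milliseconds down, converted to minutes.
clustered = availability(0.05 / 60, 3)

print(f"standalone: {standalone:.5%}")
print(f"clustered:  {clustered:.5%}")
```

Under these assumptions both figures come out above 99.999%, so the gap between them only matters if the business can actually put a price on those few minutes.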

rtfm

@scottalanmiller said in ESXi cluster, advice needed:

@rtfm said in ESXi cluster, advice needed:

would you suggest a 2-node setup with local datastores and a fast network and that's it, given that we take as proper backups as possible

Under most conditions, yes. Fast, easy, effective. If you need absolute uptime, do so higher in the stack at the application level. If you just need really good uptime, take excellent backups, be able to restore really quickly, and probably keep a recent (like hours old) "copy" on the second host so that you can spin up in minutes.

Two stand alone hosts with good backups is often a "couple minutes" of downtime during a crisis solution, instead of a few milliseconds. Sure, a few minutes is relatively long compared to a few milliseconds, but to most businesses (and certainly most governments) a few minutes of downtime every few years doesn't matter at all.

From the requirements perspective i totally agree to your words! Considering our poor in-house human resources an RPO of even one day is totally satisfactory and acceptable! In that case i understand that there is no actual need to mess with HA, FA, SRM etc...
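As a rough illustration of what a one day RPO means in practice: the data you can lose is the time between the last good backup and the failure. The sketch below checks that against a target. The function name, the dates, and the nightly 01:00 schedule are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the time since the last good backup is within the RPO."""
    data_loss_window = failure_time - last_backup
    return data_loss_window <= rpo


# Assumed scenario: nightly backup finished at 01:00, host lost at 17:30 the same day.
last_backup = datetime(2024, 1, 10, 1, 0)
failure = datetime(2024, 1, 10, 17, 30)

print(meets_rpo(last_backup, failure, timedelta(days=1)))   # True: 16.5 hours lost
print(meets_rpo(last_backup, failure, timedelta(hours=4)))  # False: would need more frequent backups
```

With a nightly backup, a one day RPO is met in the normal case; it is only a tighter target (a few hours) that would push you toward more frequent backups or a replica on the second host.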

BTW your comment about how "system integrators" try to sell bare metal and certifications instead of brains was more than appropriate!

scottalanmiller

@rtfm said in ESXi cluster, advice needed:

BTW your comment about how "system integrators" try to sell bare metal and certifications instead of brains was more than appropriate!

It's where all the money is. As a consultancy... I can sell you a full IT department for a year. But I have to pay those people and my profits are small, but they can provide you with a wealth of work, advice, etc. building, maintaining, and supporting whatever you need.

But for the same money, I could just sell you a product that you don't need, that sounds good but doesn't do a good job for you, and earn easily triple the profits because I don't need to pay staff.

So the challenge is, if I resell those products, how do I make myself "do the right thing" when the customer is literally paying me only if I screw them over? The customer effectively demands, through how they pay and choose solutions, to only get the bad solutions. As a company, it's all but impossible to resist selling the product because the customer never knows the difference, and you earn so much more money as a salesperson than by providing sound IT. That's why companies like NTG and Bundy Associates simply don't sell any product at all, so that the incentive to do so isn't there at all. Because if it were, it would be all but irresistible.