Xen Server 6.5 + Xen Orchestra w. HA & SAN

scottalanmiller

For HA, two nodes.

It comes down to understanding the hardware and the MTBF? Understanding the common failures of said server/generation/caveats?

Understanding risk or percentage chance as to an actual hardware failure which would result in a node going down, and the need for a fail-over scenario

I see those as logical talking points and reasoning when looking at HA and whether it's a need or not

Not really. Those are background noise.

All HA discussions, all... no exceptions, come down to two factors and two factors alone...

How much money is lost from different downtime events.
How much money can be spent to mitigate those events to what level.

That's it. That's the whole thing. You are currently skipping the first step entirely. Your CFO, even if it is his first day working in any business, should be all over that. Even a business intern from college should be all over the fact that the "risk" isn't being considered at all. Risk is measured in dollars and no one has thrown out a figure yet. So the very first thing you need to have this discussion... in fact the only thing that should trigger this discussion... is missing.

ntoxicator

@scottalanmiller said:

@ntoxicator said:

@scottalanmiller

CEO/CFO & management will not purchase new hardware unless they're certain it'll last for 5+ years and handle the load of 500+ employees by year 2020. All by company projections and their hiring needs/growth rate statistics.

better than to think of ANY financial investment in terms other than "what is best for the business." They are getting emotional and emotional C levels drive companies in the ground.

I think you made a point here and the comment around "What is best for the business". They're trying to also operate on a cash basis... also building out a new 70,000 - 112,000 sq ft. facility to house all new employees, to be built by end of 2017. I personally doubt their timeframe. I have my doubts

I've tried to explain my case and even a new single server approach (updated hardware & specs) to hold our VM resource needs... always gets brushed off. Happening for past year.

However, they have recently asked that I find pricing/quotes for equipment and see what other companies of 'our size' are doing and having success with.

scottalanmiller

I'm not saying that HA is wrong for you, I've never said that once. What I've said is that nothing you've said has supported the idea that HA is what you need, but a lot of what you have said, and more that you are stating now, suggests that you do not.

If there is a reason for HA, it's not been mentioned yet.

scottalanmiller

@ntoxicator said:

I think you made a point here and the comment around "What is best for the business".

Which would never involve "we have to spend today for five years in the future." It would always be "we need to buy today what is best for the business." Forcing a five year investment in technology is not how you keep "what is best" in mind and goes against basic good practices.

http://www.smbitjournal.com/2012/10/you-arent-gonna-need-it/

ntoxicator

@scottalanmiller said:

I'm not saying that HA is wrong for you, I've never said that once. What I've said is that nothing you've said has supported the idea that HA is what you need, but a lot of what you have said, and more that you are stating now, suggests that you do not.

If there is a reason for HA, it's not been mentioned yet.

Understood

I have an email and documentation going to the CEO/CFO & management team, asking them what downtime is worth to them, or the cost of downtime, compared to the cost of having infrastructure to mitigate that. Also asked about their expectations, as well as ROI and TCO

scottalanmiller

@ntoxicator said:

They're trying to also operate on a cash basis... also building out a new 70,000 - 112,000 sq ft. facility to house all new employees, to be built by end of 2017.

And even while doing so they feel the need to overspend today in the hopes of it being useful tomorrow? If they are doing big growth projects, it seems like being bold on unnecessary spending early would not make sense. Especially as growing later or adding HA later would be cheaper than doing so today. So the investment today is just setting money on fire unless it is needed right away.

scottalanmiller

@ntoxicator said:

I've tried to explain my case and even a new single server approach (updated hardware & specs) to hold our VM resource needs... always gets brushed off. Happening for past year.

Do you feel that that would be the actions of a company that needs HA?

Step back and look at this from the outside... without knowing that someone is demanding HA, everything suggests the opposite. Business leaders being callous with spending, treating IT a bit as an afterthought, running for years with overspending and underproducing... would that lead to a left-field jump to HA? It might, but not likely.

scottalanmiller

@ntoxicator said:

However, they have recently asked that I find pricing/quotes for equipment and see what other companies of 'our size' are doing and having success with.

Also bad signs. Not that getting an idea of what companies "your size" are doing is necessarily bad. But two really critical things...

Every company is unique, and if they weren't, they would have no value.
Most companies run IT horribly and you want to avoid what most do.

If you look at SW, we've been showing "average companies your size" what a bad idea HA is for them for years. I've literally had something like only 1-2% come back with the numbers and turn out that HA actually was for them. 98% are just running their businesses poorly.

ntoxicator

@scottalanmiller

I greatly appreciate your brutal honesty in your replies. I suppose I took an emotional response as jabs at me. But your outside view plays a key role in your driven responses.

scottalanmiller

@ntoxicator said:

I have an email and documentation going to the CEO/CFO & management team, asking them what downtime is worth to them, or the cost of downtime, compared to the cost of having infrastructure to mitigate that.

Make them put it into calculation numbers. Meaning....

$5/minute or $10,000/hour.

It should be complex in 99% of cases, not just what I said above. It might be like this...

Timeline:

0-10 minutes: $0
10 minutes - 2 hours: $50,000
Then $12,000/hr
Until 48 hours, then $5K/hr
At one week it shoots to $500K and we are likely out of business
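A timeline like this can be turned into a function so that any outage length maps to a dollar figure. The following is only a rough sketch of one way to read the example above: it assumes the $50,000 is a one-time hit once the outage passes 10 minutes, that the hourly rates accrue on top of it, and that the $500K at one week is an additional one-time loss. The function name `downtime_cost` and all of those interpretations are assumptions for illustration, not figures from any real business.

```python
def downtime_cost(hours: float) -> float:
    """Cumulative loss in dollars for an outage lasting `hours`.

    Assumed reading of the example timeline (not real business data):
      - first 10 minutes: no loss
      - past 10 minutes: one-time $50,000 hit
      - hours 2 to 48: $12,000 per hour
      - hours 48 to 168: $5,000 per hour
      - at one week (168 hours): additional one-time $500,000
    """
    if hours < 0:
        raise ValueError("outage length cannot be negative")

    cost = 0.0
    if hours > 10 / 60:
        cost += 50_000
    if hours > 2:
        cost += (min(hours, 48) - 2) * 12_000
    if hours > 48:
        cost += (min(hours, 168) - 48) * 5_000
    if hours >= 168:
        cost += 500_000
    return cost


if __name__ == "__main__":
    for h in (0.1, 1, 4, 48, 168):
        print(f"{h:>6} h outage: ${downtime_cost(h):,.0f}")
```

Once management supplies real numbers, only the thresholds and rates need to change; the shape of the calculation stays the same.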

scottalanmiller

@ntoxicator said:

@scottalanmiller

I greatly appreciate your brutal honesty in your replies. I suppose I took an emotional response as jabs at me. But your outside view plays a key role in your driven responses.

And that I do exactly this more than once every day and have for nearly a decade.

ntoxicator

@scottalanmiller said:

@ntoxicator said:

I have an email and documentation going to the CEO/CFO & management team, asking them what downtime is worth to them, or the cost of downtime, compared to the cost of having infrastructure to mitigate that.

Make them put it into calculation numbers. Meaning....

$5/minute or $10,000/hour.

It should be complex in 99% of cases, not just what I said above. It might be like this...

Timeline:

0-10 minutes: $0
10 minutes - 2 hours: $50,000
Then $12,000/hr
Until 48 hours, then $5K/hr
At one week it shoots to $500K and we are likely out of business

Thank you. I've asked for this before: for them to calculate what we make on an average day, in aggregate across all our clients, or the average cost of downtime. Will propose this again.

scottalanmiller

@ntoxicator tell them that without those numbers you have to assume that the losses would be minimal, because if they were significant they would know how important it was for you to have them.

Instil in them that their actions are informing you where their words are not.

ntoxicator

Absolutely... Noted.

Also doesn't help that for 2 years I've been begging for an IT budget. They want me to be director of IT, but without a budget to work with. Very difficult to make decisions in the company's best interest. I just give 'suggestions'. Then we have a guy internally that will beat up our vendors on pricing, more so on my efforts - or they get the approval. But that's beside the point, and more of an internal struggle

scottalanmiller

You might want to bring someone in from operations and talk about mitigation strategies should there be downtime. For example...

How much can you do with the server down? Lots of companies can keep doing something.
If you had extended downtime, could you shift lunch breaks, send people home early, get them back early the next day, do a company picnic, whatever, to offset the downtime?
Could you restore critical workloads to old hardware?
Could you work from a cloud resource?
Could you run off of your backup appliance?

The list goes on and on.

scottalanmiller

@ntoxicator said:

Also doesn't help that for 2 years I've been begging for an IT budget. They want me to be director of IT, but without a budget to work with.

Honestly, while that sounds bad, it is not. IT should never have a budget, that's a bad thing. Budgets mean that no one understands how it works.

In reality you should buy what is best for the business. That number is always better than the budget number. Generally it is much smaller than what companies will budget - budgets most often cause wild overspending. And sometimes when you need something important, or to invest in the future, a budget kills it and you have to "make do" with something less suited to the needs and financial future of the business.

scottalanmiller

We went through a major "single point of failure" event last year. It was one of those "all hands on deck, massive disasters" that people fear in IT. It was our biggest one in a decade and a half. There was no failover system. It took a monumental effort to get things back online. Everything that could go wrong, did. It was huge, it was painful and it was very, very emotional.

And when it was all said and done and we did the post mortem... as you should do, the final answer was this....

Yes, it was painful and emotional and costly... but not as costly as it would have been to have mitigated the risk. We knew, at the end of the outage, how much money was lost. We also know how much we would have spent to have HA to "maybe" have avoided the outage. Had we paid for the HA and had it worked perfectly.... it would still have been the wrong decision. Even having the incredibly unlikely outage that we had, HA would have been the bigger "outage" or "money loss event."
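That post mortem comparison is simple arithmetic, and it can be sketched in a few lines. The figures below are made up purely for illustration (the actual numbers from this outage were not given), and the function name `ha_net_benefit` and the `fraction_avoided` parameter are hypothetical. `fraction_avoided` is there to reflect that HA might only have avoided part of the loss.

```python
def ha_net_benefit(outage_loss: float, ha_cost: float,
                   fraction_avoided: float = 1.0) -> float:
    """Dollars gained (positive) or lost (negative) by having paid for HA.

    outage_loss      -- money actually lost in the outage
    ha_cost          -- what HA would have cost over the same period
    fraction_avoided -- share of the loss HA would have prevented (0.0 to 1.0)
    """
    if not 0.0 <= fraction_avoided <= 1.0:
        raise ValueError("fraction_avoided must be between 0 and 1")
    return outage_loss * fraction_avoided - ha_cost


if __name__ == "__main__":
    # Hypothetical figures: a $40,000 outage versus $100,000 spent on HA.
    result = ha_net_benefit(outage_loss=40_000, ha_cost=100_000)
    verdict = "HA pays off" if result > 0 else "HA costs more than the outage"
    print(f"Net benefit of HA: ${result:,.0f} ({verdict})")
```

With those illustrative inputs the result is negative even when HA is assumed to work perfectly, which is the situation described above.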

scottalanmiller

"As you should do", I realized, can be stressed two ways.

As YOU should do or as you SHOULD do.

I meant the latter. Wasn't saying that you should go do one, I meant that after an outage you should run a post mortem.

scottalanmiller

Two important things to think about with HA when running numbers...

HA isn't foolproof. It can fail and sometimes does. Not often, but it can. So it mitigates only "most" scenarios.
HA requires the issues to be IT issues. If there is a fire or a flood, platform-level HA will do nothing.

scottalanmiller

Also... many systems should not use platform HA. For Active Directory, for example, you should have HA turned off. You need to quantify which workloads would be on HA and which would not for your calculations.
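One way to keep that straight is to list each workload with its hourly downtime cost and whether it would sit on platform HA, then total only the protected ones. This is a minimal sketch: the `Workload` class, the workload names other than Active Directory, and every dollar figure are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    loss_per_hour: float  # dollars lost per hour while this workload is down
    platform_ha: bool     # True if it would be protected by platform HA


def covered_loss_per_hour(workloads: list[Workload]) -> float:
    """Hourly loss that platform HA could actually mitigate."""
    return sum(w.loss_per_hour for w in workloads if w.platform_ha)


def total_loss_per_hour(workloads: list[Workload]) -> float:
    """Hourly loss if every listed workload is down at once."""
    return sum(w.loss_per_hour for w in workloads)


if __name__ == "__main__":
    # Placeholder inventory; Active Directory is kept off platform HA.
    inventory = [
        Workload("Active Directory", 1_000, platform_ha=False),
        Workload("ERP", 2_000, platform_ha=True),
        Workload("File server", 500, platform_ha=True),
    ]
    print(f"Total exposure:    ${total_loss_per_hour(inventory):,.0f}/hr")
    print(f"Covered by HA:     ${covered_loss_per_hour(inventory):,.0f}/hr")
```

Only the covered figure belongs on the benefit side when weighing the cost of HA.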