Burned by Eschewing Best Practices

Dashrender

@Dashrender said:

Hell, just look at the other recent posts around here where the guy has a two node HA setup with a NAS. His (and countless others') experience shows that his solution worked and was viable.

This is where we differ, and I think that the terminology is incredibly important. You state that his non-HA setup was viable and that experience showed that it worked.

But looking at the same setup, I say that it didn't work: it put the company at risk and it wasted money for no reason, all things that I would define as experience with it having failed. As a CIO, I would look at the same setup and see a failure. It did not meet IT's goals of protecting the business, either technically or financially. Sure, they got lucky and it didn't impact them, but that doesn't suggest that IT, whose job is to provide the right solutions, was successful.

I think that properly defining success is super critical here. And it isn't purely about risk, though that's one aspect that cannot be ignored. The solution actively lost money. Losing money is rarely a signal of success.

How did it lose money? I'm thinking that it didn't, because doing real HA would have required larger servers (able to hold more disks) with a ton more disk for the required storage.

I'm thinking that he actually saved more than a small penny by using fewer disks and probably cheaper servers. Also, since he didn't have to worry about replication, he didn't need 4+ teamed 1 Gb NICs or 10 Gb NICs, etc.

scottalanmiller

@dafyre said:

The short version: It doesn't matter where it lives. If your users (onsite and off) can't get to it... then isn't it an outage?

Is it? So your user leaves the house and their laptop battery dies, is your office having an outage?

Your home user's ISP goes down or they lose power and they can't reach email, is it an outage?

I don't think that it's as simple as "can't reach it."

scottalanmiller

@Dashrender said:

How did it lose money?

Because money was spent without benefit. Physical money that could have been saved was spent. That's lost money.

dafyre

@scottalanmiller said:

@dafyre said:

The short version: It doesn't matter where it lives. If your users (onsite and off) can't get to it... then isn't it an outage?

Is it? So your user leaves the house and their laptop battery dies, is your office having an outage?

We are talking about system down time, not user whoopsies.

Your home user's ISP goes down or they lose power and they can't reach email, is it an outage?

If the user calls IT Guy and IT Guy says "It works for me", then again... it's classified as a user issue, not a "System Outage".

I don't think that it's as simple as "can't reach it."

Right you are. Read the rest of my post, lol.

scottalanmiller

@Dashrender said:

I'm thinking that he actually saved more than a small penny by using fewer disks and probably cheaper servers. Also, since he didn't have to worry about replication, he didn't need 4+ teamed 1 Gb NICs or 10 Gb NICs, etc.

You are creating false scenarios to make it sound good when it wasn't. He had two servers plus the NAS, all providing less than a single server with no NAS would have done. It was not just direct money lost, which was probably a lot (likely around 50-60% of all money spent was wasted); it additionally took on extra work (lost effort) and risk (lost safety).

You are comparing it to a solution that he didn't achieve. You have to compare to what it did achieve. Lots of money was wasted.

scottalanmiller

@dafyre said:

Considering Off-site, you can use the same criteria, with the end result being that "Exchange is down for everyone" because you have no local Exchange servers. Unless you work in the DC where the Exchange servers reside, you don't know if it is an "Exchange Outage" or an "Internet Outage"... Thinking about a Hosted service, I would call it an Exchange Outage

Then even if it is out for your whole company, country or region... it is not an outage, because someone, somewhere can get to it?

Dashrender

I'm mostly going with @dafyre's definition of an outage.

Sure, if my building has no power, we have a technical Exchange outage. But do I care? Nope! Why not? Because my whole business is down.

So in the truest sense of the term, yes, it's an Exchange outage, but I don't count it against myself or Exchange because the rest of my business is down as well, and I can't work. Furthermore, unlike other businesses, we can easily live without email for 2-3 days if we had to, so it's even that much less of an issue for us personally - and this may also be tenable for most small businesses as well.

Now that said, we don't appear down to the outside world. We are using a hosted Spam/AV scanning solution (AppRiver). The email flows to them and they hold it until we come back online. The outside world never notices a thing.

scottalanmiller

I'm not saying that there is a good definition for an outage, but that it is difficult to define. I would say that you can have a service outage (end users affected) that doesn't imply that the service itself is down, and you can also have the service itself down. I think that the term outage needs to be tied to the fault point, not the working point.

So in our example from a few months ago, Azure was "out" for us. The fault was all with Azure. It was a failure of the Azure platform that left all services unavailable to us for a long period of time. People tried to say that Azure was not "out" because some people, none of them being us, could get to Azure services.

Was that an outage? Most people could still use it. But the service to us was certainly "out" and it was out "at Azure" not at an ISP, not on our end, etc.
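To make the "fault point" idea concrete, here is a rough sketch in Python. The category names and labels are made up for illustration and are not anyone's official definition in this thread; the only thing it tries to show is that the label follows where the fault sits, not how many people happen to still be working.

```python
from enum import Enum


class FaultPoint(Enum):
    """Where the failure actually is (hypothetical categories)."""
    PROVIDER = "provider"  # the hosting platform or the service itself
    ISP = "isp"            # the network path between the users and the service
    LOCAL = "local"        # your own building, power, or devices


def classify(fault_point, users_affected):
    """Label an incident by its fault point rather than by who notices it."""
    if users_affected == 0:
        return "no impact"
    if fault_point is FaultPoint.PROVIDER:
        # The service failed, even if other customers elsewhere can still reach it.
        return "service outage"
    # The service is up; something between the users and the service is broken.
    return "access problem (service itself is up)"


# The platform fails for one customer while others still work.
print(classify(FaultPoint.PROVIDER, users_affected=50))
# The office ISP is down; the hosted service is fine.
print(classify(FaultPoint.ISP, users_affected=50))
# One laptop battery dies.
print(classify(FaultPoint.LOCAL, users_affected=1))
```

Under this reading, the Azure case above counts as a service outage even though most customers could still reach Azure, while a dead ISP link or laptop battery never counts against the service.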

Dashrender

@scottalanmiller said:

@Dashrender said:

How did it lose money?

Because money was spent without benefit. Physical money that could have been saved was spent. That's lost money.

Explain to me the savings?

The only savings I could possibly see is if they had used only one server instead of two. Then all of the disk would be local to that one host, and they wouldn't have spent money on a second host or the NAS - is that what you mean?

DustinB3403

@Dashrender said:

@scottalanmiller said:

@Dashrender said:

How did it lose money?

Because money was spent without benefit. Physical money that could have been saved was spent. That's lost money.

Explain to me the savings?

The only savings I could possibly see is if they had used only one server instead of two. Then all of the disk would be local to that one host, and they wouldn't have spent money on a second host or the NAS - is that what you mean?

That is what he means, yes.

scottalanmiller

@Dashrender said:

I'm mostly going with @dafyre's definition of an outage.

Sure, if my building has no power, we have a technical Exchange outage. But do I care? Nope! Why not? Because my whole business is down.

So in the truest sense of the term, yes, it's an Exchange outage, but I don't count it against myself or Exchange because the rest of my business is down as well, and I can't work. Furthermore, unlike other businesses, we can easily live without email for 2-3 days if we had to, so it's even that much less of an issue for us personally - and this may also be tenable for most small businesses as well.

Now that said, we don't appear down to the outside world. We are using a hosted Spam/AV scanning solution (AppRiver). The email flows to them and they hold it until we come back online. The outside world never notices a thing.

But that's a problem: you are applying different criteria to the same service. If YOUR ISP goes down, an in-house Exchange system is useless - no email can flow. If YOUR ISP goes down and you have hosted Exchange, Exchange is still up and usable, you have simple workarounds, and the company can choose to keep using the service.

But even though the outage is with you, not Exchange, we generally look at it the opposite way, which is very misleading, I feel.

scottalanmiller

@Dashrender said:

The only savings I could possibly see is if they had used only one server instead of two. Then all of the disk would be local to that one host, and they wouldn't have spent money on a second host or the NAS - is that what you mean?

Right, they bought three servers and got less value than buying only one. Now, you couldn't quite eliminate the full cost of two of the three devices, because the one remaining server would need to be somewhat bigger, so you can't realistically save 67% of the cost. But 50-60% is easy. Two of the three devices were useless and actively hurt them by increasing risk, power consumption, IT effort, cabling, HVAC, space, etc.

So not only did the project fail to deliver on its goals (HA and general protection), it also failed to do so in a cost-effective way. It lost a huge amount of money while additionally not meeting its goals.
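For anyone who wants to see where the 67% ceiling and the 50-60% range come from, here is a quick back-of-the-envelope sketch in Python. All of the prices are invented for illustration; nothing in the thread gives real figures, and the helper name `wasted_fraction` is hypothetical.

```python
def wasted_fraction(spent_devices, single_server_cost):
    """Fraction of the actual spend that a single-server design would have saved."""
    total_spent = sum(spent_devices)
    return (total_spent - single_server_cost) / total_spent


# Hypothetical: two hosts plus a NAS, all priced the same.
three_devices = [5000, 5000, 5000]

# Ceiling: the single server costs the same as one of the three devices.
upper_bound = wasted_fraction(three_devices, 5000)

# More realistic: the single server needs its own local disks, so it costs more.
realistic = wasted_fraction(three_devices, 6500)

print(f"upper bound: {upper_bound:.0%}")  # 67%
print(f"realistic:   {realistic:.0%}")    # 57%
```

With equal device prices the ceiling is two thirds, and any extra spend on the single box (local disks, a better chassis) pulls the figure down toward the 50-60% range.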

Dashrender

Ok.. all of that makes sense - I'm wondering why you haven't brought that up in his thread? Instead he was more or less punted to Scale or XenServer with HA-Lizard (Hyper-V, while mentioned, was quickly dismissed, it appeared).

I did this morning ask the question, why not just go to one single super awesome box... the answer 'HA.'

scottalanmiller

If you hire someone to get you a cost effective and simple means to move thousands of moving boxes from Philly to Seattle and you expect a large moving truck and he returns with a thousand Ferraris, was he successful?

The goal was to be cost effective. And to be simple.

Anyone can move boxes; we assume that the ability to actually move the boxes is below the line of success. Success is not in moving the boxes, because that is so simple that it is assumed. What we needed from someone giving advice is cost effectiveness and simplicity.

Is driving a thousand Ferraris, each with one moving box strapped into the passenger seat, simpler than having a single box truck take all of the boxes at once in a single load? Is it more cost effective?

At what point is this a failure or a success? I would deem this a failure - this is about the biggest waste of resources and cost possible while not providing a superior solution. It's all bad. Each piece (simplicity, speed, cost) is below a reasonable baseline.

scottalanmiller

@Dashrender said:

I did this morning ask the question, why not just go to one single super awesome box... the answer 'HA.'

Pretty sure that HA was listed as a requirement. We definitely covered what I said above several times in stating that the NAS was not delivering any benefit and not meeting goals which is all it takes to get all that is stated above. Once the NAS is known to not have delivered HA, we know that all of the money around that was wasted.
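One hedged way to see why two hosts hanging off a single NAS can end up less available than one plain server is the standard series/parallel availability arithmetic. The sketch below is Python, assumes independent failures, and uses made-up availability figures; it is an illustration, not a measurement of the setup being discussed.

```python
def series(*availabilities):
    """Every component must be up, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """At least one component must be up (independent failures assumed)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Hypothetical availability figures, for illustration only.
HOST = 0.999
NAS = 0.999

single_server = HOST
# Two redundant hosts that both depend on the same single NAS.
two_hosts_one_nas = series(parallel(HOST, HOST), NAS)

print(f"single server:       {single_server:.6f}")
print(f"two hosts + one NAS: {two_hosts_one_nas:.6f}")
print("three-device design is worse:", two_hosts_one_nas < single_server)
```

Under these assumptions the pair of hosts is very reliable on its own, but the whole stack can never be more available than the single NAS that everything depends on.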

Dashrender

@scottalanmiller said:

@Dashrender said:

I'm mostly going with @dafyre's definition of an outage.

Sure, if my building has no power, we have a technical Exchange outage. But do I care? Nope! Why not? Because my whole business is down.

So in the truest sense of the term, yes, it's an Exchange outage, but I don't count it against myself or Exchange because the rest of my business is down as well, and I can't work. Furthermore, unlike other businesses, we can easily live without email for 2-3 days if we had to, so it's even that much less of an issue for us personally - and this may also be tenable for most small businesses as well.

Now that said, we don't appear down to the outside world. We are using a hosted Spam/AV scanning solution (AppRiver). The email flows to them and they hold it until we come back online. The outside world never notices a thing.

But that's a problem: you are applying different criteria to the same service. If YOUR ISP goes down, an in-house Exchange system is useless - no email can flow. If YOUR ISP goes down and you have hosted Exchange, Exchange is still up and usable, you have simple workarounds, and the company can choose to keep using the service.

But even though the outage is with you, not Exchange, we generally look at it the opposite way, which is very misleading, I feel.

I think a big difference in the comparisons comes from what you can see versus what you can't. If you're just an O365 user, you can't see the difference between an ISP outage/Azure outage/Exchange outage. To you, it's just an outage.

When you control all the pieces, local Exchange, ISP, Firewall, etc - you're able to more specifically understand the failure.

Plus, with local services, assuming the main service itself and the local network are up, you can often get local access. That's never the case with hosted options. If the hosted solution is having an outage, it's because something is preventing the whole thing from working. Even if it's only that the ISP is down, the fact is no one can use it, because everyone/everything is on the outside.

scottalanmiller

@Dashrender said:

I think a big difference in the comparisons comes from what you can see versus what you can't. If you're just an O365 user, you can't see the difference between an ISP outage/Azure outage/Exchange outage. To you, it's just an outage.

There is a reason that I don't like this, though. And that is that we define "Exchange is down" based on factors unrelated to Exchange. See the problem?

How often is Exchange expected to fail? Well that depends on the price of raisins on Tuesday. Huh?

Having Exchange deemed "up or down" based on factors unrelated to Exchange or the service seems like a really bad, and totally useless, idea. If we start doing that, why use the term outage at all because what value does it have for us?

dafyre

@scottalanmiller said:

@Dashrender said:

I did this morning ask the question, why not just go to one single super awesome box... the answer 'HA.'

Pretty sure that HA was listed as a requirement. We definitely covered what I said above several times in stating that the NAS was not delivering any benefit and not meeting goals which is all it takes to get all that is stated above. Once the NAS is known to not have delivered HA, we know that all of the money around that was wasted.

The problem that most people have with @scottalanmiller's definition of HA is that he uses a much more deeply defined idea of HA than most folks do. Most people really mean redundancy when they say HA: it answers the question, "Will it stay up if Server A explodes?"

If their response is, "Yes, it will stay up, because it will automatically fail over with little-to-no downtime to Server B,"

Then a lot of people think they have High Availability, when what they really have is Redundancy.

scottalanmiller

@Dashrender said:

When you control all the pieces, local Exchange, ISP, Firewall, etc - you're able to more specifically understand the failure.

True, it assists in placing blame.

dafyre

@scottalanmiller said:

@Dashrender said:

When you control all the pieces, local Exchange, ISP, Firewall, etc - you're able to more specifically understand the failure.

True, it assists in placing blame.

Not my fault!