Resurrecting the Past: A Project Post-Mortem
-
@dafyre said:
You are correct. We started with 2 x VMware servers and were able to go from 16 physical servers down to 6.
Still need six? What's keeping you from two?
-
@scottalanmiller said:
@dafyre said:
You are correct. We started with 2 x VMware servers and were able to go from 16 physical servers down to 6.
Still need six? What's keeping you from two?
We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.
-
@dafyre said:
We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.
At the time, but now? Anything keeping you from two machines today? Considering you can get over a terabyte of memory per machine, is memory an issue?
-
How much memory did you need back then? Even in 2007 you could get an awful lot. I think we were deploying 64GB standard then and could go much larger when needed, though not by default. This was not in an SMB but in a Fortune 100, yet per machine it still made sense.
-
I assume you have something like 10GigE available between sites?
-
@scottalanmiller said:
@dafyre said:
We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.
At the time, but now? Anything keeping you from two machines today? Considering you can get over a terabyte of memory per machine, is memory an issue?
The machines we had came with 32GB RAM (these machines had already been purchased for another project, but they got repurposed before that project got off the ground).
-
@scottalanmiller said:
I assume you have something like 10GigE available between sites?
No. The buildings are connected via 2 x 1gig fiber. The link utilization never really peaks out except when backups are running.
-
@dafyre said:
The machines we had came with 32GB RAM (these machines had already been purchased for another project, but they got repurposed before that project got off the ground).
That's another issue to be addressed. Why were machines purchased that were not needed (yet), and why are other projects being forced to make do with the rejects of failed projects?
Working with what you have is always something to be considered, but purchasing when not needed is never a good idea.
-
@scottalanmiller said:
Working with what you have is always something to be considered, but purchasing when not needed is never a good idea.
^ That x 1000. These machines were originally waaaaaaaay overkill for the project that we had coming down the pipe. That project was a done deal, signed and delivered. We purchased the machines for that specific project, and then had the fire that took out a building and pushed the administration over the edge with concern about major data loss. We decided that since they were so overkill, the VMs for that project would be our first foray into virtualization.
-
What kind of workloads are we supporting here? AD, I assume, and SQL Server. Is everything on Windows? Any special systems needing special consideration? Which ones lack application-layer fault tolerance?
-
From your description, it sounds like the business is not averse to outages; it just wants to avoid data loss, and, of course, reduced downtime is always a plus. Are small amounts of downtime in the case of a disaster acceptable?
For example, FT, HA and Async Failover take you from zero downtime, to moments of downtime, to minutes of downtime, but the cost and complexity drop dramatically and performance increases as you go left to right.
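A rough back-of-the-envelope way to see the spectrum (the failure rate and recovery times below are made-up assumptions for illustration, not measurements): expected annual downtime is just failures per year times recovery time per failure.

```python
# Sketch: expected annual downtime = failures/year * recovery time per failure.
# The failure rate and recovery times are illustrative assumptions only.
approaches = {
    # name: (assumed host failures per year, assumed recovery seconds per failure)
    "FT (lockstep mirror)":     (0.5, 0),     # no interruption on a single host loss
    "HA (restart on survivor)": (0.5, 300),   # minutes to boot the VMs elsewhere
    "Async failover (manual)":  (0.5, 1800),  # someone has to notice and fail over
}

for name, (failures_per_year, recovery_seconds) in approaches.items():
    downtime_minutes = failures_per_year * recovery_seconds / 60
    print(f"{name:27s} ~{downtime_minutes:6.1f} minutes of downtime per year")
```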
-
@scottalanmiller said:
What kind of workloads are we supporting here? AD, I assume, and SQL Server. Is everything on Windows? Any special systems needing special consideration? Which ones lack application-layer fault tolerance?
At that time, the only thing with application-layer fault tolerance was AD. We had a Linux box for DHCP. There were 2 x Web Servers, 2 x Moodle Servers, 2 x SharePoint Servers (separate pieces of the project), and the standalone SQL Server (not virtualized... it was both virtualized and clustered later), a CLEP Testing Server, an IT Utility Server, and a few others. Several things were consolidated onto the same server where it made sense... e.g., our IT Utility server was also the OpenFire server, as well as our Spiceworks server (am I allowed to say that here? lol).
-
@scottalanmiller said:
From your description, it sounds like the business is not averse to outages; it just wants to avoid data loss, and, of course, reduced downtime is always a plus. Are small amounts of downtime in the case of a disaster acceptable?
That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month. The business office told us they wanted us to keep things functional as much as possible, and fit it into $budget.
When we originally started the virtualization project, everything was running on an MD1000. That is the point at which we quit looking at DAS and started looking at SAN.
-
@scottalanmiller said:
For example, FT, HA and Async Failover take you from zero downtime, to moments of downtime, to minutes of downtime, but the cost and complexity drop dramatically and performance increases as you go left to right.
We did mostly HA after we got approval for the storage cluster (SAN, if I must call it that, lol). Our servers were hefty enough that one could absorb the VMs from the other if a host failed. Our storage cluster was truly HA -- we could lose either node and nobody would notice. We had 2 x fiber lines between the two buildings that took different paths. We did not have redundant network switches, which was an oversight on our part, hence the "mostly" HA, lol.
Fortunately for us, there was never any unplanned downtime from a failed switch.
-
@dafyre said:
That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month.
How was that happening? That's a different issue. Even with no HA of any sort, a restore once a year should be a huge red flag. What was causing all of these restores?
-
@dafyre said:
Our storage cluster was truly HA -- we could lose either node and nobody would notice.
That doesn't describe HA. That describes redundancy and failover. It suggests an attempt at HA but in no way means that it was achieved. I can make something that is really fragile, super cheap, and low availability (lower than a standard server) but still fails over when I do a node test.
This redundancy instead of HA is actually what SAN vendors are famous for. Dell and HP, especially, have a track record of making entry-level devices that are "fully redundant" but are LA, not SA or HA. They can demonstrate a failover anytime you want, but when they actually fail they rarely fail over successfully and, in fact, fail far more often than a normal server.
So be very careful about how you determined that this setup is HA. The nodes being in different physical locations makes that far more likely than if they were in a single chassis, but your description alone does not establish that there is HA without more details.
-
The posts about the Dell MD SAN and the EMC VNX SAN from a few hours ago are both examples of this. In both cases someone would demonstrate how yanking out one of the controllers would cause a fault and the other controller would transparently take over. In the case of a controller being yanked, they are pretty reliable. But in the real world, that's not what a failure looks like; typically failures are caused by firmware issues, not physical removal. And physical failures and physical removal are very different things as well. You rarely get split brain during a test, or a shock to the system.
-
@dafyre said:
When we originally started the virtualization project, everything was running on an MD1000. That is the point at which we quit looking at DAS and started looking at SAN.
The only difference between a DAS and a SAN is the addition of the switching layer. DAS is always superior to SAN, all other things being equal, because it has lower latency and fewer points of failure. There are hard limits on how many physical hosts can attach to any given DAS, but this can easily be in the hundreds. A DAS and a SAN are the same physical devices; the difference is determined by how they are connected.
But when the option exists, you never choose a SAN unless a DAS can't physically do what you need, as the DAS is simply faster and safer by the laws of physics.
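A quick way to see the "fewer points of failure" point: the path to storage is a serial chain, so component availabilities multiply, and every extra layer (such as a switch) can only lower the product. The per-component numbers below are assumptions purely to show the arithmetic.

```python
# Serial chain: the storage path is up only when every component in it is up,
# so the availabilities multiply. The figures are illustrative assumptions.
def chain_availability(components):
    total = 1.0
    for availability in components:
        total *= availability
    return total

hba, switch, array = 0.9995, 0.9995, 0.999  # assumed per-component availability

das_path = chain_availability([hba, array])          # host -> array directly
san_path = chain_availability([hba, switch, array])  # host -> switch -> array

print(f"DAS path availability: {das_path:.5f}")  # fewer components in series
print(f"SAN path availability: {san_path:.5f}")  # the switching layer can only reduce it
```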
-
@scottalanmiller said:
That doesn't describe HA. (snip)
How would you define HA, then? I mean, isn't the scenario I just described (i.e., keeping the systems up despite a major failure) a form of HA? Yes, it is failover, but isn't failover (and failback) a part of HA? Isn't that the point of having a failover cluster?
NB: This was a 2-node + witness active/active cluster.
This redundancy instead of HA is actually what SAN vendors are famous for. Dell and HP, especially, have a track record of making entry-level devices that are "fully redundant" but are LA, not SA or HA. They can demonstrate a failover anytime you want, but when they actually fail they rarely fail over successfully and, in fact, fail far more often than a normal server.
Let's call it redundancy, then... our uptime went from 80% to 99% (guesstimate based on experience). We were still in a much better situation than we were before. The offices that lost money while we were doing restores were no longer suffering from anywhere near as many interruptions from our servers being down. (This was before I learned a little more business sense and how to calculate the cost of downtime, etc. We presented our solution to the business & financial folks, and they said do it. So we did.)
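(For anyone who wants to put numbers on that cost-of-downtime point, the arithmetic is simple; the hourly cost below is a placeholder assumption, not a figure from our environment.)

```python
# Annual downtime hours and rough cost at a given uptime percentage.
# The $500/hour figure is a placeholder assumption for illustration.
HOURS_PER_YEAR = 24 * 365
ASSUMED_COST_PER_HOUR = 500

for uptime in (0.80, 0.99):
    downtime_hours = HOURS_PER_YEAR * (1 - uptime)
    yearly_cost = downtime_hours * ASSUMED_COST_PER_HOUR
    print(f"{uptime:.0%} uptime -> {downtime_hours:7.1f} hours down per year, "
          f"~${yearly_cost:,.0f}/year at the assumed hourly cost")
```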
@scottalanmiller said:
How was that happening? That's a different issue. Even with no HA of any sort, a restore once a year should be a huge red flag. What was causing all of these restores?
To this day, we are not sure. I think it was a faulty controller or something. After we got the SAN ^H^H^H storage cluster installed, we moved the SQL Server's database files, etc., off of the PowerVault and never looked back. (This was after the PV was out of warranty.)
-
@dafyre said:
How would you define HA, then?
Pretty easy, actually. HA = High Availability. So let's start by defining Standard Availability.
What is SA? The most commonly accepted approach is to take a well-treated, enterprise commodity server (in the case of systems HA like we are discussing here) and treat that as the baseline. This would generally be the big, mainline servers such as an HP DL380 or a Dell R730. These are the best-selling servers in the world, and they are the standard "middle of the road" servers from the enterprise vendors. These are neither the entry-level devices nor special-case devices. They are the bulk of sales, the standard against which all others are measured, and designed for general-purpose computing.
So: SA can be defined as the availability of a system consisting of a single, properly maintained enterprise commodity server.
Therefore, what is HA? HA is then simple to describe: High availability is a system that significantly improves upon the availability of an SA system.
Being 1% better isn't enough. And it doesn't matter how it is achieved. The terms give it all away - it is about availability and nothing else. Any other aspect is misdirection and a red herring. IT departments and vendors tend to focus on the "other things" because they are simple. Is this redundant? Simple yes or no. But is it reliable? Oh, I'd have to think about that.
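A toy model of that distinction, with every number below an assumption purely for illustration: a redundant pair only earns the label HA when its nodes and its failover mechanism are reliable enough to clearly beat the single-server baseline; a fragile "redundant" device can land below it.

```python
# Sketch: "fully redundant" does not automatically mean highly available.
# All numbers are assumptions for illustration, not measurements.
SA_BASELINE = 0.999  # assumed availability of one well-maintained standard server

def pair_availability(node, failover_success):
    """Two-node pair with an imperfect failover mechanism.

    With probability failover_success the survivor takes over (down only when
    both nodes are down); otherwise the redundancy buys nothing and the pair
    behaves like a single node.
    """
    both_down = (1 - node) ** 2
    return failover_success * (1 - both_down) + (1 - failover_success) * node

# A fragile "redundant" device: flakier nodes, failover that often misfires.
fragile_pair = pair_availability(node=0.995, failover_success=0.5)

# A solid pair: decent nodes, failover that almost always works.
solid_pair = pair_availability(node=0.999, failover_success=0.99)

print(f"SA baseline (one good server): {SA_BASELINE:.6f}")
print(f"'Redundant' but fragile pair:  {fragile_pair:.6f}")  # below the baseline: LA
print(f"Pair with reliable failover:   {solid_pair:.6f}")    # well above it: HA
```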