Resurrecting the Past: a Project Post Mortem
-
@scottalanmiller said:
@dafyre said:
We had roughly 16 physical servers at the time (we hadn't done much with virtualization yet).
Perspective Modern: Using virtualization and consolidation, down to how many physical hosts could this be collapsed today? Sixteen in 2007 would easily fit into one or two hosts today, I would assume. Was this sixteen for performance, or were some of them there for other purposes (separation of duties, failover, etc.)?
Response You are correct. We started with 2 x VMware servers and were able to go from 16 Physical servers down to 6.
-
@dafyre said:
Response You are correct. We started with 2 x VMware servers and were able to go from 16 Physical servers down to 6.
Still need six? What's keeping you from two?
-
@scottalanmiller said:
@dafyre said:
Response You are correct. We started with 2 x VMware servers and were able to go from 16 Physical servers down to 6.
Still need six? What's keeping you from two?
We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.
-
@dafyre said:
We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.
At the time, but now? Anything keeping you from two machines today? Considering you can get over a terabyte of memory per machine, is memory an issue?
-
How much memory did you need back then? Even in 2007 you could get an awful lot. I think we were deploying 64GB as standard then, and could go much larger when needed, just not by default. This was at a Fortune 100, not an SMB, but per machine it still made sense.
-
I assume you have something like 10GigE available between sites?
-
@scottalanmiller said:
@dafyre said:
We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.
At the time, but now? Anything keeping you from two machines today? Considering you can get over a terabyte of memory per machine, is memory an issue?
The machines we had came with 32GB RAM (these machines had already been purchased for another project, but they got repurposed before that project got off the ground)
-
@scottalanmiller said:
I assume you have something like 10GigE available between sites?
No. The buildings are connected via 2 x 1gig fiber. The link utilization never really peaks out except when backups are running.
-
@dafyre said:
The machines we had came with 32GB RAM (these machines had already been purchased for another project, but they got repurposed before that project got off the ground)
That's another issue to be addressed. Why were machines purchased that were not needed (yet) and why are other projects being forced to make do with the rejects of failed projects?
Working with what you have is always something to be considered, but purchasing when not needed is never a good idea.
-
@scottalanmiller said:
Working with what you have is always something to be considered, but purchasing when not needed is never a good idea.
^ That x 1000. These machines were originally waaaaaaaay overkill for the project that we had coming down the pipe. That project was a done deal, signed and delivered. We purchased the machines for that specific project, and then had the fire that took out a building and pushed the administration over the edge with concern about major data loss. We decided that since they were so overkill, the VMs for that project would be our first foray into virtualization.
-
What kind of workloads are we supporting here? AD I assume, SQL Server. Everything is on Windows? Any special systems needing special considerations? Which ones lack application layer fault tolerance?
-
From your description, it sounds like the business is not averse to outages, just wants to avoid data loss and, of course, reduced downtime is always a plus. Are small amounts of downtime in case of a disaster acceptable?
For example, FT, HA and Async Failover take you from zero downtime to moments of downtime to minutes of downtime, but the cost and complexity drop dramatically and performance increases as you go left to right.
-
@scottalanmiller said:
What kind of workloads are we supporting here? AD I assume, SQL Server. Everything is on Windows? Any special systems needing special considerations? Which ones lack application layer fault tolerance?
At that time, the only thing with application-layer fault tolerance was AD. We have a Linux box that is for DHCP. There were 2 x Web Servers, 2 x Moodle Servers, 2 x SharePoint Servers (separate pieces of the project), the Standalone SQL Server (not virtualized at the time; it was both virtualized and clustered later), a CLEP Testing Server, an IT Utility Server, and a few others. Several things were consolidated into the same server where it made sense. For example, our IT Utility server was also the OpenFire server, as well as our Spiceworks server (am I allowed to say that here? lol).
-
@scottalanmiller said:
From your description, it sounds like the business is not averse to outages, just wants to avoid data loss and, of course, reduced downtime is always a plus. Are small amounts of downtime in case of a disaster acceptable?
That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month. The business office told us they wanted us to keep things functional as much as possible, and fit it into $budget.
When we originally started the virtualization project, everything was running on an MD1000. That is the point at which we quit looking at DAS and started looking at SAN.
-
@scottalanmiller said:
For example, FT, HA and Async Failover take you from zero downtime to moments of downtime to minutes of downtime, but the cost and complexity drop dramatically and performance increases as you go left to right.
We did mostly HA after we got approval for the storage cluster (SAN, if I must call it that, lol). Our servers were hefty enough that one could absorb the VMs from the other if a host failed. Our storage cluster was truly HA -- we could lose either node and nobody would notice. We had 2 x fiber lines between the two buildings that took different paths. We did not have redundant network switches, as that was an oversight on our part, thus causing the mostly HA, lol.
Fortunately for us, there was never any unplanned down time from a failed switch.
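The claim that one host could absorb the VMs from the other comes down to a RAM budget. Here is a minimal sketch of that check, assuming the 32GB hosts mentioned earlier in the thread. The per-VM sizes and the 4GB hypervisor overhead are made-up illustration values, not figures from this environment, and the function name is hypothetical.

```python
def survives_host_loss(host_ram_gb, vm_ram_gb, overhead_gb=4):
    """Return True if all VMs still fit in RAM after the largest host fails.

    This is an aggregate check only. With two hosts it is exact, because
    the survivor is a single machine. With more hosts it ignores how
    individual VMs pack onto individual hosts.
    """
    # RAM left for guests on each host after the (assumed) hypervisor overhead.
    usable = [ram - overhead_gb for ram in host_ram_gb]
    # Worst case: the host with the most usable RAM is the one that dies.
    remaining = sum(usable) - max(usable)
    return sum(vm_ram_gb) <= remaining


hosts = [32, 32]                    # two 32GB hosts, as described above
light_load = [4, 4, 2, 2, 4, 8]     # hypothetical VM sizes, 24GB total
heavy_load = [8, 8, 8, 8]           # hypothetical VM sizes, 32GB total

print(survives_host_loss(hosts, light_load))
print(survives_host_loss(hosts, heavy_load))
```

The second case shows how quickly the budget disappears on 32GB hosts: four 8GB guests run fine across both machines but no longer fit on the survivor.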
-
@dafyre said:
That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month.
How was that happening? That's a different issue. Even with no HA of any sort, a restore once a year should be a huge red flag. What was causing all of these restores?
-
@dafyre said:
Our storage cluster was truly HA -- we could lose either node and nobody would notice.
That doesn't describe HA. That describes redundancy and failover. It suggests an attempt at HA but in no way means that it was achieved. I can make something that is really fragile, is super cheap and is low availability (lower than a standard server) but fails over when I do a node test.
This redundancy instead of HA is actually what SAN vendors are famous for. Dell and HP, especially, have a track record of making entry level devices that are "fully redundant" but are low availability (LA), not standard availability (SA) or HA. They can demonstrate a failover anytime you want, but when they fail they rarely fail over successfully and, in fact, fail far more often than a normal server.
So be very careful about how you determined that this setup is HA. The nodes being in different physical locations makes that far more likely than if they were in a single chassis, but your description in no way suggests that there is HA without more details.
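The distinction between redundancy and HA can be put into rough arithmetic. This is a minimal sketch of a simple active/standby model, not a measurement of any vendor's hardware: the service is up if the primary is up, or if the primary is down, the failover actually succeeds, and the secondary is up. Every number below is invented for illustration, and the function name is hypothetical.

```python
def pair_availability(node_availability, failover_success):
    """Availability of a two-node active/standby pair under a simple model.

    node_availability: fraction of time a single node is up (0..1)
    failover_success:  probability that a real failure fails over cleanly (0..1)
    """
    a, c = node_availability, failover_success
    # Up when the primary is up, or when it is down but failover worked
    # and the secondary is itself up.
    return a + (1 - a) * c * a


standard_server = 0.99                        # assumed single-server baseline
good_pair = pair_availability(0.99, 1.0)      # solid nodes, failover always works
fragile_pair = pair_availability(0.95, 0.5)   # fragile nodes, failover works half the time

print(f"standard server: {standard_server:.4f}")
print(f"good pair:       {good_pair:.4f}")
print(f"fragile pair:    {fragile_pair:.4f}")
```

Under these assumed inputs the fragile pair comes out below the single standard server even though it is "fully redundant", which is the point being made above: a successful node test proves that failover exists, not that availability is high.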
-
The posts about the Dell MD SAN and the EMC VNX SAN from a few hours ago are both examples of this. In both cases someone would demonstrate how yanking one of the controllers out would cause a fault and the other controller would transparently take over. In the case of a controller being yanked, they are pretty reliable. But in the real world, that's not what a failure looks like and typically failures are caused by firmware issues, not physical removal. And physical failures and physical removal are very different things as well. You rarely get split brain during a test, or have a shock to the system.
-
@dafyre said:
When we originally started the virtualization project, everything was running on an MD1000. That is the point at which we quit looking at DAS and started looking at SAN.
Only difference between a DAS and a SAN is the addition of the switching layer. DAS is always superior to SAN, all other things being equal, because it has lower latency and fewer points of failure. There are hard limits as to how many physical hosts can attach to any given DAS, but this can easily be in the hundreds. But a DAS and SAN are the same physical devices, just determined by how they are connected.
But when the option exists, you never choose a SAN unless a DAS can't do what you need physically as the DAS is simply faster and safer by the laws of physics.
-
@scottalanmiller said:
That doesn't describe HA. (snip)
How would you define HA, then? I mean, is the scenario I just described (i.e., keeping the systems up despite a major failure) not a form of HA? Yes, it is failover, but isn't failover (and failback) a part of HA? Isn't that the point of having a failover cluster?
NB: This was a 2-node + witness active / active cluster.
This redundancy instead of HA is actually what SAN vendors are famous for. Dell and HP, especially, have a track record of making entry level devices that are "fully redundant" but are low availability (LA), not standard availability (SA) or HA. They can demonstrate a failover anytime you want, but when they fail they rarely fail over successfully and, in fact, fail far more often than a normal server.
Let's call it redundancy, then... our uptime went from 80% to 99% (guesstimate based on experience). We were still in a much better situation than we were before. The offices that lost money while we were doing restores were no longer suffering from anywhere near as many interruptions from our servers being down. (This was before I learned a little more business sense to calculate the cost of downtime, etc. We presented our solution to the business & financial folks, and they said do it. So we did).
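To put the 80% and 99% guesstimates into hours, here is a minimal sketch of the conversion. It assumes a 365-day year and round-the-clock service. The function name is hypothetical, and no cost-per-hour figure is included because none was given.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming the service is expected 24x7


def downtime_hours_per_year(uptime_percent):
    """Convert an uptime percentage into expected hours of downtime per year."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for uptime in (80, 99):
    hours = downtime_hours_per_year(uptime)
    print(f"{uptime}% uptime -> {hours:.1f} hours down per year")
```

That works out to roughly 1,752 hours a year at 80% against about 88 hours at 99%. Multiplying either figure by what an hour of downtime costs the affected offices gives the number to weigh against the price of the storage cluster.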
@scottalanmiller said:
How was that happening? That's a different issue. Even with no HA of any sort, a restore once a year should be a huge red flag. What was causing all of these restores?
To this day, we are not sure. I think it was a faulty controller or something. After we got the SAN ^H^H^H storage cluster installed, we moved the SQL Server's database files, etc., off of the PowerVault and never looked back. (This was after the PV was out of warranty).