I work for a growing manufacturing company. When I started in 2007 we had 1 location and I was the only one in IT. Now we have 4 members of the IT Department (all stationed at our corporate headquarters in Fort Worth) and a total of 10 sites to support (one of these is not yet fully operational). In the next 6 months, two of the existing sites will move into the site I mentioned as not yet being fully operational. But we will still have a total of 10 sites to support, as I believe we will be adding a couple more by the end of the year.
In 2012 we finally started down the path of virtualizing our servers and put in 1 ESXi host with all local storage (about 2 TB of it on 10K SAS drives). Then, in 2013, we added another ESXi host (with similar specs but better processors) to finally virtualize our ERP system. We decided to make sure we had two servers with enough processor power, RAM, and IOPS to run everything in the event that one host died. We decided to put half the VMs on each host and use Veeam for backups. At the time the second server went in, we were told that 2-4 hours would be fine as an RTO, and that an RPO of the previous day's backup would be fine. Backups are taken offsite daily as well. So we left things as local storage with vSphere Essentials, which is what we still have today.
Since those two servers were put in, the number of users has grown, the number of VMs has grown, and the amount of storage in use has grown. We're at the point where neither server would have enough storage to run all VMs from the other server if one of them failed. That's a problem and creates an interesting DR situation.
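To put a number on that gap, here is a minimal sketch of the check I have in mind. The host names and the used-storage figures are made up for illustration (only the roughly 2 TB per host comes from our actual setup), and it only looks at storage, not RAM or CPU.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    storage_tb: float  # usable local datastore capacity
    used_tb: float     # storage consumed by the VMs living on this host


def can_absorb(survivor: Host, failed: Host) -> bool:
    """True if the survivor has enough free storage for the failed host's VMs."""
    free_tb = survivor.storage_tb - survivor.used_tb
    return free_tb >= failed.used_tb


# Hypothetical usage numbers, not measurements from our environment.
esxi1 = Host("esxi1", storage_tb=2.0, used_tb=1.4)
esxi2 = Host("esxi2", storage_tb=2.0, used_tb=1.5)

for survivor, failed in [(esxi1, esxi2), (esxi2, esxi1)]:
    ok = can_absorb(survivor, failed)
    print(f"{failed.name} fails -> {survivor.name} can absorb its VMs: {ok}")
```

Plugging in real datastore numbers, and limiting `used_tb` to only the critical VMs, would show how far short each host falls and how big a replication target would need to be.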
Let's take servers at remote sites out of this discussion for right now. Some of the remote sites have / will have servers, but the core applications are hosted at HQ. The focus of this post is on those core applications which would be applicable to all sites.
The core applications in my mind hosted by HQ would be as follows (with the most critical of these being anything used by every single site):
- Epicor ERP system (servicing all sites) - consists of a SQL Server and 2 application servers
- Exchange 2010 (servicing all sites) - left on site due to ITAR regulations
- Web server (servicing all sites) - corporate access to ERP system data, contains many enhancements for production flow, used for electronic scheduling boards in the shops at almost every location
- BarTender server - for label printing at many of our sites
- Elastix server - PBX for most of our sites (but not all of them)
- SharePoint Foundation - contains Quality Management system data for most sites
- Domain controllers (servicing all sites) - 2 of these virtual, 2 still physical
- SolidWorks ePDM system (servicing the site with largest revenue) - 2 servers
- UniFi controller VM (servicing all sites) - 1 VM as controller for APs across all sites
- File server with part programs (servicing a couple of sites) - part programs for machines stored here
- Veeam server - for backup and restores
We're also getting to the point where the shops are using less and less paper. That means when a cutting operation is finished, the operator will eventually have no paper to tell him / her what to do next and must rely on an electronic scheduling board. If those don't work, we cannot make parts, and we lose money. Some cutting / bending operations could take hours before they would need to look at what comes next or transact with our ERP system, download the next part program, etc.
I started a conversation with the boss yesterday and mentioned that, with the push to go paperless, I was concerned about our infrastructure not being resilient enough to hit our RTO (I'm probably the only one who is concerned). He said he thought a couple of hours of downtime was still OK, but the longer the downtime, the more it will cost the company in lost manufacturing time.
But it won't just cost HQ, it will cost every single company under our corporate umbrella. Each remote site operates under its own name but has the same ownership and executive management team.
He basically said to put together what I think we need to do and how much it will cost. But I threw it back at him and said I really needed to know the cost of downtime to put together a solution they would actually approve and that would meet the needs of the business. It could be something as simple as getting another host that serves as a replication target for the other two production servers so we can flip the switch and turn on critical VMs should a host die. Or it could be something as fancy as throwing in a vSAN cluster and going up to vSphere Essentials Plus or even Standard.
When I mentioned the need to know cost of downtime, he suggested rather than talk to folks in operations and try to get them to ballpark that for us, we should talk to the execs about where we are now, how long it would take us to recover with what we have, and present possible solutions and costs to close that gap. But again, without knowing the cost of downtime, it's kind of like shooting in the dark to some extent. The more remote sites / companies are using our ERP system, the more critical it becomes, driving the cost of downtime up. And if the execs don't see dollars lost, they are less likely to shell out for much.
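To show what I mean by needing that number, here is a rough sketch of the comparison I'd like to put in front of the execs. Every figure in it is a placeholder I made up (the hourly cost, the recovery times, the prices, and the option names), and the `grace_hours` idea is my own assumption that a shop can keep cutting for a while before it needs the scheduling board or ERP again.

```python
def outage_cost(outage_hours: float, grace_hours: float, cost_per_hour: float) -> float:
    """Cost of one outage. The first grace_hours are assumed to cost nothing,
    because operators can finish the job already on the machine."""
    billable_hours = max(0.0, outage_hours - grace_hours)
    return billable_hours * cost_per_hour


# Placeholder figures only; the real ones have to come from operations or the execs.
COST_PER_HOUR = 10_000  # lost production across all sites, dollars per hour
GRACE_HOURS = 1.0       # time shops can keep working without the core systems

# Hypothetical options with guessed recovery times and prices.
options = {
    "current setup (restore from backup)": {"rto_hours": 8.0, "price": 0},
    "third host as replication target": {"rto_hours": 1.5, "price": 15_000},
    "shared storage with HA": {"rto_hours": 0.25, "price": 60_000},
}


def summarize(opts: dict, grace_hours: float, cost_per_hour: float) -> list:
    """Return (name, price, loss per outage) for each option."""
    rows = []
    for name, opt in opts.items():
        loss = outage_cost(opt["rto_hours"], grace_hours, cost_per_hour)
        rows.append((name, opt["price"], loss))
    return rows


rows = summarize(options, GRACE_HOURS, COST_PER_HOUR)
for name, price, loss in rows:
    print(f"{name}: solution ${price:,.0f}, loss per host failure ${loss:,.0f}")
```

Without a believable `COST_PER_HOUR`, the comparison can't say which option pays for itself, which is exactly the problem I'm trying to get across.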
We do have a couple of spare HP servers from 2008 that might work to run ESXi, but neither has enough storage to run critical VMs should we have an issue with a host. And even if they did, would you rely on a host you bought in 2008 to run the latest version of vSphere and have it work properly to run VMs that are critical to your business should you be presented with a disaster? In my mind, if we had a host fail, I'd be messing with our older servers first to see if they work, and then heading to a Fry's or MicroCenter to buy a small server to help us recover, which may or may not run ESXi. I know for a fact the boss will ask why we can't use that old equipment for something.
I'm not sure what I am looking for here in terms of a response. I think I had the right approach: get the cost of downtime, then come prepared with a feasible solution based on RTO, RPO, and that cost. I'd love to hear thoughts from anyone out there who wants to contribute.