Cart before the Horse with RPO and RTO - Growing Core Infrastructure with the Company
-
You definitely do not need to know the cost of downtime to work on solutions. You will need it, though, to present which option best fits the needs of the business.
I know you are invested in hardware already, but your workload sounds like a really nice fit for a Scale cluster. I would most certainly look very hard at that before increasing your VMware subscriptions.
I would look at the systems and see if memory and drives can feasibly be added. Something along the lines of dropping all the 10K drives for larger NL SAS drives and then getting some SSDs in RAID 5 for your high speed needs.
Obviously with multiple arrays you need to plan a little more carefully on your VM locations, but that is a really good choice that should allow you to keep your existing hardware mostly intact.
-
@NetworkNerd said:
When I mentioned the need to know cost of downtime, he suggested rather than talk to folks in operations and try to get them to ballpark that for us, we should talk to the execs about where we are now, how long it would take us to recover with what we have, and present possible solutions and costs to close that gap.
I remember this conversation from many years ago. When you tried to get the cost of downtime while putting in the current system, they wanted you to act as the finance department and tell the financial people what they should be telling you. It sounds like, four years on, they are still not on top of the CFO's office and are hoping that IT will fill the gap. They are still missing the basic idea that closing the gap doesn't matter because no one knows what the gap is.
-
Memory can be added to either host without any trouble. The HP DL385 G7 is maxed out at 16 drives (upping the capacity would require all new drives), but the Cisco UCS 240 has only 16 drive bays in use with 8 free. If we added SSDs to the Cisco server and made a datastore just for Epicor VMs, that certainly wouldn't hurt. We'd just have to make sure we got the right type of SSDs for the workload, which is about 80/20 read/write, I believe.
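As a rough way to size that SSD datastore, the common RAID write-penalty rule of thumb can be sketched as below. Only the 80/20 mix comes from the estimate above; the drive count and per-drive IOPS are placeholder assumptions, not measurements from either host, and `effective_iops` is a hypothetical helper name.

```python
# Rule-of-thumb RAID sizing. Back-end write cost per front-end write:
# RAID 5 = 4 I/Os (read data, read parity, write data, write parity).
RAID_WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}


def effective_iops(drives, iops_per_drive, read_fraction, raid_level):
    """Estimate the front-end IOPS an array can serve for a read/write mix."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    raw_iops = drives * iops_per_drive
    write_fraction = 1.0 - read_fraction
    penalty = RAID_WRITE_PENALTY[raid_level]
    # Each front-end read costs 1 back-end I/O, each write costs `penalty`.
    return raw_iops / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Hypothetical: 4 SSDs at 10,000 IOPS each, 80/20 read/write.
    for level in ("raid5", "raid10"):
        print(level, round(effective_iops(4, 10_000, 0.80, level)))
```

With an 80/20 mix the RAID 5 penalty is noticeable but not crippling, which is why the read/write ratio matters when picking the SSD type and RAID level.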
-
There is another way to look at this... IT should be able to produce RPO/RTO numbers based on the infrastructure. Do this and let other departments determine if there is a reason to not like those numbers. IT should not "worry" about this as long as the departments know the facts. If the RTO/RPO is too long, they should come back to you, rather than you guessing that the numbers are not good for them. They have the numbers and know whether that is so; you are just guessing.
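A minimal sketch of how IT might produce an RTO figure from the infrastructure alone. The data size, restore throughput, and prep/verify times are hypothetical inputs you would replace with measured values from a test restore; `estimate_rto_hours` is an illustrative name, not part of any backup product.

```python
def estimate_rto_hours(data_gb, restore_mb_per_s, prep_hours=0.0, verify_hours=0.0):
    """Estimate recovery time: prep work + data transfer + verification."""
    if restore_mb_per_s <= 0:
        raise ValueError("restore_mb_per_s must be positive")
    transfer_seconds = (data_gb * 1024) / restore_mb_per_s
    return prep_hours + transfer_seconds / 3600 + verify_hours


if __name__ == "__main__":
    # Hypothetical: 1 TB of critical VMs restored at 100 MB/s,
    # 1 hour to prepare a host, 0.5 hours to verify services.
    hours = estimate_rto_hours(1024, 100, prep_hours=1.0, verify_hours=0.5)
    print(f"Estimated RTO: {hours:.1f} hours")
```

The point is that this number comes from facts IT already owns. Whether it is acceptable is for the other departments to say.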
-
@NetworkNerd said:
We do have a couple of spare HP servers from 2008 that might work to run ESXi, but neither has enough storage to run critical VMs should we have an issue with a host. And even if they did, would you rely on a host you bought in 2008 to run the latest version of vSphere and have it work properly to run VMs that are critical to your business should you be presented with a disaster?
It's a DR scenario. In many cases, yes I would. You would have vSphere installed and ready to go. The server should be regularly tested. That it is from 2008 is a pretty minor concern. If it lacks memory or capacity, that is its own concern. But a DR system from 2008 isn't a very big deal on its own.
-
@scottalanmiller said:
@NetworkNerd said:
We do have a couple of spare HP servers from 2008 that might work to run ESXi, but neither has enough storage to run critical VMs should we have an issue with a host. And even if they did, would you rely on a host you bought in 2008 to run the latest version of vSphere and have it work properly to run VMs that are critical to your business should you be presented with a disaster?
It's a DR scenario. In many cases, yes I would. You would have vSphere installed and ready to go. The server should be regularly tested. That it is from 2008 is a pretty minor concern. If it lacks memory or capacity, that is its own concern. But a DR system from 2008 isn't a very big deal on its own.
You make a good point. I'd say without vSphere on them right now and confirmation that all drives and components are functional, that adds an hour or two to RTO off the bat (far more if the spares won't work) before I can even start restoring VMs.
But if we replicated the critical VMs to those hosts with Veeam even once per day (assuming we have enough storage - one of them has very little while the other has around 1 TB), turning on the replicas wouldn't take long if a host tanked.
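To put the two approaches side by side, here is a small sketch. The two hours for installing vSphere is the upper end of my estimate above and the daily copy interval matches the once-per-day replication idea; the restore and power-on durations are placeholder assumptions, and `RecoveryOption` is just an illustrative structure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RecoveryOption:
    name: str
    step_hours: Dict[str, float]   # recovery step -> estimated hours
    copy_interval_hours: float     # hours between backups or replica syncs

    def rto_hours(self) -> float:
        """Total time to get critical VMs running again."""
        return sum(self.step_hours.values())

    def worst_case_rpo_hours(self) -> float:
        """Worst case data loss is one full interval between copies."""
        return self.copy_interval_hours


cold_spare = RecoveryOption(
    name="Cold spare, restore from backup",
    step_hours={"install and check vSphere": 2.0, "restore VMs": 4.0},  # restore time is hypothetical
    copy_interval_hours=24.0,
)
warm_replica = RecoveryOption(
    name="Spare with daily replicas",
    step_hours={"power on replicas": 0.25},  # hypothetical
    copy_interval_hours=24.0,
)

if __name__ == "__main__":
    for option in (cold_spare, warm_replica):
        print(f"{option.name}: RTO ~{option.rto_hours():.2f} h, "
              f"RPO up to {option.worst_case_rpo_hours():.0f} h")
```

Replicating once per day does not improve RPO over a daily backup, but it removes the install and restore steps from RTO.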
-
@NetworkNerd said:
You make a good point. I'd say without vSphere on them right now and confirmation that all drives and components are functional, that adds an hour or two to RTO off the bat (far more if the spares won't work) before I can even start restoring VMs.
But if we replicated the critical VMs to those hosts with Veeam even once per day (assuming we have enough storage - one of them has very little while the other has around 1 TB), turning on the replicas wouldn't take long if a host tanked.
Even without replicating the data (although if you have the capacity, that's best since there appears to be no additional cost involved) just having vSphere installed, updated and tested would save a lot.
-
@scottalanmiller said:
@NetworkNerd said:
When I mentioned the need to know cost of downtime, he suggested rather than talk to folks in operations and try to get them to ballpark that for us, we should talk to the execs about where we are now, how long it would take us to recover with what we have, and present possible solutions and costs to close that gap.
I remember this conversation from many years ago. When you tried to get the cost of downtime while putting in the current system, they wanted you to act as the finance department and tell the financial people what they should be telling you. It sounds like, four years on, they are still not on top of the CFO's office and are hoping that IT will fill the gap. They are still missing the basic idea that closing the gap doesn't matter because no one knows what the gap is.
I think someone can help quantify the gap, and I'm hoping I can find that individual and get them to help me. I seem to be the one who is most interested in gap insurance.
-
@NetworkNerd said:
I think someone can help quantify the gap, and I'm hoping I can find that individual and get them to help me. I seem to be the one who is most interested in gap insurance.
This should be the red flag. Why is IT driving financial decisions? It should not. It should be a partner in helping meet operational and financial goals, but it should not be taking over the role of CFO and telling the business how to run. If the CEO does not share your concern, it means that your concern is not aligned with the business.
This is what I call "AJ-ism", and it is pretty common in IT. AJ got a little famous for this because he went beyond "concern" to literally being willing to lose his job fighting the business, trying to make it do what it did not agree needed to be done. Unless your fight is for ethics or safety, IT should not be taking the lead here, at all.
If factors change, it is one thing for IT to remind people that RTO has expanded due to load changes, or to present new costs because things have changed. But trying to convince the owners that their financial planning isn't as good as yours and that you should be driving the financial decisions of the company is a fundamentally wrong course for IT. If this is even slightly the case, you should be in the CFO's office running finance, not in IT, because you'd be far more valuable there.
-
Look at it another way... your boss wants to keep you happy. But they just told you, flat out, that your concern isn't important enough to the business for them to even be willing to give you the necessary numbers to figure out another course of action. To me, it sounds like they politely shot down your project. If you come back with "zero spend" options, maybe they will like that. Probably they will like that. But it sounds like if money needs to be spent, they have told you that they don't really want to talk about it.
-
@NetworkNerd said:
Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites
This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.
Same with the Elastix server. At least in a DR scenario this would still be running.
-
@johnhooks said:
@NetworkNerd said:
Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites
This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.
Same with the Elastix server. At least in a DR scenario this would still be running.
They certainly could. Those are definitely good suggestions that take the heat off the infrastructure at HQ. I'll have to see what pricing is like to do that. Thanks.
-
@johnhooks said:
@NetworkNerd said:
Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites
This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.
That wouldn't make sense. At least not for the reasons you state. Remember there need to be valid business reasons for doing this. In the case of DR, it's to provide business continuity. What business disruption will be caused if the UniFi controller is down?
-
@Jason said:
@johnhooks said:
@NetworkNerd said:
Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites
This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.
That wouldn't make sense. At least not for the reasons you state. Remember there need to be valid business reasons for doing this. In the case of DR, it's to provide business continuity. What business disruption will be caused if the UniFi controller is down?
There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue:
Since those two servers were put in, the number of users has grown, the number of VMs has grown, and the amount of storage in use has grown. We're at the point where neither server would have enough storage to run all VMs from the other server if one of them failed. That's a problem and creates an interesting DR situation.
It could most likely be run on the lowest tier DO or Vultr server for ~$5 a month. It was also listed as a core application, so having that off site if possible would be better.
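The storage problem quoted above can be checked with a quick calculation like the one below. All capacity figures are made-up placeholders, since the actual numbers for the two hosts aren't given here, and the 10% headroom is an assumption.

```python
def usable_gb(capacity_gb, headroom_fraction=0.10):
    """Capacity left after reserving headroom for snapshots and growth."""
    return capacity_gb * (1 - headroom_fraction)


def can_absorb(survivor_capacity_gb, survivor_used_gb, failed_host_used_gb,
               headroom_fraction=0.10):
    """True if the surviving host can hold its own VMs plus the failed host's."""
    needed = survivor_used_gb + failed_host_used_gb
    return needed <= usable_gb(survivor_capacity_gb, headroom_fraction)


def shortfall_gb(survivor_capacity_gb, survivor_used_gb, failed_host_used_gb,
                 headroom_fraction=0.10):
    """How much storage is missing for a full failover (0 if it fits)."""
    needed = survivor_used_gb + failed_host_used_gb
    return max(0.0, needed - usable_gb(survivor_capacity_gb, headroom_fraction))


if __name__ == "__main__":
    # Hypothetical: 4 TB host already using 2.5 TB must take on 2 TB more.
    print(can_absorb(4000, 2500, 2000))
    print(shortfall_gb(4000, 2500, 2000))
```

Moving small VMs off site only shrinks the shortfall by whatever those VMs use, so it helps at the margin rather than closing the gap.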
-
@johnhooks said:
@Jason said:
@johnhooks said:
@NetworkNerd said:
Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites
This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.
That wouldn't make sense. At least not for the reasons you state. Remember there need to be valid business reasons for doing this. In the case of DR, it's to provide business continuity. What business disruption will be caused if the UniFi controller is down?
There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue
You wouldn't be worrying about something that provides no business continuity in a DR situation. That would be something you can deal with much later. The focus should be on things that directly have a monetary impact on the business.
-
@johnhooks said:
There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue
No disruption means no DR worry. Paying for DR facilities to eliminate IT effort that has no business impact cost is very hard to justify. It's a difficult conversation to have with the CEO: "Well, we are going to pay for this extra project and monthly cost so that I don't have to do as much work." There are cases where the work reduction really does justify that, but this doesn't feel like one of those.
-
It's also now removed from two servers that are already overcommitted. With both it and Elastix gone, that frees up resources for something else. While both VMs are minimal, it could still help.
-
@scottalanmiller said:
@johnhooks said:
There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue
No disruption means no DR worry. Paying for DR facilities to eliminate IT effort that has no business impact cost is very hard to justify. It's a difficult conversation to have with the CEO: "Well, we are going to pay for this extra project and monthly cost so that I don't have to do as much work." There are cases where the work reduction really does justify that, but this doesn't feel like one of those.
No worry for that, but there is still worry about the other systems that won't all fit on a single server. It may be minimal, but it's still freeing up resources. And it's only $5 a month.
-
@johnhooks said:
It's also now removed from two servers that are already overcommitted. With both it and Elastix gone, that frees up resources for something else. While both VMs are minimal, it could still help.
True, that helps with capacity a little. Although VERY little, we assume.
-
@scottalanmiller said:
@johnhooks said:
It's also now removed from two servers that are already overcommitted. With both it and Elastix gone, that frees up resources for something else. While both VMs are minimal, it could still help.
True, that helps with capacity a little. Although VERY little, we assume.
Right, but if you're that low on resources, every little bit helps. Especially if you have to overtax an already overcommitted server by adding more to it in a bad scenario.