Domain Controller Down (VM)
-
I didn't know what kind of medical facility @wirestyle22 was in.
OK, since the place is 24/7 he needs a higher than normal amount of uptime - fine. But real HA? Really? I know XenServer and Hyper-V can both do storage motion while the system is running, so no shared storage is needed (granted, XS is super slow), so you don't need HA to do patches, you just need the storage motion options. I don't know if that's available in ESXi Essentials or not.
If HA is fully thought out and is felt to be needed (don't forget about the power situation, cooling, etc. - remember, HA isn't a product, it's a process), then they should fully realize it. I'm guessing by the fact that the switches were 100 Mb that it really wasn't fully thought out; instead, someone in a position of authority thought it sounded good and they tossed in what they had on hand that day.
As for the rest, I generally agree with you. It shows the real costs of DOING IT RIGHT - but as most of us know - few SMBs are really willing to do what's right in IT.
Hell, just look at all of the threads in SW talking about print shops that couldn't upgrade their XP machines because their 10K+ printers didn't support anything newer. It's a never-ending problem of knowing the real costs of doing something right.
-
@Dashrender said in Domain Controller Down (VM):
OK, since the place is 24/7 he needs a higher than normal amount of uptime - fine. But real HA? Really? I know XenServer and Hyper-V can both do storage motion while the system is running, so no shared storage is needed (granted, XS is super slow), so you don't need HA to do patches, you just need the storage motion options. I don't know if that's available in ESXi Essentials or not.
I can't disagree more. I've seen someone try to do this in an SMB and they got fired.
It is available in ESXi (it's a bit faster in 5.5; ESXi has a proper IO mirror driver, so you don't have helper snapshots in a never-ending catch-up process to handle the IO happening during the merge of snapshots).
- Doing shared-nothing migrations impacts performance (seriously, look at the disk latency the next time you do it). Telling management "well, we kicked off the migration 7 hours ago and we can't really stop it" is a great way to get shown the door.
- This doesn't scale, and can make patch windows take DAYS very quickly (see the rough numbers sketched below). No one would seriously consider this for monthly patching.
- If you have high enough IO and are using a hypervisor that lacks a mirror driver, you end up with a never-ending series of snapshot merges.
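To put rough numbers behind the scaling point, here is a minimal back-of-the-envelope sketch in Python. Every figure in it (VM count, disk size, copy throughput, write and merge rates) is a hypothetical assumption chosen only for illustration, not a measurement from this environment:

```python
# Back-of-the-envelope model for shared-nothing storage migrations during patching.
# All numbers are hypothetical assumptions used only to illustrate the scaling problem.

def migration_hours(vm_disk_gb: float, effective_throughput_mb_s: float) -> float:
    """Hours needed to copy one VM's disks at a sustained effective throughput (MB/s)."""
    return (vm_disk_gb * 1024) / effective_throughput_mb_s / 3600


def merge_converges(guest_write_mb_s: float, merge_mb_s: float) -> bool:
    """Without an IO mirror driver, the final snapshot consolidation only catches up
    if it can merge faster than the guest keeps writing new data."""
    return merge_mb_s > guest_write_mb_s


# Hypothetical host: 10 VMs at 500 GB each, 100 MB/s effective copy rate,
# and each VM has to be moved off the host and back again after patching (x2).
vms, disk_gb, throughput = 10, 500, 100
total_hours = 2 * vms * migration_hours(disk_gb, throughput)
print(f"Evacuate and return one host: ~{total_hours:.1f} hours")  # ~28.4 hours for ONE host

# A busy guest writing 60 MB/s against a 50 MB/s consolidation rate never converges.
print(merge_converges(guest_write_mb_s=60, merge_mb_s=50))  # False -> endless helper snapshots
```

Multiply that by a handful of hosts and a monthly patch cadence and the window quickly stretches into days.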
-
@Dashrender said in Domain Controller Down (VM):
OK, since the place is 24/7 he needs a higher than normal amount of uptime - fine. But real HA? Really? I know XenServer and Hyper-V can both do storage motion while the system is running, so no shared storage is needed (granted, XS is super slow), so you don't need HA to do patches, you just need the storage motion options. I don't know if that's available in ESXi Essentials or not.
Storage motion is not for production hours. That's great if you have a greenzone, but if you have that, you don't need the storage motion. Storage motion is mostly for migrations and one-time, unavoidable events. It's not something you do during production time unless you have no choice (a dying storage system).
-
@Dashrender said in Domain Controller Down (VM):
If HA is fully thought out and is felt to be needed (don't forget about the power situation, cooling, etc. - remember, HA isn't a product, it's a process), then they should fully realize it. I'm guessing by the fact that the switches were 100 Mb that it really wasn't fully thought out; instead, someone in a position of authority thought it sounded good and they tossed in what they had on hand that day.
It's as simple as "there was no HA and no attempt made at it."
-
@Dashrender said in Domain Controller Down (VM):
I didn't know what kind of medical facility @wirestyle22 was in.
If HA is fully thought out and is felt to be needed (don't forget about the power situation, cooling, etc. - remember, HA isn't a product, it's a process), then they should fully realize it. I'm guessing by the fact that the switches were 100 Mb that it really wasn't fully thought out; instead, someone in a position of authority thought it sounded good and they tossed in what they had on hand that day.
Medical facilities with beds have generators and fuel. HVAC for something this small can be covered for redundancy with a spot cooler (I have this in my own house for my lab, so if I can afford it, you have to be a tiny outfit to not be able to afford it). I agree it's a process, and the biggest piece is having an MSP to back you up, and having 24/7 dispatched resources to help you with the persistence layer. Not having redundancy at the people level is the biggest issue to address. While I normally advocate some kind of offsite, ready-to-fire DR, in the case of a facility like this it's not actually as important (beyond BC reasons), because if the whole facility blows up, the need for the system goes with it. Still, there are a bazillion Veeam/VCAN partners who can cover this piece for cheap, so why not.
-
@scottalanmiller said in Domain Controller Down (VM):
@Dashrender said in Domain Controller Down (VM):
If HA is fully thought out and is felt to be needed (don't forget about the power situation, cooling, etc. - remember, HA isn't a product, it's a process), then they should fully realize it. I'm guessing by the fact that the switches were 100 Mb that it really wasn't fully thought out; instead, someone in a position of authority thought it sounded good and they tossed in what they had on hand that day.
It's as simple as "there was no HA and no attempt made at it."
It would take me about five minutes to explain to a 3rd grader why it's bad that the system he has isn't redundant. The fact that it continues to exist shows that either...
- Management has an intellectual capacity below a 3rd grader's (possible), or
- No one explained, in non-jargon English, how bad this configuration was (more likely).
-
@John-Nicholson said in Domain Controller Down (VM):
It's a medical facility that has beds occupied 24/7, so yes.
That doesn't mean that. We can equally say they didn't have 24x7 IT staff so they don't need it. What they need, we have no way of knowing. If we read back what we know about their environment, it tells us that they didn't think that they needed HA in any way whatsoever. But that's all we have to go on. They operate around the clock, but that isn't an HA concern. And they implemented something so far from HA that it is laughable. So all we know is that they implemented anti-HA and spent a lot to do it. That's it. We have no indication that HA is warranted in any way.
Just because a shop is 24x7 medical doesn't tell us that a specific system is needed 24x7 or that it needs to be available at all times. Those are very different requirements.
-
@Dashrender said in Domain Controller Down (VM):
As for the rest, I generally agree with you. It shows the real costs of DOING IT RIGHT - but as most of us know - few SMBs are really willing to do what's right in IT.
Hell, just look at all of the threads in SW talking about print shops that couldn't upgrade their XP machines because their 10K+ printers didn't support anything newer. It's a never-ending problem of knowing the real costs of doing something right.
The real cost of doing IT right is cheaper. Simply not having an onsite FTE and having an MSP manage this stuff is likely cheaper (FTEs are expensive!). This outage might have been embarrassing enough for them to lose a patient or two (or worse, someone dies and they get hit with a million-dollar wrongful death lawsuit that spikes their premiums). Doing IT RIGHT includes understanding the capex and opex costs, and the associated risks and external costs of doing IT right or wrong.
Doing IT Wrong means wasting tons of money and getting an output that causes other costs. IT budgets do NOT exist in a vacuum, separate from the rest of the operations and their output (especially in 2016!).
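To make the capex/opex point concrete, here is a minimal sketch of the kind of comparison being described. All of the dollar figures, the overhead multiplier, and the outage probability below are hypothetical assumptions for illustration, not numbers from this facility:

```python
# Rough annual comparison: onsite FTE vs. MSP contract vs. expected outage cost.
# Every figure is a hypothetical assumption used only to show the shape of the math.

fte_salary = 60_000
fte_fully_loaded = fte_salary * 1.4      # assumed ~40% overhead for benefits, taxes, training

msp_monthly = 2_500                      # assumed managed-services contract
msp_annual = msp_monthly * 12

# Expected annual outage cost = probability of a major outage * cost when it happens.
outage_probability = 0.25                # assumed one serious outage every four years
outage_cost = 150_000                    # assumed lost billing, remediation, legal exposure
expected_outage_cost = outage_probability * outage_cost

print(f"FTE (fully loaded):   ${fte_fully_loaded:>9,.0f}")     # $   84,000
print(f"MSP contract:         ${msp_annual:>9,.0f}")            # $   30,000
print(f"Expected outage cost: ${expected_outage_cost:>9,.0f}")  # $   37,500
```

The exact numbers will vary wildly by shop; the point is that the risk side of the ledger belongs in the same calculation as the staffing side.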
-
@John-Nicholson said in Domain Controller Down (VM):
A proper MSP is like having an enterprise support army in your back pocket for less than the cost of an FTE. Honestly, as an SMB you shouldn't hire an in-house resource before you hire an MSP, and any shop that doesn't want to pay for an MSP but will pay for an FTE is a GIANT red flag that they lack any level of competence in IT governance, budgeting, or common sense.
I agree. Anyone going into an FTE role in an SMB should probably ask what their MSP ecosystem of support is like BEFORE accepting a position. That's something that we never talk about but is a great idea. They should either have a great answer (and the MSP should likely be part of the interview process) or they should be like "that's why we are bringing you in, to help us find those good resources."
-
@John-Nicholson said in Domain Controller Down (VM):
@Dashrender said in Domain Controller Down (VM):
As for the rest, I generally agree with you. It shows the real costs of DOING IT RIGHT - but as most of us know - few SMBs are really willing to do what's right in IT.
Hell, just look at all of the threads in SW talking about print shops that couldn't upgrade their XP machines because their 10K+ printers didn't support anything newer. It's a never-ending problem of knowing the real costs of doing something right.
The real cost of doing IT right is cheaper. Simply not having an onsite FTE and having an MSP manage this stuff is likely cheaper (FTEs are expensive!). This outage might have been embarrassing enough for them to lose a patient or two (or worse, someone dies and they get hit with a million-dollar wrongful death lawsuit that spikes their premiums). Doing IT RIGHT includes understanding the capex and opex costs, and the associated risks and external costs of doing IT right or wrong.
Doing IT Wrong means wasting tons of money and getting an output that causes other costs. IT budgets do NOT exist in a vacuum, separate from the rest of the operations and their output (especially in 2016!).
"IT Right" isn't even a thing. IT is just part of the business. It's "running the business right."
-
We actually did a video on that last night; it is being edited right now.
-
@scottalanmiller said in Domain Controller Down (VM):
@John-Nicholson said in Domain Controller Down (VM):
It's a medical facility that has beds occupied 24/7, so yes.
That doesn't mean that. We can equally say they didn't have 24x7 IT staff so they don't need it. What they need, we have no way of knowing. If we read back what we know about their environment, it tells us that they didn't think that they needed HA in any way whatsoever. But that's all we have to go on. They operate around the clock, but that isn't an HA concern. And they implemented something so far from HA that it is laughable. So all we know is that they implemented anti-HA and spent a lot to do it. That's it. We have no indication that HA is warranted in any way.
Just because a shop is 24x7 medical doesn't tell us that a specific system is needed 24x7 or that it needs to be available at all times. Those are very different requirements.
The EMR is on the system, and I've yet to meet a medical facility with 24/7 manned beds whose SLA accepts a 12-hour outage for that. Ultimately, two things drive medical standards and outcomes:
- In America, a jury in a wrongful death case is the arbitrator of what was acceptable or not in medical spending and outcomes (lawsuits drive medical standards).
- The federal government's willingness to reimburse you for spending. Any patient care administered while the system was down and not recording is not paid for, and while people have downtime procedures, the risk of missing out on some juicy procedures or pills or other things means this can add up quickly.
-
@John-Nicholson said in Domain Controller Down (VM):
@scottalanmiller said in Domain Controller Down (VM):
@John-Nicholson said in Domain Controller Down (VM):
It's a medical facility that has beds occupied 24/7, so yes.
That doesn't mean that. We can equally say they didn't have 24x7 IT staff so they don't need it. What they need, we have no way of knowing. If we read back what we know about their environment, it tells us that they didn't think that they needed HA in any way whatsoever. But that's all we have to go on. They operate around the clock, but that isn't an HA concern. And they implemented something so far from HA that it is laughable. So all we know is that they implemented anti-HA and spent a lot to do it. That's it. We have no indication that HA is warranted in any way.
Just because a shop is 24x7 medical doesn't tell us that a specific system is needed 24x7 or that it needs to be available at all times. Those are very different requirements.
The EMR is on the system, and I've yet to meet a medical facility with 24/7 manned beds whose SLA accepts a 12-hour outage for that. Ultimately, two things drive medical standards and outcomes:
- In America, a jury in a wrongful death case is the arbitrator of what was acceptable or not in medical spending and outcomes (lawsuits drive medical standards).
- The federal government's willingness to reimburse you for spending. Any patient care administered while the system was down and not recording is not paid for, and while people have downtime procedures, the risk of missing out on some juicy procedures or pills or other things means this can add up quickly.
That's fine, BUT the ONLY thing we know for certain is what they were willing to implement previously. We don't know what kind of medicine they work in, what risks there are, what EMR dependencies there are. Sure, they can't bill for twelve hours, but that might cost them nothing while uptime costs something. It all depends. What we DO know is that they didn't have the hardware, planning, documentation, staff or support organizations for anything other than what they got. So based on the sole information that we have, we can't assume that their business believes in uptime. Even during the outage, they made it VERY clear that getting it fixed was not a priority but that status updates, conversations and even other IT needs were the priority.
We have a pretty uniform picture that uptime on this system is not perceived as important by the business decision makers, even during the panic fire of a real outage.
-
I totally understand that there are medical situations where high availability and high uptime are considered necessary and make sense in a business context. And I totally agree that this has the potential to be one of them. I'm only saying that it being possible doesn't make it so and that all indications from reading back their previous decisions, investments and behaviour suggest that they do not agree with that assessment.
-
@John-Nicholson said in Domain Controller Down (VM):
The EMR is on the system, and I've yet to meet a medical facility with 24/7 manned beds whose SLA accepts a 12-hour outage for that. Ultimately, two things drive medical standards and outcomes:
- In America, a jury in a wrongful death case is the arbitrator of what was acceptable or not in medical spending and outcomes (lawsuits drive medical standards).
- The federal government's willingness to reimburse you for spending. Any patient care administered while the system was down and not recording is not paid for, and while people have downtime procedures, the risk of missing out on some juicy procedures or pills or other things means this can add up quickly.
1 - I'll take your word for it at this point.
2 - What prevents you from documenting on paper and then entering it when the system comes back up? Every one I know operates this way, and they do get paid for those things that are transposed to electronic records after the fact.
-
@scottalanmiller said in Domain Controller Down (VM):
@John-Nicholson said in Domain Controller Down (VM):
@scottalanmiller said in Domain Controller Down (VM):
@John-Nicholson said in Domain Controller Down (VM):
It's a medical facility that has beds occupied 24/7, so yes.
That doesn't mean that. We can equally say they didn't have 24x7 IT staff so they don't need it. What they need, we have no way of knowing. If we read back what we know about their environment, it tells us that they didn't think that they needed HA in any way whatsoever. But that's all we have to go on. They operate around the clock, but that isn't an HA concern. And they implemented something so far from HA that it is laughable. So all we know is that they implemented anti-HA and spent a lot to do it. That's it. We have no indication that HA is warranted in any way.
Just because a shop is 24x7 medical doesn't tell us that a specific system is needed 24x7 or that it needs to be available at all times. Those are very different requirements.
The EMR is on the system, and I've yet to meet a medical facility with 24/7 manned beds whose SLA accepts a 12-hour outage for that. Ultimately, two things drive medical standards and outcomes:
- In America, a jury in a wrongful death case is the arbitrator of what was acceptable or not in medical spending and outcomes (lawsuits drive medical standards).
- The federal government's willingness to reimburse you for spending. Any patient care administered while the system was down and not recording is not paid for, and while people have downtime procedures, the risk of missing out on some juicy procedures or pills or other things means this can add up quickly.
That's fine, BUT the ONLY thing we know for certain is what they were willing to implement previously. We don't know what kind of medicine they work in, what risks there are, what EMR dependencies there are. Sure, they can't bill for twelve hours, but that might cost them nothing while uptime costs something. It all depends. What we DO know is that they didn't have the hardware, planning, documentation, staff or support organizations for anything other than what they got. So based on the sole information that we have, we can't assume that their business believes in uptime. Even during the outage, they made it VERY clear that getting it fixed was not a priority but that status updates, conversations and even other IT needs were the priority.
We have a pretty uniform picture that uptime on this system is not perceived as important by the business decision makers, even during the panic fire of a real outage.
This type of argument is something I see you make all the time. Just because the system didn't perform in the manner that they wanted/needed doesn't mean that they weren't trying to obtain it just the same. What it does mean is that whoever they hired to accomplish that goal lied to them (assuming that really was the goal).
If you're the business owner and you don't know squat about IT, so you hire George the IT consultant - how is the owner supposed to know whether George did the job right or wrong? Unless you're telling me that the owner should be hiring a second consultant to look over George's work to make sure it was what the owner really wanted?
-
@scottalanmiller said in Domain Controller Down (VM):
I totally understand that there are medical situations where high availability and high uptime are considered necessary and make sense in a business context. And I totally agree that this has the potential to be one of them. I'm only saying that it being possible doesn't make it so and that all indications from reading back their previous decisions, investments and behaviour suggest that they do not agree with that assessment.
Again - read my previous post. Assuming the owners aren't IT personnel, how are they SUPPOSED to know? It was like John asking why WS didn't refresh the iSCSI connection instead of rebooting the whole switch - if he's never done it before, how's he supposed to know? All they can do is trust those that they hire to do what was asked.
-
@Dashrender said in Domain Controller Down (VM):
@scottalanmiller said in Domain Controller Down (VM):
@John-Nicholson said in Domain Controller Down (VM):
@scottalanmiller said in Domain Controller Down (VM):
@John-Nicholson said in Domain Controller Down (VM):
It's a medical facility that has beds occupied 24/7, so yes.
That doesn't mean that. We can equally say they didn't have 24x7 IT staff so they don't need it. What they need, we have no way of knowing. If we read back what we know about their environment, it tells us that they didn't think that they needed HA in any way whatsoever. But that's all we have to go on. They operate around the clock, but that isn't an HA concern. And they implemented something so far from HA that it is laughable. So all we know is that they implemented anti-HA and spent a lot to do it. That's it. We have no indication that HA is warranted in any way.
Just because a shop is 24x7 medical doesn't tell us that a specific system is needed 24x7 or that it needs to be available at all times. Those are very different requirements.
The EMR is on the system, and I've yet to meet a medical facility with 24/7 manned beds whose SLA accepts a 12-hour outage for that. Ultimately, two things drive medical standards and outcomes:
- In America, a jury in a wrongful death case is the arbitrator of what was acceptable or not in medical spending and outcomes (lawsuits drive medical standards).
- The federal government's willingness to reimburse you for spending. Any patient care administered while the system was down and not recording is not paid for, and while people have downtime procedures, the risk of missing out on some juicy procedures or pills or other things means this can add up quickly.
That's fine, BUT the ONLY thing we know for certain is what they were willing to implement previously. We don't know what kind of medicine they work in, what risks there are, what EMR dependencies there are. Sure, they can't bill for twelve hours, but that might cost them nothing while uptime costs something. It all depends. What we DO know is that they didn't have the hardware, planning, documentation, staff or support organizations for anything other than what they got. So based on the sole information that we have, we can't assume that their business believes in uptime. Even during the outage, they made it VERY clear that getting it fixed was not a priority but that status updates, conversations and even other IT needs were the priority.
We have a pretty uniform picture that uptime on this system is not perceived as important by the business decision makers, even during the panic fire of a real outage.
This type of argument is something I see you make all the time. Just because the system didn't perform in the manner that they wanted/needed doesn't mean that they weren't trying to obtain it just the same. What it does mean is that whoever they hired to accomplish that goal lied to them (assuming that really was the goal).
I didn't say that it did. I said that it was the only information that we have, and that every decision, both planned and in triage, pointed to the same conclusion - that they don't care about uptime. That's it, period. ANYTHING other than this is someone here injecting personal opinion into the mix. Pushing HA where no HA is suggested. We have no reason to suspect that they ever felt that HA was going to happen. That's an assumption based on nothing at all.
That doesn't mean that they didn't; it only means that there is zero evidence to suggest it. All the evidence that we have points away. It's that simple. They took no actions towards HA, they didn't state that they wanted HA, they didn't provide documentation as to why HA would be needed, and they didn't behave in an HA way.
-
@Dashrender said in Domain Controller Down (VM):
If you're the business owner and you don't know squat about IT, so you hire George the IT consultant - how is the owner supposed to know whether George did the job right or wrong? Unless you're telling me that the owner should be hiring a second consultant to look over George's work to make sure it was what the owner really wanted?
You are only making an argument for why the evidence that they don't want HA is not very strong. I never said that it was. You are not even slightly making an argument that they wanted HA, only that we don't know much based on the evidence, and based on the assumption that the CEO is a moron and can't do his job. Other than that being a moderately safe assumption, as it is generally the case in SMBs, it tells us nothing. I never stated anything to the contrary, so pointing this out doesn't dispute my point.
-
What that DOES suggest, however, is that whatever body is in charge of the CEO felt that the CEO was able to do their job. At any step like this, you are assuming that the investors are not holding the board accountable, that the board is not holding the CEO accountable, that the CEO is not able to manage and hire at a business-like level, that all decisions and oversight have been bad, and that all management and results in real time are not what is wanted. Is all of that possible? Of course. But that is a lot of assumption to hoist onto a company based on zero evidence, just the assumption that they "must want that."
Given that all the evidence points away from them feeling that HA was warranted in any way, that even SA was not needed, and that they continued to act that way during the outage, de-prioritized the outage, basically ignored the outage and have never suggested otherwise, I think it's early to make such a sweeping assumption.