Scheduling Simple Local Linux Reboots

scottalanmiller

@aaron said in Scheduling Simple Local Linux Reboots:

I'm not against automation. I think y'all got the wrong idea. I'm against screwing the on-call team on a Friday evening because servers are rebooting and nobody is actively watching the monitoring. I am not talking about watching them POST, for goodness' sake.

Better to be on call and get called once in a blue moon than to have to spend Friday night always stuck staring at console screens. I'd much prefer to get paged once in a while (and it is RARE if you are admining well) than to just give up Friday nights for something silly like that. That would be horrible.

And how often is the outage something for the admin? If a config file borks a service, that's for the application team to fix, not the system admin. Why make the system admin look for some non-system mistake? Let the right team handle that.

scottalanmiller

@aaron said in Scheduling Simple Local Linux Reboots:

@JaredBusch someone watching the monitoring dashboard and someone being paged in the middle of a Friday night dinner are very different.

Indeed, getting to have Friday night dinner 95% of the time is the difference.

JaredBusch

@aaron said in Scheduling Simple Local Linux Reboots:

@JaredBusch someone watching the monitoring dashboard and someone being paged in the middle of a Friday night dinner are very different.

First of all, as we have exhaustively discussed in the other threads, IT hours are not 9 to 5, and if you cannot get past that, you need to find a new line of work.

Second, no one should even be monitoring a dashboard in the first place. You should trust your system to send alerts. You trust it because you test it under controlled conditions during normal work hours.

scottalanmiller

@JaredBusch said in Scheduling Simple Local Linux Reboots:

Second, no one should even be monitoring a dashboard in the first place. You should trust your system to send alerts. You trust it because you test it under controlled conditions during normal work hours.

Plus, this is a good way to test that system, too. You could leave alerting on and let people get alerts that there is an outage and that it is resolved. The person having dinner with the family could see the alert, know to watch for it to clear, see it clear, and get to enjoy Friday night knowing that the system did what it was supposed to do.

scottalanmiller

In larger teams, you normally have 24x7 staff, so the reboot schedule goes to the current shift to monitor. It's only shops that lack round-the-clock scheduling that have this as a real issue, and if you don't have 24x7 coverage, shouldn't you be outsourcing to a shop that does, if you really need it at all? I think this normally (maybe not always) becomes a problem when you are dealing with layers and layers of other problems, such as not having enough IT staff to properly cover a department without unnecessary cost and risk, while also choosing not to outsource to an MSP/ITSP that could do this cost-effectively.