PRTG Alternative...

Jimmy9008

@jimmy9008 I can see the motivation to have these numbers. What ticketing system do you use outside of PRTG?

We use helpscout.

Jimmy9008

@stacksofplates said in PRTG Alternative...:

I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.

Having devs be able to change services in prod sounds like fun....

It is indeed. They are supposed to deploy to develop, test, staging, then live, but sometimes they just make a mistake. It's the same as an admin accidentally restarting the working server they are RDP'd into.

As far as the infrastructure team care, the OS, hardware and networking are under their remit. Any service Dev needs on a server to make the product work is Dev's choice. They also have the choice to restart the services under their remit. They just don't care that doing so might affect another team's bonus.

For example, they should be telling us when a deployment is planned so we can add planned maintenance for that time, but they often forget. (Yes, that's all a business problem, but it's still my problem: with PRTG I can't currently prove that the downtime should be excluded from the team's figures, because I can't rerun the stats after the event.)
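To make "rerun the stats after the event" concrete, here is a minimal Python sketch of the calculation I have in mind. It is not anything PRTG provides: the function names and the plain (start, end) tuples are assumptions for illustration, and it assumes the maintenance windows don't overlap each other.

```python
from datetime import datetime, timedelta

def overlap(a_start, a_end, b_start, b_end):
    """Return how long two time intervals overlap (zero if they don't)."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return max(end - start, timedelta(0))

def downtime_excluding(outages, maintenance):
    """Total outage time, minus any part covered by a maintenance window.

    Both arguments are lists of (start, end) datetime pairs. These are
    hypothetical inputs, e.g. exported from whatever monitoring tool is in use.
    Assumes maintenance windows do not overlap one another.
    """
    total = timedelta(0)
    for o_start, o_end in outages:
        covered = sum(
            (overlap(o_start, o_end, m_start, m_end) for m_start, m_end in maintenance),
            timedelta(0),
        )
        total += (o_end - o_start) - covered
    return total

# A 30 minute outage, with a 20 minute maintenance window added after the fact.
outages = [(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30))]
maintenance = [(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 20))]
print(downtime_excluding(outages, maintenance))
```

The point is that the raw outage intervals are kept, so the maintenance list can be changed later and the figure recalculated.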

wrx7m

@jimmy9008 - Interesting. Does helpscout allow you to specify a category of downtime?

Jimmy9008

@wrx7m said in PRTG Alternative...:

@jimmy9008 - Interesting. Does helpscout allow you to specify a category of downtime?

Yes. Within helpscout you could tag a record with, say, "Dev issue". But this is an entirely separate system from PRTG, and I'm not sure how to combine the data from the two.

You could tag "Dev issue" and count the number of Dev issues in a year. But that wouldn't tell you how much of the 0.006% downtime was due to that compared to any other cause. Helpscout has no understanding of the downtime data.
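What I'd actually want is something like the following rough Python sketch, where each outage carries both a duration and a tag. The data layout and the `downtime_by_tag` name are made up for illustration; neither PRTG nor helpscout exposes the data in this shape as far as I know.

```python
from collections import defaultdict

def downtime_by_tag(outages, period_seconds):
    """Return downtime per tag as a percentage of the reporting period.

    `outages` is a list of (tag, seconds_down) pairs, a hypothetical
    merge of monitoring durations with ticket tags.
    """
    totals = defaultdict(float)
    for tag, seconds in outages:
        totals[tag] += seconds
    return {tag: 100.0 * secs / period_seconds for tag, secs in totals.items()}

YEAR_SECONDS = 365 * 24 * 3600

# Example figures only, not real data.
sample = [("Dev issue", 1200), ("Hardware", 600), ("Dev issue", 90)]
for tag, pct in downtime_by_tag(sample, YEAR_SECONDS).items():
    print(f"{tag}: {pct:.4f}% of the year")
```

With that, the total downtime percentage can be split by cause instead of just counting how many tickets had a given tag.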

wrx7m

@jimmy9008 - Not ideal, but you could include a screenshot or log of the total downtime from PRTG in the helpscout record and classify it as a dev issue.

Jimmy9008

@wrx7m said in PRTG Alternative...:

@jimmy9008 - Not ideal, but you could include a screenshot or log of the total downtime from PRTG in the helpscout record and classify it as a dev issue.

It's quite a workaround. It would be better to have everything in one system.

wrx7m

@jimmy9008 - Absolutely

wrx7m

@jimmy9008

Here is a constant interval option (enterprise plan) -

https://www.statuscake.com/pricing/

stacksofplates

@jimmy9008 said in PRTG Alternative...:

@stacksofplates said in PRTG Alternative...:

I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.

Having devs be able to change services in prod sounds like fun....

It is indeed. They are supposed to deploy to develop, test, staging, then live, but sometimes they just make a mistake. It's the same as an admin accidentally restarting the working server they are RDP'd into.

As far as the infrastructure team care, the OS, hardware and networking are under their remit. Any service Dev needs on a server to make the product work is Dev's choice. They also have the choice to restart the services under their remit. They just don't care that doing so might affect another team's bonus.

Yeah that's crazy. That should be handled by something like Kubernetes or Nomad/Consul. Humans restarting in prod should be an emergency scenario. Let the orchestration tools do the work.

stacksofplates

I'll look at Alertmanager when I get home. The comments section might be enough.

stacksofplates

I just looked. The only place to add comments in Alertmanager is when an alert is silenced. I looked in Grafana as well, and that might be of use. Grafana will let you set alerts on specific metrics, and then you can set annotations on those alerts. Here's a sample graph with alerts (they're the red dotted lines).

You can click on the alert and give an annotation.

Then when you hover over the alert you can see the annotations and tags.
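If you wanted to add those annotations from a script rather than by clicking, Grafana also has an HTTP API for annotations. Below is a minimal Python sketch that only builds the JSON body. The field names follow my understanding of the `POST /api/annotations` endpoint, so check them against the docs for your Grafana version; the dashboard UID, panel ID and tag values are placeholders I made up.

```python
import json
from datetime import datetime, timezone

def build_annotation(dashboard_uid, panel_id, start, end, tags, text):
    """Build the request body for a Grafana region annotation.

    Times are sent as epoch milliseconds. Field names are assumed from
    the Grafana annotations HTTP API; verify them for your version.
    """
    return {
        "dashboardUID": dashboard_uid,
        "panelId": panel_id,
        "time": int(start.timestamp() * 1000),
        "timeEnd": int(end.timestamp() * 1000),
        "tags": list(tags),
        "text": text,
    }

# Placeholder values for illustration only.
body = build_annotation(
    dashboard_uid="example-uid",
    panel_id=2,
    start=datetime(2020, 1, 1, 10, 0, tzinfo=timezone.utc),
    end=datetime(2020, 1, 1, 10, 30, tzinfo=timezone.utc),
    tags=["dev-issue", "unplanned"],
    text="Service restarted during an unannounced deployment",
)
print(json.dumps(body, indent=2))
```

You would then send that body to your Grafana instance with an API token. The tags are what would let you filter later for things like "dev issue" downtime.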