PRTG Alternative...
-
@jimmy9008 - In PRTG, you can generate a ticket as part of the alert process. Why can't you use that to enter the reason for downtime?
I use PRTG and UptimeRobot. UR's closest interval is 1 min.
-
@wrx7m said in PRTG Alternative...:
@jimmy9008 - In PRTG, you can generate a ticket as part of the alert process. Why can't you use that to enter the reason for downtime?
I use PRTG and UptimeRobot. UR's closest interval is 1 min.
You probably could, but you cannot then use that to retrospectively report on uptime/downtime.
-
@jimmy9008 - So you are trying to categorize "types" of downtime?
-
@wrx7m said in PRTG Alternative...:
@jimmy9008 - So you are trying to categorize "types" of downtime?
Yes, and with the ability to then add or remove some types to see different reports. For example, 0.0087% of downtime was a line issue and 0.0045% was a bad release by development. We may exclude the latter from the report as it's not the fault of the infrastructure team and, for example, should not affect their yearly bonus.
-
@jimmy9008 I can see why you would want this. If you are using a secondary ticketing system for all other issues, you can have it send an email alert with certain info and have the other ticketing system create a ticket and be tracked there. That is the only way I can think of automating most of it.
-
I imagine something can do this. Don't mind moving away from PRTG if needed.
Essentially, the infrastructure team get a bonus for 99.995% uptime or more. If less, no bonus. Development often run releases causing downtime, or even sometimes screw up and restart services without permission, causing downtime. I'd like to exclude that Dev-caused downtime from the reports to see whether the infrastructure team get their bonus, but I can't see a way to do it within PRTG.
If a server crashes, performance issues, or line drops then that would be included in the calc.
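To make the calculation concrete, here is a rough sketch of what such a report would have to compute. It is not a PRTG feature. The `Outage` record and the category labels are made up for illustration, and it assumes outages do not overlap each other.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Outage:
    start: datetime
    end: datetime
    category: str  # hypothetical label, e.g. "line", "server_crash", "dev_release"


def uptime_percent(outages, period_start, period_end, excluded=frozenset()):
    """Uptime over the period, ignoring outages whose category is excluded.

    Assumes outages do not overlap; overlapping ones would be double counted.
    """
    period_seconds = (period_end - period_start).total_seconds()
    if period_seconds <= 0:
        raise ValueError("period_end must be after period_start")

    down_seconds = 0.0
    for outage in outages:
        if outage.category in excluded:
            continue
        # Clip each outage to the reporting period.
        start = max(outage.start, period_start)
        end = min(outage.end, period_end)
        if end > start:
            down_seconds += (end - start).total_seconds()

    return 100.0 * (1.0 - down_seconds / period_seconds)
```

With a 10-minute line drop and a 30-minute bad release in a 365-day year, the figure with everything included falls below 99.995%, while calling it with `excluded={"dev_release"}` stays above it. The point is that the categories are stored with the outages, so the report can be rerun after the event.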
-
@jimmy9008 I can see the motivation to have these numbers. What ticketing system do you use outside of PRTG?
-
I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.
Having devs be able to change services in prod sounds like fun....
-
@stacksofplates lol
-
@wrx7m said in PRTG Alternative...:
@jimmy9008 I can see the motivation to have these numbers. What ticketing system do you use outside of PRTG?
We use helpscout.
-
@stacksofplates said in PRTG Alternative...:
I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.
Having devs be able to change services in prod sounds like fun....
It is indeed. They are supposed to deploy to develop, test, staging, then live. But sometimes they will just make a mistake etc. It's the same as an admin accidentally restarting the working server they are RDP'd into.
As far as the infrastructure team care, the OS, hardware and networking are under their remit. Any service Dev needs on a server to make the product work is Dev's choice. They also have the choice to restart the services under their remit. They just don't care that doing so may affect another team's bonus.
For example, they should be telling us when a deployment is planned so we can add planned maintenance for that time, but they often forget. (Yes, that's all a business problem, but it's still my problem, as I can't currently prove using PRTG that the downtime should be excluded from the team's figures, because I can't rerun the stats after the event.)
-
@jimmy9008 - Interesting. Does helpscout allow you to specify a category of downtime?
-
@wrx7m said in PRTG Alternative...:
@jimmy9008 - Interesting. Does helpscout allow you to specify a category of downtime?
Yes. Within helpscout you could tag a record with, say, "Dev issue". But this is an entirely separate system from PRTG, and I'm not sure how to combine the data.
You could tag "Dev issue" and count the number of Dev issues in a year. But that wouldn't tell you how much of the 0.006% downtime was due to that compared to any other cause. Helpscout has no understanding of the downtime data.
-
@jimmy9008 - Not ideal, but you could include a screenshot or log of the total downtime from PRTG in the helpscout and classify it as dev issue.
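If both sides can be exported, the matching could be scripted instead of done by hand. This is only a sketch under assumptions: the tuple shapes for the downtime log and the tagged tickets are invented, and it does not use the real PRTG or helpscout APIs. It assigns each outage the tag of the first ticket created during the outage or within a grace window after it.

```python
from collections import defaultdict
from datetime import timedelta


def downtime_by_tag(outages, tickets, grace=timedelta(minutes=30)):
    """Sum downtime seconds per ticket tag.

    outages: list of (start, end) datetimes from the monitoring export (assumed shape).
    tickets: list of (created_at, tag) from the ticketing export (assumed shape).
    An outage with no matching ticket is counted under "untagged".
    """
    totals = defaultdict(float)
    for start, end in outages:
        tag = "untagged"
        for created_at, ticket_tag in tickets:
            # A ticket matches if it was raised during the outage or shortly after.
            if start <= created_at <= end + grace:
                tag = ticket_tag
                break
        totals[tag] += (end - start).total_seconds()
    return dict(totals)
```

Matching on time alone will mislabel things when two unrelated incidents land close together, so the result would still need a manual check.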
-
@wrx7m said in PRTG Alternative...:
@jimmy9008 - Not ideal, but you could include a screenshot or log of the total downtime from PRTG in the helpscout and classify it as dev issue.
It's quite a workaround. It would be better to have everything in one system.
-
@jimmy9008 - Absolutely
-
Here is a constant interval option (enterprise plan) -
-
@jimmy9008 said in PRTG Alternative...:
@stacksofplates said in PRTG Alternative...:
I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.
Having devs be able to change services in prod sounds like fun....
It is indeed. They are supposed to deploy to develop, test, staging, then live. But sometimes they will just make a mistake etc. It's the same as an admin accidentally restarting the working server they are RDP'd into.
As far as the infrastructure team care, the OS, hardware and networking are under their remit. Any service Dev needs on a server to make the product work is Dev's choice. They also have the choice to restart the services under their remit. They just don't care that doing so may affect another team's bonus.
Yeah that's crazy. That should be handled by something like Kubernetes or Nomad/Consul. Humans restarting in prod should be an emergency scenario. Let the orchestration tools do the work.
-
I'll look at Alertmanager when I get home. The comments section might be enough.
-
I just looked. The only place to add comments in Alertmanager is when an alert is silenced. I looked in Grafana as well and that might be of use. Grafana will let you set alerts on specific metrics and then you can set annotations on those alerts. Here's a sample graph with alerts (they're the red dotted line).
You can click on the alert and give an annotation.
Then when you hover over the alert you can see the annotations and tags.
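Annotations can also be created through Grafana's HTTP API rather than by clicking, which might help if the downtime reason comes from somewhere else. A minimal sketch, assuming a Grafana API token and using a made-up URL and tag names; it only builds the request for `POST /api/annotations` and leaves the sending to you. No dashboard is specified here, which as far as I know makes it an organization-wide annotation.

```python
import json
from urllib.request import Request


def build_annotation_request(base_url, api_token, start, end, tags, text):
    """Build (but do not send) a request that creates a Grafana region annotation.

    start and end are timezone-aware datetimes; Grafana expects epoch milliseconds.
    base_url, api_token and the tag values are placeholders supplied by the caller.
    """
    payload = {
        "time": int(start.timestamp() * 1000),
        "timeEnd": int(end.timestamp() * 1000),
        "tags": list(tags),
        "text": text,
    }
    return Request(
        url=base_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Tagging each outage this way (for example `["downtime", "dev_release"]`) keeps the reason next to the metric, so it can be filtered on later.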