PRTG Alternative...



  • Hi folks,

    Any suggestions for website monitoring software? Other than one issue, PRTG works totally fine.

    We check our sites over HTTP every 10 seconds. When we have downtime, we investigate, find the cause, and would like to tag the reason for the downtime within PRTG.

    Sadly, I cannot see a way to do so. Do you know of any software that can? I would like to be able to run reports and exclude certain events based on the tag.

    Best,
    Jim



  • Not sure if this fits your needs 100% but certainly works for my needs. I use uptimerobot.com. They have a free and paid version but I only needed the free version. If a site does not respond to pings, it sends an email to my ticketing software to open a ticket for me to investigate. It can do more than just verify ping as well such as look for content on the page and flag if found or not found...etc.

    Of course, this only works if your sites are externally available.



  • @nashbrydges

    @nashbrydges said in PRTG Alternative...:

    Not sure if this fits your needs 100% but certainly works for my needs. I use uptimerobot.com. They have a free and paid version but I only needed the free version. If a site does not respond to pings, it sends an email to my ticketing software to open a ticket for me to investigate. It can do more than just verify ping as well such as look for content on the page and flag if found or not found...etc.

    Of course, this only works if your sites are externally available.

    Thanks, I'll take a look. It seems to only check every 5 minutes. We check every 10 seconds; is that possible with this? It doesn't look like I can filter or tag, though 😞



  • The Pro plan can check every minute but I think that's the shortest interval available.

    As for tagging, if you send emails to your ticketing system, you could use the email attributes to create ticket metadata tags for reporting and filtering from there. Reporting from within the website is limited but since I only use the free version, that may be why it's limited.
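    A minimal sketch of the email-to-tags idea, assuming the alert arrives as a plain email whose subject carries the status and site name. The addresses, the subject format, and the `tags_from_alert` helper are all hypothetical; check what your monitor actually sends before relying on any of it.

```python
from email import message_from_string

# Hypothetical alert email. The subject format below is an assumption,
# not the documented output of any particular monitoring service.
RAW_ALERT = """\
From: alerts@monitor.example
To: tickets@helpdesk.example
Subject: Monitor is DOWN: shop.example.com

The monitor shop.example.com is currently DOWN (Connection Timeout).
"""


def tags_from_alert(raw: str) -> dict:
    """Derive ticket metadata tags from the headers of an alert email."""
    msg = message_from_string(raw)
    subject = msg["Subject"] or ""
    if "DOWN" in subject:
        status = "down"
    elif "UP" in subject:
        status = "up"
    else:
        status = "unknown"
    # Assume the site name is whatever follows the last colon in the subject.
    site = subject.rsplit(":", 1)[-1].strip()
    # The cause starts out unclassified; a human sets it after investigating.
    return {"status": status, "site": site, "cause": "unclassified"}


ticket_tags = tags_from_alert(RAW_ALERT)
print(ticket_tags)
```

    The ticketing system would then let someone change `cause` once the investigation is done, and reports can filter on it.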



  • @nashbrydges

    @nashbrydges said in PRTG Alternative...:

    The Pro plan can check every minute but I think that's the shortest interval available.

    As for tagging, if you send emails to your ticketing system, you could use the email attributes to create ticket metadata tags for reporting and filtering from there. Reporting from within the website is limited but since I only use the free version, that may be why it's limited.

    Is the free plan still free for business use? Or does that have to be Pro? (If you know, of course.)



  • @jimmy9008 I haven't read the ToS in some time but can't remember restrictions about personal vs business on the free plan. Just limited to 50 alert types.



  • @jimmy9008 - In PRTG, you can generate a ticket as part of the alert process. Why can't you use that to enter the reason for downtime?

    I use PRTG and UptimeRobot. UR's closest interval is 1 min.



  • @wrx7m said in PRTG Alternative...:

    @jimmy9008 - In PRTG, you can generate a ticket as part of the alert process. Why can't you use that to enter the reason for downtime?

    I use PRTG and UptimeRobot. UR's closest interval is 1 min.

    You probably could, but you cannot then use that to retrospectively report on uptime/downtime.



  • @jimmy9008 - So you are trying to categorize "types" of downtime?



  • @wrx7m said in PRTG Alternative...:

    @jimmy9008 - So you are trying to categorize "types" of downtime?

    Yes, with the ability to then include or exclude some types to produce different reports. For example, 0.0087% downtime was due to a line issue and 0.0045% was due to a bad release by development. We may exclude the latter from the report, as it's not the fault of the infrastructure team and should not, for example, affect their yearly bonus.



  • @jimmy9008 I can see why you would want this. If you are using a secondary ticketing system for all other issues, you can have PRTG send an email alert with certain info and have the other ticketing system create a ticket, so the issue is tracked there. That is the only way I can think of to automate most of it.



  • I imagine something can do this. Don't mind moving away from PRTG if needed.

    Essentially, the infrastructure team gets a bonus for 99.995% uptime or more. If less, no bonus. Development often runs releases that cause downtime, or sometimes even screws up and restarts services without permission, which also causes downtime. I'd like to exclude those events from the reports to see whether the infrastructure team gets its bonus, but I can't see a way to do that within PRTG.

    Server crashes, performance issues, or line drops would still be included in the calculation.
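    To make the arithmetic concrete, here is a rough sketch of the calculation described above. The outage durations and cause names are made-up sample data, and `Outage` and `uptime_percent` are hypothetical helpers, not anything PRTG provides.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds in a non-leap year
BONUS_THRESHOLD = 99.995            # percent uptime, as stated in the thread


@dataclass
class Outage:
    seconds: int  # how long the site was down
    cause: str    # tag assigned after investigation (hypothetical names)


def uptime_percent(outages, excluded_causes=frozenset(), period=SECONDS_PER_YEAR):
    """Uptime over the period, ignoring outages whose cause is excluded."""
    counted = sum(o.seconds for o in outages if o.cause not in excluded_causes)
    return 100.0 * (period - counted) / period


# Made-up sample data for one year.
outages = [
    Outage(900, "line"),
    Outage(1200, "dev-release"),
    Outage(300, "server-crash"),
]

raw = uptime_percent(outages)
infra_only = uptime_percent(outages, excluded_causes={"dev-release"})

# Downtime allowed per year before the bonus is lost: about 1,577 seconds.
budget_seconds = SECONDS_PER_YEAR * (100 - BONUS_THRESHOLD) / 100

print(f"all causes:       {raw:.4f}%")
print(f"infra-owned only: {infra_only:.4f}%")
print(f"yearly budget:    {budget_seconds:.0f} s")
```

    With these sample figures the all-causes number falls below 99.995%, while the figure with the dev release excluded clears it. The yearly budget works out to roughly 26 minutes, so a single unannounced release can decide the outcome.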



  • @jimmy9008 I can see the motivation to have these numbers. What ticketing system do you use outside of PRTG?



  • I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.

    Having devs be able to change services in prod sounds like fun....





  • @wrx7m said in PRTG Alternative...:

    @jimmy9008 I can see the motivation to have these numbers. What ticketing system do you use outside of PRTG?

    We use helpscout.



  • @stacksofplates said in PRTG Alternative...:

    I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.

    Having devs be able to change services in prod sounds like fun....

    It is indeed. They are supposed to deploy to develop, test, staging, then live. But sometimes they will just make a mistake. It's the same as an admin accidentally restarting the working server they are RDP'd into.

    As far as the infrastructure team is concerned, the OS, hardware, and networking are under their remit. Any service Dev needs on a server to make the product work is Dev's choice. They also have the choice to restart the services under their remit. They just don't care that doing so may affect another team's bonus.

    For example, they should be telling us when a deployment is planned so we can add planned maintenance for that time, but they often forget. (Yes, that's all a business problem, but it's still my problem: I can't currently use PRTG to prove that the downtime should not count against the team, because I can't rerun the stats after the event.)



  • @jimmy9008 - Interesting. Does helpscout allow you to specify a category of downtime?



  • @wrx7m said in PRTG Alternative...:

    @jimmy9008 - Interesting. Does helpscout allow you to specify a category of downtime?

    Yes. Within Help Scout you could tag a record with, say, "Dev issue". But this is an entirely separate system from PRTG, and I'm not sure how to combine the data.

    You could tag "Dev issue" and count the number of dev issues in a year. But that wouldn't tell you how much of the 0.006% downtime was due to that compared to any other cause. Help Scout has no understanding of the downtime data.
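    One way to bridge the two systems, sketched under the assumption that both can export timestamps: match each outage window from the monitor to the ticket opened during it (or shortly after) and sum the downtime per tag. The data, the five-minute grace period, and `downtime_by_tag` are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical exports: outage windows from the monitor and
# (opened_at, tag) pairs from the ticketing system.
outages = [
    (datetime(2018, 7, 2, 10, 0, 0), datetime(2018, 7, 2, 10, 4, 0)),
    (datetime(2018, 7, 9, 22, 30, 0), datetime(2018, 7, 9, 22, 31, 30)),
]
tickets = [
    (datetime(2018, 7, 2, 10, 1, 0), "Dev issue"),
    (datetime(2018, 7, 9, 22, 30, 40), "Line issue"),
]


def downtime_by_tag(outages, tickets, grace=timedelta(minutes=5)):
    """Sum outage durations per ticket tag, matching on time overlap."""
    totals = {}
    for start, end in outages:
        # Use the first ticket opened during the outage or within the grace
        # period after it; fall back to "Untagged" when nothing matches.
        tag = next(
            (t for opened, t in tickets if start <= opened <= end + grace),
            "Untagged",
        )
        totals[tag] = totals.get(tag, timedelta()) + (end - start)
    return totals


totals = downtime_by_tag(outages, tickets)
for tag, duration in totals.items():
    print(f"{tag}: {duration.total_seconds():.0f} s")
```

    Matching on time alone is fragile when outages overlap or tickets are opened late, so treat this as a starting point for a report, not a replacement for a single system that stores both.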



  • @jimmy9008 - Not ideal, but you could include a screenshot or log of the total downtime from PRTG in the helpscout and classify it as dev issue.



  • @wrx7m said in PRTG Alternative...:

    @jimmy9008 - Not ideal, but you could include a screenshot or log of the total downtime from PRTG in the helpscout and classify it as dev issue.

    It's quite a workaround. It would be better to have everything in one system.



  • @jimmy9008 - Absolutely



  • @jimmy9008

    Here is a constant interval option (enterprise plan) -

    https://www.statuscake.com/pricing/



  • @jimmy9008 said in PRTG Alternative...:

    @stacksofplates said in PRTG Alternative...:

    I'll see if Alertmanager has tag abilities for alerts. I know there are comments but not sure if you can sort by anything.

    Having devs be able to change services in prod sounds like fun....

    It is indeed. They are supposed to deploy to develop, test, staging, then live. But sometimes they will just make a mistake. It's the same as an admin accidentally restarting the working server they are RDP'd into.

    As far as the infrastructure team is concerned, the OS, hardware, and networking are under their remit. Any service Dev needs on a server to make the product work is Dev's choice. They also have the choice to restart the services under their remit. They just don't care that doing so may affect another team's bonus.

    Yeah that's crazy. That should be handled by something like Kubernetes or Nomad/Consul. Humans restarting in prod should be an emergency scenario. Let the orchestration tools do the work.



  • I'll look at Alertmanager when I get home. The comments section might be enough.



  • I just looked. The only place to add comments in Alertmanager is when an alert is silenced. I looked in Grafana as well, and that might be of use. Grafana will let you set alerts on specific metrics, and then you can set annotations on those alerts. Here's a sample graph with alerts (they're the red dotted lines).

    [Screenshot: alerts.png]

    You can click on the alert and give an annotation.

    [Screenshot: annotation.png]

    Then when you hover over the alert you can see the annotations and tags.

    [Screenshot: annotation-alert.png]
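    The same kind of annotation can also be created through Grafana's HTTP API instead of the UI. Below is a minimal sketch that builds a `POST /api/annotations` request with a time range, text, and tags. The URL, token, and tag names are placeholders, and the field names should be checked against the API docs for your Grafana version.

```python
import json
import urllib.request
from datetime import datetime, timezone

GRAFANA_URL = "https://grafana.example.com"  # placeholder, not a real server
API_TOKEN = "REPLACE_ME"                      # placeholder API token


def to_epoch_ms(dt: datetime) -> int:
    """Grafana annotations use epoch timestamps in milliseconds."""
    return int(dt.timestamp() * 1000)


def build_annotation_request(start, end, text, tags):
    """Build (but do not send) a request that creates a tagged annotation."""
    payload = {
        "time": to_epoch_ms(start),
        "timeEnd": to_epoch_ms(end),
        "tags": tags,
        "text": text,
    }
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )


req = build_annotation_request(
    datetime(2018, 7, 2, 10, 0, tzinfo=timezone.utc),
    datetime(2018, 7, 2, 10, 4, tzinfo=timezone.utc),
    "Unannounced release restarted the web service",
    ["downtime", "dev-release"],  # hypothetical tag names
)
print(req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req)  # uncomment to send it to a real Grafana instance
```

    Tagging every outage this way would give you something to filter on later, for example pulling all annotations tagged `dev-release` and subtracting their duration from the downtime total.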

