Useful Tech: PagerDuty Overview and Review

PagerDuty (PD) is an alerting and monitoring service used by many of the largest companies in the world, from Carnival Cruise Line to Cloudflare and Slack. PD connects to your favorite cloud providers (AWS, Oracle Cloud Infrastructure (OCI), GCP, Azure) and to any other service that can invoke a webhook when, for example, an alarm fires. PD takes these sources of information, whether an alarm firing in Oracle Cloud or a failing application invoking a webhook, and alerts whoever is on call through a variety of channels.
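As a rough illustration of the webhook path, the sketch below builds a "trigger" event for PagerDuty's Events API v2 and can optionally post it. The routing key, summary, and source values are placeholders, and the helper names `build_trigger_event` and `send_event` are hypothetical; treat it as a minimal sketch under those assumptions rather than a complete integration.

```python
import json
import urllib.request

# Public endpoint of PagerDuty's Events API v2.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON body for a 'trigger' event (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,  # integration key of the PD service (placeholder)
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    # An optional dedup_key lets repeated events update the same incident.
    if dedup_key is not None:
        event["dedup_key"] = dedup_key
    return event


def send_event(event, url=EVENTS_API_URL, timeout=10):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_event(build_trigger_event("YOUR_ROUTING_KEY", "web-01 unreachable", "web-01"))` with a real integration key would open an incident on the matching service; with a placeholder key the request is rejected.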

One of the best features of PD is that it can alert you in many more ways than a simple email. If you want to be notified of an event that isn’t critical but is nice to know about, such as a disk reaching 50% full on a server, you can have PD send you an email. If a production server goes down, however, PD can do much more. Inside PD, you can set up various services with different escalation policies, depending on the severity or importance of each service. Returning to the production server example, you can place everything deemed mission critical under one grouping, or "Service", in PD. If an alarm fires in Oracle Cloud Infrastructure saying that this server is no longer reachable, PD can swing into action and begin alerting the on-call engineer. Again, this can be as simple as an email or as annoying (or useful) as texting their phone every 30 seconds, sending in-app push notifications every 10 seconds, and calling them every minute until they acknowledge the incident. PD also has ways to bring many responders together: you can attach a Slack channel to the incident, have all responders join a specific Zoom call, or provide a conference call number that can either A) route to whoever is on call at the moment or B) put many responders on one phone call to help engineer a solution.

PD’s integrations and possible uses are extensive, but what are some real examples of PD in action?

Let’s take my admittedly novice use of the service, which helps me keep my own services running. I am currently on PD’s free tier, which allows unlimited API calls, 1 on-call schedule, 1 escalation policy, 100 SMS/phone credits per month, and a whole host of other perks for $0 per month for up to 5 users.

Currently, I have two groups in PD: one for critical infrastructure, such as the server hosting this website, and another for less critical infrastructure, like my Nextcloud instance or Plex Media Server instance. Alerts for critical infrastructure fire on the “high priority” setting, which immediately sends a push notification through the PD app on my phone and emails me. If I don’t respond, it repeats this 5 times at 2-minute intervals, then tries to call me twice: once 10 minutes and again 15 minutes after the incident was created. I then have it set to repeat this whole notification policy once more. Everything outside critical infrastructure follows the “low priority” notification setting, which emails me instantly, sends a push notification to my phone every 5 minutes up to 4 times, and then stops. These notification rules can be changed on a person-by-person basis in the notification settings in the PD console.
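To make the timing concrete, here is a small sketch that lays out one pass of each notification setting as (minute, channel) pairs. It encodes one reading of the rules above (repeats start one interval after the first notification), and the function names are hypothetical; it models the schedule only and does not talk to PD.

```python
def high_priority_timeline():
    """Return (minute, channel) pairs for one pass of the high-priority rules."""
    timeline = [(0, "push"), (0, "email")]
    # Assumed reading: five repeats, two minutes apart, after the first notification.
    for repeat in range(1, 6):
        minute = repeat * 2
        timeline.append((minute, "push"))
        timeline.append((minute, "email"))
    # Two phone calls, 10 and 15 minutes after the incident was created.
    timeline.append((10, "phone"))
    timeline.append((15, "phone"))
    # Stable sort by minute keeps the insertion order within each minute.
    return sorted(timeline, key=lambda entry: entry[0])


def low_priority_timeline():
    """Return (minute, channel) pairs for the low-priority rules."""
    timeline = [(0, "email")]
    # Assumed reading: four pushes, five minutes apart, starting at minute 5.
    for repeat in range(1, 5):
        timeline.append((repeat * 5, "push"))
    return timeline


if __name__ == "__main__":
    for minute, channel in high_priority_timeline():
        print(f"+{minute:>2} min  {channel}")
```

Under this reading, a single high-priority pass ends with the second phone call at the 15-minute mark, after which the whole policy repeats once.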

These incidents are created when alarms set up in OCI begin firing. OCI has built-in PD integrations that allow seamless connectivity between OCI alarms and PD incidents. Most of these alarms monitor my VMs through internal metrics like CPU usage, network activity, and instance state. Additional alarms are connected to health checks, which test the reachability of certain APIs and websites that I host every 30 seconds. These alarms begin to fire once a health check fails 4 consecutive times, which means the service has been down for about 2 minutes. All of these alarms are configurable in OCI, and the best part is that the way I created them and connected them to PD is very similar across other IaaS providers like Azure, GCP, AWS, and even DigitalOcean.
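The health check rule above boils down to a consecutive-failure counter. The sketch below is a minimal model of that logic, using the 30-second interval and 4-failure threshold from the text; the function names are hypothetical and this is not how OCI implements its alarms internally.

```python
CHECK_INTERVAL_SECONDS = 30   # one health check every 30 seconds
FAILURES_BEFORE_ALARM = 4     # alarm fires after 4 consecutive failures


def seconds_until_alarm(interval=CHECK_INTERVAL_SECONDS,
                        threshold=FAILURES_BEFORE_ALARM):
    """Approximate downtime covered by the failed checks before the alarm fires."""
    return interval * threshold


def first_alarm_index(results, threshold=FAILURES_BEFORE_ALARM):
    """Return the index of the check at which the alarm starts firing, or None.

    `results` is a sequence of booleans, True meaning the check passed.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(results):
        # Any successful check resets the streak.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return index
    return None
```

Requiring several failures in a row trades a couple of minutes of detection delay for fewer false alarms from a single dropped request.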

Finally, PD also has some other wonderful perks that many self-hosted uptime solutions, such as Uptime Kuma, cannot offer. Firstly, even on the free tier, PD promises zero maintenance windows, ever. That promise is very hard to keep, and since the free tier costs nothing, it comes with no SLA. However, I am signed up for emails from PD’s status page, and although they do appear to have incidents about once a week, in about 6 months of using the service I have never seen an incident where the core infrastructure that runs it was unavailable. Part of this is due to the fact that they run several times the infrastructure needed to handle their load, as is evident in this blog post. Additional features include integrations with Git-based workflow tools such as GitHub, GitLab, Bitbucket, and others, which allow responders to track recent changes to a given repo in an incident’s details. Another cool feature is PD’s ability to automatically group incidents together, so that if one server running multiple services goes down, you won’t be hit with a flood of incidents all relating to that one server. PD’s automation intelligently works out which notifications may be related and groups them, so that responders can spend more time resolving the incident and less time acknowledging incidents in the PD dashboard. Finally, another helpful tool is the ability to look back on how engineers responded to an incident through the dashboard or chat, and to create postmortems inside the dashboard.
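To illustrate the grouping idea only, the sketch below collapses alerts that share a host into a single bucket. This is a deliberately naive stand-in, not PD's actual grouping logic, and the `host` and `summary` field names are assumptions made for the example.

```python
from collections import defaultdict


def group_alerts_by_host(alerts):
    """Group alert summaries by the host they came from (hypothetical helper)."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["host"]].append(alert["summary"])
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"host": "web-01", "summary": "website unreachable"},
        {"host": "web-01", "summary": "API health check failing"},
        {"host": "media-01", "summary": "disk 50% full"},
    ]
    for host, summaries in group_alerts_by_host(alerts).items():
        print(f"{host}: {len(summaries)} alert(s)")
```

With grouping like this, a responder acknowledges one incident per affected server instead of one per failing service.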

We have only just scratched the surface of what PD can do across hundreds of different situations, but I hope this overview has helped you understand PD and all that it has to offer.