PagerDuty (PD) is an alerting and monitoring service used by many of the largest companies in the world, from Carnival Cruise Line to Cloudflare and Slack. PD connects to your favorite cloud providers (AWS, Oracle Cloud Infrastructure (OCI), GCP, Azure) or to any other service that can call a webhook, for example when an alarm fires. PD takes these signals, whether an alarm firing in Oracle Cloud or a failing application invoking a webhook, and alerts whoever is on call through a variety of channels.
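As a rough illustration of the webhook-style integration described above, here is a minimal Python sketch that builds a trigger event for PagerDuty's Events API v2. The routing key, host name, and summary are placeholders, and the helper names (`build_trigger_event`, `send_event`) are my own rather than part of any PD library, so check PD's documentation for your integration before relying on the exact fields.

```python
import json
import urllib.request
from typing import Optional

# Events API v2 endpoint; events are routed to a service by its integration key.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical",
                        dedup_key: Optional[str] = None) -> dict:
    """Build the JSON body for a 'trigger' event (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,   # integration key of the PD service
        "event_action": "trigger",    # opens (or updates) an alert
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # the affected host or component
            "severity": severity,
        },
    }
    if dedup_key:
        # Events sharing a dedup_key are folded into the same alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event: dict, timeout: float = 10.0) -> dict:
    """POST the event to PD and return the decoded JSON response."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder values for illustration only; nothing is sent here.
example = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="Health check failing for example-web",
    source="example-web",
    dedup_key="example-web/health",
)
print(json.dumps(example, indent=2))
```

Passing the result to `send_event` with a real integration key should open an incident on the matching service. Reusing the same `dedup_key` keeps repeated failures from piling up as separate alerts.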
One of the best features of PD is that it can alert you in many more ways than a simple email. If you want to be notified of an event that isn’t critical but is nice to know about, such as a server's disk reaching 50% full, you can have PD send you an email saying so. If a production server goes down, however, PD can do much more. Inside PD, you can set up various services with different escalation policies depending on each service's severity or importance. Returning to the production server example, you can put everything deemed mission-critical under one grouping, or "Service", in PD. If an alarm in Oracle Cloud Infrastructure reports that this server is no longer reachable, PD can swing into action and begin alerting the on-call engineer. Again, this can be as simple as an email or as annoying (or useful) as texting their phone every 30 seconds, sending in-app push notifications every 10 seconds, and calling them every minute until they acknowledge the incident. PD also has ways to bring many responders together: you can attach a Slack channel to the incident, let all responders join a specific Zoom call, or provide a conference call number that can either A) route to whoever is on call at the moment or B) put many responders on one phone call to help engineer a solution.
PD’s integrations and possible uses are extensive, but what are some real examples of PD in action?
Let’s take my admittedly novice use of the service as an example of how it helps me keep my services running. I am currently on PD’s free tier plan, which allows for unlimited API calls, 1 on-call schedule, 1 escalation policy, 100 SMS/phone credits per month, and a whole host of other perks for $0 per month for up to 5 users.
Currently, I have two groups in PD: one for critical infrastructure, such as the server hosting this website, and another for less-critical infrastructure, like my Nextcloud instance or Plex Media Server instance. All critical infrastructure alerts fire on the “high priority” setting, which immediately sends a push notification through the PD app on my phone and emails me. If I don’t respond, it repeats this process every two minutes, five times, and then tries to call me twice: once 10 minutes and again 15 minutes after the incident was created. I then have it set to repeat this notification policy once more. Everything that is not part of critical infrastructure follows the “low priority” notification setting, which emails me instantly, sends a push notification to my phone every 5 minutes, 4 times, and then ends. These notification rules can be changed on a person-by-person basis in the notification settings in the PD console.
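To make the high-priority rules easier to picture, here is a small Python sketch that lays them out as a timeline. It is only a model of the description above, not PD configuration. It assumes the five repeats land at minutes 2, 4, 6, 8, and 10, which is one reading of the rules, and it covers a single pass of the policy. The function names are hypothetical.

```python
from typing import List, Tuple

Step = Tuple[int, str]  # (minutes after the incident was created, channel)


def repeated(channel: str, first_minute: int, interval: int, count: int) -> List[Step]:
    """Return `count` notifications on `channel`, `interval` minutes apart."""
    return [(first_minute + i * interval, channel) for i in range(count)]


def high_priority_timeline() -> List[Step]:
    """One pass of the high-priority rules, under the assumptions stated above."""
    steps: List[Step] = [(0, "push"), (0, "email")]  # immediate notifications
    steps += repeated("push", first_minute=2, interval=2, count=5)
    steps += repeated("email", first_minute=2, interval=2, count=5)
    steps += [(10, "phone"), (15, "phone")]          # two phone calls
    return sorted(steps)


for minute, channel in high_priority_timeline():
    print(f"t+{minute:>2} min: {channel}")
```

Writing the rules out this way makes it easy to see how many times a single unacknowledged incident will reach you before the policy ends.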
These incidents are created when alarms set up in OCI begin firing. OCI has a built-in PD integration that connects OCI alarms to PD incidents. These alarms monitor my VMs through internal metrics like CPU usage, network activity, and instance state. Additional alarms are connected to health checks, which test the reachability of specific APIs and websites that I host every 30 seconds. These alarms begin to fire once a health check fails 4 consecutive times, which signifies that the service has been down for about 2 minutes. All of these alarms are configurable in OCI, and the best part is that the method I used to create them and connect them to PD is very similar across other IaaS providers like Azure, GCP, AWS, and even DigitalOcean.
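The consecutive-failure rule is simple enough to sketch in a few lines of Python. This is an illustration of the logic only, not how OCI implements its alarms, and the class and attribute names are hypothetical. The defaults mirror my setup: a check every 30 seconds and an alarm after 4 failures in a row.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureAlarm:
    threshold: int = 4           # consecutive failures needed to fire
    interval_seconds: int = 30   # time between health checks
    failures: int = 0            # current run of consecutive failures

    def record(self, healthy: bool) -> bool:
        """Record one health check result; return True while the alarm is firing."""
        if healthy:
            self.failures = 0    # any success resets the run
        else:
            self.failures += 1
        return self.failures >= self.threshold

    @property
    def detection_window_seconds(self) -> int:
        """Approximate downtime implied when the alarm fires (4 x 30 s = 120 s)."""
        return self.threshold * self.interval_seconds


alarm = ConsecutiveFailureAlarm()
# Two failures, a recovery, then four failures in a row.
checks = [False, False, True, False, False, False, False]
for healthy in checks:
    firing = alarm.record(healthy)
    print("ok  " if healthy else "FAIL", "-> firing" if firing else "")
print(f"Detection window: {alarm.detection_window_seconds} seconds")
```

Resetting the counter on every success is what keeps a single blip from paging anyone: only a sustained outage reaches the threshold.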
Finally, PD also has other wonderful perks that many self-hosted uptime solutions, such as Uptime Kuma, cannot match. Firstly, even on the free tier, PD promises zero maintenance windows, ever. That promise is very hard to keep, and since the free tier costs nothing, it comes with no SLA. However, I am signed up for emails from PD’s status page, and although they appear to report an incident about once a week, in about 2 years of using the service I have never seen one in which the core infrastructure running their service was unavailable. Part of this is because they run several times the infrastructure needed to handle their load, as evident in this blog post.
Additional features include integrations with Git-based workflow tools such as GitHub, GitLab, Bitbucket, and others, which let responders see recent changes to a specific repo in an incident’s details. PD can also automatically group incidents together, so that if one server running multiple services goes down, you won’t be hit with a flood of incidents all relating to that one server. PD’s automation works out which notifications are likely related and groups them, so responders can spend more time resolving the incident and less time acknowledging incidents in the PD dashboard. Finally, another helpful tool is the ability to look back on how engineers responded to an incident, through the dashboard or chat, and to create postmortems inside the dashboard.
We have only just scratched the surface of what PD can do in hundreds of different situations. However, I hope that this overview has allowed you to understand PD and all that it has to offer.