Note: This is a repost of the original article at the Zabbix Blog.
This post outlines how to use Zabbix and iLert with multiple on-call teams, where each team is responsible for a set of host groups in Zabbix, and therefore, will only receive alerts for the services it is responsible for. But first, let’s start with the basic needs when being on-call.
As the guardians of production systems, on-call engineers most fundamentally need two things: the ability to detect that something is broken or about to break, and an effective way to bring human attention to the affected systems. A major benefit of a monitoring tool like Zabbix is that it lets you disengage from the systems that you monitor and from the monitoring system itself. Once you define the conditions that need immediate attention from a human, you can rely on the passive monitoring of your network, hosts and applications to watch for changing conditions and have the right person alerted.
While Zabbix excels at the monitoring part, it defers the responsibility of alerting the right people to dedicated alerting solutions through out-of-the-box integrations. Like most monitoring tools, Zabbix doesn’t provide the capabilities required to page on-call engineers, such as alerting via voice call, frictionless alert acknowledgement, on-call schedule management, and automatic escalations. Some organizations simply send emails to the entire team for urgent alerts, which often results in nobody taking responsibility and the emails being ignored. Email is also easy to miss and should never be used as the primary alerting method.
Dedicated alerting systems extend monitoring tools with advanced alerting and on-call management capabilities. One tool that works out of the box with Zabbix is iLert. It is included as a media type in Zabbix 5.x. For Zabbix 4.4+, it can be imported as a media type from the Zabbix GitHub repository.
iLert is an alerting and on-call management solution for ops teams and helps you to respond to incidents faster. It extends monitoring tools such as Zabbix with advanced alerting through SMS, phone calls, and push notifications and lets you manage on-call duty with schedules and escalations.
With iLert’s Zabbix integration, you can automatically create incidents in iLert based on triggers in Zabbix and alert the on-call person through multiple channels, such as phone calls, SMS, push notifications, Slack, Microsoft Teams and more. Core features of iLert include:
Let’s assume we have two teams, A and B, which are each responsible for a set of hosts and the applications running on those hosts. We’re going to group all hosts of a team into host groups, create a user group for each team, and use both host and user groups to assign permissions to hosts for the different teams. Since the individual team members, along with their contact data, on-call schedules, and escalation rules, are defined and managed in iLert, we’re not going to create a Zabbix user for every team member. Instead, we’re going to create a single user for every team. The user will be connected with the corresponding alert source in iLert. An alert source in turn is linked with the right team members and will make sure to notify the right team member using on-call schedules and escalation policies. The image below illustrates what our resulting setup will look like:
Now let’s implement this scenario step-by-step.
Set a name (e.g. “Team A Zabbix”), select Team A’s escalation policy, and set the Integration type to Zabbix:
Click Save. An API key is generated on the next page. You will need the API key in the next section.
Repeat steps 1-3 and create an alert source for Team B.
Go to Administration –> User groups
Create two user groups Team A and Team B and assign each group Read permissions to their respective host group:
Create one user for each team: iLert Team A alert source and iLert Team B alert source
Switch to the Operations tab and add the following operations under Operations, Recovery operations, and Update operations:
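The user group step above can also be scripted. The following is a minimal sketch that only builds the JSON-RPC request bodies for the Zabbix API method `usergroup.create`; it does not send them. It assumes the Zabbix 5.x parameter name `rights` (later releases may name this parameter differently), and the host group IDs and the auth token are hypothetical placeholders that you would look up in your own instance.

```python
import json

# Hypothetical host group IDs; replace with the real IDs from your Zabbix instance.
HOST_GROUP_IDS = {"Team A": "15", "Team B": "16"}

# Zabbix permission levels for host groups: 0 = deny, 2 = read, 3 = read-write.
PERM_READ = 2


def usergroup_create_request(team, hostgroup_id, auth_token, request_id):
    """Build (but do not send) a JSON-RPC 2.0 body for usergroup.create."""
    return {
        "jsonrpc": "2.0",
        "method": "usergroup.create",
        "params": {
            "name": team,
            # Assumption: Zabbix 5.x parameter name; check the API docs of your version.
            "rights": [{"permission": PERM_READ, "id": hostgroup_id}],
        },
        "auth": auth_token,  # placeholder token obtained from a prior login call
        "id": request_id,
    }


if __name__ == "__main__":
    for n, (team, group_id) in enumerate(HOST_GROUP_IDS.items(), start=1):
        body = usergroup_create_request(team, group_id, "<auth-token>", n)
        print(json.dumps(body, indent=2))
```

Each printed body could then be posted to the `api_jsonrpc.php` endpoint of your Zabbix frontend with an HTTP client of your choice.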
The resulting operations view should look like this:
But wait… Wouldn’t Team A get notifications for Team B’s problems and vice versa? No, since both user groups have only read permissions for the host groups they are responsible for, they will only receive notifications related to their own host groups.
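To make the permission rule concrete, here is a small simulation of it. This is not Zabbix code, only a sketch of the behavior described above: a user is notified about a problem only if the user has at least read permission on the problem’s host group. The host group names and the `Problem` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    name: str
    host_group: str


# Hypothetical mapping: Zabbix user -> host groups the user can read.
READABLE_GROUPS = {
    "iLert Team A alert source": {"Team A hosts"},
    "iLert Team B alert source": {"Team B hosts"},
}


def recipients(problem, readable_groups=READABLE_GROUPS):
    """Return the users that may be notified about the given problem."""
    return sorted(
        user
        for user, groups in readable_groups.items()
        if problem.host_group in groups
    )


if __name__ == "__main__":
    print(recipients(Problem("Disk almost full", "Team A hosts")))
    print(recipients(Problem("High CPU load", "Team B hosts")))
```

With a single trigger action that targets both users, each problem still reaches only the user whose group can read the affected host group.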
However, you might want to consider creating separate trigger actions for each team, for example, if you want to define different conditions for the trigger action, or if you have a large number of teams and want to keep things separated for the sake of maintainability.
Both teams will now be automatically notified of problems in their services. Incidents in iLert are closed automatically when the problem in Zabbix recovers. Everything related to responding to an incident is managed and handled by iLert, including alerting the team member, the alerting channels to use, when to escalate, and potentially engaging other stakeholders through iLert’s stakeholder engagement feature.
Below is an example incident from iLert created by Zabbix:
The incident will include a back link to the Zabbix Event Details page and any relevant items sent by Zabbix. Events in Zabbix also include a link to the incident in iLert:
Keep it quiet during maintenance. With maintenance windows, you can temporarily disable one or multiple alert sources for a set period of time. An alert source with an ongoing maintenance will not create any incidents.
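The maintenance rule can be sketched as a simple time-range check. This is an illustrative model, not iLert’s implementation; the `MaintenanceWindow` type, its fields, and the function name are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    alert_sources: frozenset  # names of the alert sources that are muted
    start: datetime
    end: datetime


def should_create_incident(alert_source, at, windows):
    """Return False if the alert source is under maintenance at time `at`."""
    return not any(
        alert_source in w.alert_sources and w.start <= at < w.end
        for w in windows
    )


if __name__ == "__main__":
    window = MaintenanceWindow(
        alert_sources=frozenset({"Team A Zabbix"}),
        start=datetime(2024, 1, 3, 22, 0),
        end=datetime(2024, 1, 3, 23, 0),
    )
    during = datetime(2024, 1, 3, 22, 30)
    print(should_create_incident("Team A Zabbix", during, [window]))
    print(should_create_incident("Team B Zabbix", during, [window]))
```

Only the alert sources listed in a window are muted, so a maintenance on Team A’s systems leaves Team B’s alerting untouched.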
With Priority-Based Alerting, you can set different notification rules for high- and low-priority incidents. That way, you can use more obtrusive alerting methods for incidents that require immediate attention and less obtrusive methods for the ones that don’t. Low-priority incidents also do not escalate automatically. The incident priority is set at the alert source level. Moreover, the incident priority can be set dynamically based on Support Hours. This gives you the flexibility to be notified differently based on the time of day. You could, for example, configure your alert source to send voice alerts during the night and only push notifications during work hours.
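The support-hours example can be sketched as follows. This is a simplified model under stated assumptions, not iLert’s actual logic: the support hours (weekdays, 09:00 to 17:00), the priority labels, and the channel mapping are all hypothetical values chosen to mirror the example above.

```python
from datetime import datetime, time

# Hypothetical support hours: Monday to Friday, 09:00 to 17:00 local time.
SUPPORT_START = time(9, 0)
SUPPORT_END = time(17, 0)

# Hypothetical notification rules per priority, mirroring the example above.
CHANNELS = {"HIGH": ["voice"], "LOW": ["push"]}


def incident_priority(created_at):
    """Low priority during support hours, high priority outside of them."""
    is_weekday = created_at.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = SUPPORT_START <= created_at.time() < SUPPORT_END
    return "LOW" if is_weekday and in_hours else "HIGH"


def channels_for(created_at):
    """Pick the notification channels for an incident created at this time."""
    return CHANNELS[incident_priority(created_at)]


if __name__ == "__main__":
    print(channels_for(datetime(2024, 1, 3, 3, 0)))   # Wednesday night
    print(channels_for(datetime(2024, 1, 3, 10, 0)))  # Wednesday morning
```

The same incident therefore triggers a voice call at night but only a push notification during work hours, when someone is already at their desk.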