Note: This is a repost of the original article at the Zabbix Blog.
This post outlines how to use Zabbix and iLert with multiple on-call teams, where each team is responsible for a set of host groups in Zabbix and therefore only receives alerts for the services it is responsible for. But first, let’s start with the basic needs of being on-call.
As the guardians of production systems, on-call engineers most fundamentally need to detect that something is broken or about to break, and they need an effective way to bring human attention to the affected systems. A major benefit of a monitoring tool like Zabbix is that it lets you disengage from the systems that you monitor and from the monitoring system itself. Once you define the conditions that need immediate attention from a human, you can rely on the passive monitoring of your network, hosts and applications to watch for changing conditions and have the right person alerted.
While Zabbix excels in the monitoring part, it defers the responsibility of alerting the right people to dedicated alerting solutions through out-of-the-box integrations. Like most monitoring tools, Zabbix doesn’t provide the capabilities required to page on-call engineers, such as alerting via voice call, frictionless alert acknowledgement, on-call schedule management, and automatic escalations. Some organizations simply send emails to the entire team for urgent alerts, which often results in nobody taking responsibility and the emails being ignored. Besides, email is a poor mechanism for urgent alerts and should never be used as the primary alerting method.
Dedicated alerting systems extend monitoring tools with advanced alerting and on-call management capabilities. One tool that works out of the box with Zabbix is iLert. It is included as a media type in Zabbix 5.x. For Zabbix 4.4+, it can be imported as a media type from the Zabbix GitHub repository.
iLert is an alerting and on-call management solution for ops teams and helps you to respond to incidents faster. It extends monitoring tools such as Zabbix with advanced alerting through SMS, phone calls, and push notifications and lets you manage on-call duty with schedules and escalations.
With iLert’s Zabbix integration, you can automatically create incidents in iLert based on triggers in Zabbix and alert the on-call person through multiple channels, such as phone calls, SMS, push notifications, Slack, Microsoft Teams and more. Core features of iLert include:
Let’s assume we have two teams, A and B, who are responsible for a number of hosts and the applications running on those hosts. We’re going to group all hosts of a team into host groups, create a user group for each team, and use both host and user groups to assign permissions to hosts for the different teams. Since the individual team members, along with their contact data, on-call schedules, and escalation rules, are defined and managed in iLert, we’re not going to create a Zabbix user for every team member. Instead, we’re going to create a single user for every team. That user will be connected with the corresponding alert source in iLert. An alert source, in turn, is linked with the right team members and makes sure to notify the right team member using on-call schedules and escalation policies. The image below illustrates what our resulting setup will look like:
Now let’s implement this scenario step-by-step.
Set a name (e.g. “Team A Zabbix”), select Team A’s escalation policy, and set the Integration type to Zabbix:
Click Save. An API key is generated on the next page. You will need the API key in the next section.
Repeat steps 1-3 and create an alert source for Team B.
Go to Administration –> User groups
Create two user groups Team A and Team B and assign each group Read permissions to their respective host group:
Create one user for each team, named iLert Team A alert source and iLert Team B alert source respectively
Switch to Operations tab and add the following operations under Operations, Recovery operations, Update operations:
The resulting operations view should look like this:
But wait… Wouldn’t Team A get notifications for Team B’s problems and vice versa? No, since both user groups have only read permissions for the host groups they are responsible for, they will only receive notifications related to their own host groups.
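The permission-based routing described above can be illustrated with a small model. This is only a sketch of the idea, not how Zabbix is implemented internally; the class names, the host group names, and the `recipients` function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UserGroup:
    """Hypothetical stand-in for a Zabbix user group."""
    name: str
    # Names of the host groups this user group has read access to.
    readable_host_groups: set = field(default_factory=set)


@dataclass
class Problem:
    """Hypothetical stand-in for a problem event raised by a trigger."""
    host: str
    host_group: str
    summary: str


def recipients(problem, user_groups):
    """Return the names of user groups that may be notified about a problem.

    A group is only notified if it can read the host group the problem
    belongs to, which mirrors the behaviour described in the text.
    """
    return [
        group.name
        for group in user_groups
        if problem.host_group in group.readable_host_groups
    ]


# Assumed example data: one host group per team.
team_a = UserGroup("Team A", {"Team A hosts"})
team_b = UserGroup("Team B", {"Team B hosts"})
groups = [team_a, team_b]

problem = Problem("db-01", "Team B hosts", "High CPU load")
print(recipients(problem, groups))
```

Even though a single trigger action targets both user groups, only Team B is selected for a problem on a Team B host.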
However, you might want to consider creating separate trigger actions for each team, for example, if you want to define different conditions for each trigger action, or if you have a large number of teams and want to keep things separated for the sake of maintainability.
Both teams will now be automatically notified of problems in their services. Incidents in iLert will be closed automatically when the problem in Zabbix recovers. Everything related to responding to an incident is managed and handled by iLert, including which team member to alert, which alerting channels to use, when to escalate, and potentially engaging other stakeholders through iLert’s stakeholder engagement feature.
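With many teams, the user groups from the steps above could be created through the Zabbix JSON-RPC API instead of the web interface. The sketch below only builds the request bodies and does not send them. It assumes the `usergroup.create` method with the `rights` field as documented for Zabbix 5.x (newer versions may use different field names), and the host group IDs and auth token are placeholders.

```python
import json

READ_ONLY = 2  # Zabbix permission level for read-only access


def usergroup_create_request(name, host_group_id, auth_token, request_id=1):
    """Build a JSON-RPC body that creates a user group with read access
    to a single host group (field names assume the Zabbix 5.x API)."""
    return {
        "jsonrpc": "2.0",
        "method": "usergroup.create",
        "params": {
            "name": name,
            "rights": [{"permission": READ_ONLY, "id": host_group_id}],
        },
        "auth": auth_token,
        "id": request_id,
    }


# Placeholder host group IDs; look up the real ones in your Zabbix instance.
teams = {"Team A": "101", "Team B": "102"}

requests_to_send = [
    usergroup_create_request(team, group_id, "<auth-token>", request_id=i)
    for i, (team, group_id) in enumerate(teams.items(), start=1)
]

for body in requests_to_send:
    print(json.dumps(body, indent=2))
```

Each body would be sent as an HTTP POST to the API endpoint of your Zabbix frontend; check the API documentation for your Zabbix version before relying on these field names.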
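The lifecycle described above, where a problem opens an incident and a recovery closes it, can be sketched as a small state tracker. This is purely illustrative: `IncidentTracker` and the event kinds `"problem"`, `"update"` and `"recovery"` are hypothetical names chosen to match the three operation types configured earlier, not iLert’s actual API or implementation.

```python
class IncidentTracker:
    """Hypothetical model of incidents keyed by the Zabbix event ID."""

    def __init__(self):
        self._incidents = {}

    def handle(self, event_id, kind, summary=""):
        """Apply a Zabbix-style event to the tracked incidents."""
        if kind == "problem":
            # A new problem opens an incident.
            self._incidents[event_id] = {"summary": summary, "status": "open"}
        elif kind == "update" and event_id in self._incidents:
            # An update changes the details but not the status.
            self._incidents[event_id]["summary"] = summary
        elif kind == "recovery" and event_id in self._incidents:
            # A recovery closes the matching incident.
            self._incidents[event_id]["status"] = "resolved"

    def status(self, event_id):
        """Return the incident status, or None if the event is unknown."""
        incident = self._incidents.get(event_id)
        return incident["status"] if incident else None


tracker = IncidentTracker()
tracker.handle("1001", "problem", "Disk space low on web-01")
print(tracker.status("1001"))
tracker.handle("1001", "recovery")
print(tracker.status("1001"))
```

Keying incidents by the event ID is what lets a later recovery message find and close the incident that the original problem opened.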
Below is an example incident from iLert created by Zabbix:
The incident will include a back link to the Zabbix Event Details page and any relevant items sent by Zabbix. Events in Zabbix also include a link to the incident in iLert: