Working with multiple on-call teams using Zabbix and iLert

05 Sep, 2020

Note: This is a repost of the original article at the Zabbix Blog.

This post outlines how to use Zabbix and iLert with multiple on-call teams, where each team is responsible for a set of host groups in Zabbix and therefore only receives alerts for the services it is responsible for. But first, let’s look at the basic needs of being on call.

The needs of an on-call engineer

As the guardians of production systems, on-call engineers most fundamentally need two things: the ability to detect that something is broken or about to break, and an effective way to bring it to a human's attention. A major benefit of a monitoring tool like Zabbix is that it lets you disengage from the systems you monitor and from the monitoring system itself. Once you define the conditions that need immediate attention from a human, you can rely on the passive monitoring of your network, hosts, and applications to watch for changing conditions and have the right person alerted.

While Zabbix excels at monitoring, it defers the responsibility of alerting the right people to dedicated alerting solutions through out-of-the-box integrations. Like most monitoring tools, Zabbix doesn’t provide the capabilities required to page on-call engineers, such as alerting via voice call, frictionless alert acknowledgement, on-call schedule management, and automatic escalations. Some organizations simply send emails to the entire team for urgent alerts, which often results in nobody taking responsibility and the emails being ignored. Besides, email is a poor alerting mechanism for urgent problems and should never be used as the primary alerting method.

Dedicated alerting systems extend monitoring tools with advanced alerting and on-call management capabilities. One tool that works out of the box with Zabbix is iLert. It is included as a media type in Zabbix 5.x, and for Zabbix 4.4+ it can be imported as a media type from the Zabbix GitHub repository.

What is iLert?

iLert is an alerting and on-call management solution that helps ops teams respond to incidents faster. It extends monitoring tools such as Zabbix with advanced alerting through SMS, phone calls, and push notifications, and lets you manage on-call duty with schedules and escalations.

With iLert’s Zabbix integration, you can automatically create incidents in iLert based on triggers in Zabbix and alert the on-call person through multiple channels, such as phone calls, SMS, push notifications, Slack, Microsoft Teams and more. Core features of iLert include:

  • Actionable alerts: you can acknowledge or escalate an incident on the same channel where you receive the alert, e.g. by replying to an SMS.
  • On-call schedules and automatic escalations: Always alert the right person and share on-call responsibility across your team with on-call schedules and automatic escalations.
  • Define alerting rules based on support hours and delay alerts until support hours start.

Setting up alerting with multiple teams

Let’s assume we have two teams, A and B, each responsible for a number of hosts and the applications running on them. We’re going to group all hosts of a team into a host group, create a user group for each team, and use both host and user groups to assign host permissions to the different teams. Since the individual team members, along with their contact data, on-call schedules, and escalation rules, are defined and managed in iLert, we’re not going to create a Zabbix user for every team member. Instead, we’re going to create a single user for every team. The user will be connected to the corresponding alert source in iLert. An alert source in turn is linked to the right team members and makes sure the right team member is notified using on-call schedules and escalation policies. The image below illustrates what our resulting setup will look like:

[Image: overview of the resulting setup]
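As a rough sketch of this setup, the mapping from team to host group, user group, Zabbix user, and iLert alert source can be written down as plain data. The object names follow the ones used in this post; the `TeamSetup` class and the helper functions are made up for illustration and are not part of Zabbix or iLert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamSetup:
    """One team's slice of the setup (illustrative model only)."""
    team: str
    host_group: str    # Zabbix host group holding the team's hosts
    user_group: str    # Zabbix user group with Read access to host_group
    zabbix_user: str   # the single Zabbix user created for the team
    alert_source: str  # iLert alert source the user's media points to


def build_setup(teams):
    """Derive the per-team object names from a list of team letters."""
    return [
        TeamSetup(
            team=t,
            host_group=f"Team {t} servers",
            user_group=f"Team {t}",
            zabbix_user=f"iLert Team {t} alert source",
            alert_source=f"Team {t} Zabbix",
        )
        for t in teams
    ]


def alert_source_for(host_group, setup):
    """Return the alert source that should receive alerts for a host group."""
    matches = [s.alert_source for s in setup if s.host_group == host_group]
    if len(matches) != 1:
        # Every host group should be owned by exactly one team.
        raise LookupError(
            f"expected exactly one owner for {host_group!r}, found {len(matches)}"
        )
    return matches[0]


setup = build_setup(["A", "B"])
print(alert_source_for("Team B servers", setup))
```

Each host group has exactly one owning team, so every alert ends up at exactly one alert source, and iLert decides from there who gets notified.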

Now let’s implement this scenario step-by-step.

In iLert: create two alert sources of type Zabbix.

  1. Go to Alert sources and click on Add a new alert source.
  2. Set a name (e.g. “Team A Zabbix”), select Team A’s escalation policy, and set the Integration type to Zabbix.
  3. Click Save. An API key is generated on the next page. You will need the API key in the next section.
  4. Repeat steps 1-3 and create an alert source for Team B.

In Zabbix: create host and user groups, and users

  1. Go to Configuration –> Host groups

  2. Create two host groups Team A servers and Team B servers and assign the hosts of each team to the corresponding host group.

  3. Go to Administration –> User groups

  4. Create two user groups, Team A and Team B, and assign each group Read permissions to its respective host group.

  5. Create a user named iLert Team A alert source:

    • Assign the user to the user group Team A.
    • Switch to the Media tab and add iLert as the media type.
    • Enter the API key of Team A’s alert source in the Send to field.
    • Click Add to create the user.
  6. Repeat step 5 to create the user iLert Team B alert source, assigning it to the user group Team B and using the API key of Team B’s alert source.
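If you have more than a handful of teams, clicking through these steps gets tedious, and the same objects can be created through the Zabbix JSON-RPC API. The sketch below only builds the request bodies and does not send anything. It makes several assumptions: the field names (`alias`, `user_medias`, `rights`) are the ones we recall from the Zabbix 5.0 API and some were renamed in later versions, and all IDs, the auth token, the password, and the API key are placeholders. Check the API documentation for your Zabbix version before using it.

```python
import json

# Assumption: permission level 2 means read-only in the Zabbix 5.0 API.
READ_ONLY = 2


def rpc(method, params, auth, req_id=1):
    """Wrap params in a JSON-RPC 2.0 request body (auth token in the body)."""
    return {"jsonrpc": "2.0", "method": method, "params": params,
            "auth": auth, "id": req_id}


def host_group_request(team, auth):
    """Request body for creating the team's host group."""
    return rpc("hostgroup.create", {"name": f"Team {team} servers"}, auth)


def user_group_request(team, host_group_id, auth):
    """Request body for a user group with Read access to one host group."""
    params = {
        "name": f"Team {team}",
        "rights": [{"id": host_group_id, "permission": READ_ONLY}],
    }
    return rpc("usergroup.create", params, auth)


def user_request(team, user_group_id, media_type_id, ilert_api_key,
                 password, auth):
    """Request body for the team's single Zabbix user with iLert media."""
    params = {
        "alias": f"iLert Team {team} alert source",
        "passwd": password,
        "usrgrps": [{"usrgrpid": user_group_id}],
        # The alert source's API key goes into the Send to field.
        "user_medias": [{"mediatypeid": media_type_id,
                         "sendto": ilert_api_key}],
    }
    return rpc("user.create", params, auth)


# Placeholder values for illustration only.
body = user_group_request("A", host_group_id="42", auth="<auth-token>")
print(json.dumps(body, indent=2))
```

Each body would be sent as an HTTP POST to your Zabbix frontend's `api_jsonrpc.php` endpoint, using the IDs returned by the earlier calls as input for the later ones.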

In Zabbix: Create trigger actions

  1. Create an action and give it a name (e.g. Notify iLert)
  2. Switch to Operations tab and add the following operations under Operations, Recovery operations, Update operations:
    • Add both user groups Team A and Team B in the Send to user group field.
    • Choose iLert in the Default media type field
    • The resulting operations view should look like this

But wait… Wouldn’t Team A get notifications for Team B’s problems and vice versa? No, since both user groups have only read permissions for the host groups they are responsible for, they will only receive notifications related to their own host groups.

However, you might want to consider creating separate trigger actions for each team, for example if you want to define different conditions for the trigger action, or if you have a large number of teams and want to keep things separated for the sake of maintainability.
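To make the permission-based filtering of the single shared action more concrete, here is a toy model in Python. It is not how Zabbix implements this internally; it only mirrors the rule described above, namely that a user group is notified about an event only if it has Read permission on the host group the event belongs to. The function and variable names are made up for illustration.

```python
def recipients(event_host_group, action_user_groups, read_permissions):
    """Return the user groups that would be notified for an event.

    A group listed in the action is kept only if it can read the
    host group the event originated from.
    """
    return [
        group
        for group in action_user_groups
        if event_host_group in read_permissions.get(group, set())
    ]


# Each user group can read only its own host group.
read_permissions = {
    "Team A": {"Team A servers"},
    "Team B": {"Team B servers"},
}

# The single trigger action sends to both user groups.
action_user_groups = ["Team A", "Team B"]

for host_group in ("Team A servers", "Team B servers"):
    print(host_group, "->",
          recipients(host_group, action_user_groups, read_permissions))
# Team A servers -> ['Team A']
# Team B servers -> ['Team B']
```

Even though both groups are listed in the action, each event reaches only the group that is allowed to see the affected hosts.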

That’s it!

Conclusion

Both teams will now be notified automatically of problems in their services. Incidents in iLert will be closed automatically when the problem in Zabbix is resolved. And everything related to responding to an incident is managed and handled by iLert, including which team member to alert, which alerting channels to use, when to escalate, and potentially engaging other stakeholders through iLert’s stakeholder engagement feature.

Below is an example incident from iLert created by Zabbix:

The incident will include a link back to the Zabbix Event Details page and any relevant items sent by Zabbix. Events in Zabbix also include a link to the incident in iLert.