Less is more: Incident management and monitoring in hybrid IT infrastructures

Elias Völker
August 2, 2021

Many companies are continuously modernizing their infrastructure, but there is no standard path to the perfect IT infrastructure. Still, hybrid architectures have become the status quo in enterprises. Almost all organizations have migrated at least parts of their assets to the cloud or run applications as cloud services. At the same time, businesses want to align their IT architecture more closely with software development and are therefore embracing dynamic infrastructures.

As part of these developments, IT teams are also adjusting their monitoring approaches. On the one hand, system administrators continue to monitor on-premises systems with established monitoring solutions. On the other hand, cloud specialists, platform engineers and developers increasingly work in dynamic infrastructures and use their own monitoring tools, which are better suited to their individual requirements. They use these tools to monitor specific applications and the performance metrics relevant to their area of specialization.

In principle, using multiple monitoring solutions is a legitimate way to meet the differing requirements of individual teams. Without proper coordination and integration of these solutions, however, there is a danger that information silos will emerge and incident management will become increasingly time-consuming. This can result in longer incident handling times and reduced availability.

The challenge lies in the fact that monitoring responsibilities are usually split in a modern IT infrastructure, while the IT systems themselves remain interdependent. System administrators are still responsible for monitoring on-premises infrastructure, whereas responsibility for cloud and container infrastructure lies with other teams such as DevOps specialists or developers. Because the areas that are separated in monitoring often depend on each other, some problems can only be solved through collaboration between teams.

Solutions such as iLert are an important aid in transmitting alerts from a wide variety of sources to the right parties. At the same time, the information from an alert is not always sufficient to fix a problem. Instead, the responsible employee needs access to more detailed contextual information in order to be able to initiate the right countermeasures or provide information to a colleague from another team. If they do not find the details in the monitoring tool they are using, they have to obtain them by other means, which can cost valuable time.

Monitoring in the hybrid world: Things to consider

A first step is therefore to review the monitoring solutions currently in use and consolidate them. As mentioned, it is important that IT teams can use monitoring tools that suit their operational needs. In practice, however, inadequate solutions are often implemented and then supplemented with additional tools. The more monitoring solutions companies use in parallel, the greater the risk that information silos develop and teams cannot share information efficiently.

Figure 1: With the appropriate monitoring tool, the impact of an incident on other systems can be quickly identified.

One example is the monitoring of servers and networks by separate teams. In such scenarios, it is worthwhile to unify monitoring in a single tool so that the relationships and interactions between individual components can be identified immediately. If a switch is overloaded, for example, the connected servers cannot function properly. When multiple monitoring tools are in use, the network and server administrators each receive an alert, either directly from their monitoring tool or via iLert. Even so, they may not see the full extent of the problem.

Modern IT infrastructure monitoring solutions such as Checkmk come bundled with integrations for various systems. Through flexible user management, they can also granularly control access and administration rights for individual resources and components and for the staff responsible for them. In the example above, the server administrator would immediately recognize that the server itself is working properly and that the problem lies with the switch.
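The switch example can be sketched in a few lines of Python. This is only an illustration of the idea of parent/child dependencies, not how Checkmk implements it; the `Host` class, the `healthy` flag and the host names are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    """A monitored device (hypothetical model for this sketch)."""
    name: str
    healthy: bool                      # result of the device's own check
    parent: Optional["Host"] = None    # upstream device, e.g. the switch


def root_cause(host: Host) -> Optional[Host]:
    """Return the topmost unhealthy device on the path from host upwards.

    Returns None if neither the host nor any upstream device is unhealthy.
    """
    cause = None if host.healthy else host
    node = host.parent
    while node is not None:
        if not node.healthy:
            cause = node               # a failure further up takes precedence
        node = node.parent
    return cause


# Hypothetical topology: router -> switch -> server
router = Host("core-router", healthy=True)
switch = Host("switch-01", healthy=False, parent=router)
server = Host("web-01", healthy=True, parent=switch)

print(root_cause(server).name)  # switch-01
```

With all devices in one model, an alert for `web-01` can be traced to the switch without the server and network teams having to compare notes across two tools.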

In addition, many monitoring approaches used by system administrators are not designed for cloud technologies and dynamic systems. Often monitoring teams try to use outdated tools to monitor modern systems. This makes monitoring inadequate and problem resolution particularly inefficient.

Figure 2: When choosing a monitoring tool, look for support for popular cloud providers such as AWS.

Intelligent integration of the necessary monitoring tools

Of course, the requirements of individual IT teams are too varied to be able to map them all in a single monitoring tool. DevOps managers, for example, have differing requirements from classic system administrators. Agile software development is more about checking targets and metrics for individual applications, while classic infrastructure monitoring focuses on the monitoring of all systems. A system admin wants to be able to quickly include assets in the monitoring and to minimize the manual handling of each system.

It is therefore unlikely that organizations will be able to satisfy the various needs of the different IT teams with only a single monitoring tool. This is why integrations between individual monitoring tools are a good approach. Not only do they allow for an automatic exchange of data between diverse tools, but at the same time they improve the ability of different IT teams to interact, especially in the remediation of any problems that may be identified.

The combination of Prometheus and Checkmk is an example of a practical integration of two monitoring solutions, making it ideal for collaboration between development teams and IT operations. Prometheus is a popular tool with DevOps teams for monitoring Kubernetes clusters. Developers in particular can use the PromQL query language to query specific metrics.
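As a rough illustration of such a query, the sketch below builds a request URL for the Prometheus HTTP API endpoint `/api/v1/query` and parses an instant-vector response. The server address, the `shop` namespace, the pod names and the sample values are assumptions invented for this example; no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical PromQL query: CPU usage per pod over the last five minutes.
PROMQL = 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="shop"}[5m]))'


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict:
    """Turn an instant-vector response into a {pod: value} mapping."""
    if payload.get("status") != "success":
        raise ValueError("query failed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each sample carries its labels and a [timestamp, "value"] pair.
    return {
        sample["metric"].get("pod", "<unknown>"): float(sample["value"][1])
        for sample in data["result"]
    }


# Invented response in the shape the API returns for instant queries.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "checkout-7d9f"}, "value": [1627900000, "0.42"]},
            {"metric": {"pod": "cart-5c2b"}, "value": [1627900000, "0.07"]},
        ],
    },
}

url = build_query_url("http://prometheus.example:9090", PROMQL)
cpu_by_pod = parse_instant_vector(SAMPLE_RESPONSE)
```

In a real setup the URL would be fetched with an HTTP client and the JSON body passed to `parse_instant_vector`.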

In contrast, Checkmk is mainly used by administrators and is suitable for the monitoring of a variety of on-premises and cloud infrastructures. More than 3,000 organizations currently use Checkmk to monitor servers, networks, storage, databases, Kubernetes, IoT and many other assets.

Checkmk can query monitoring data from Kubernetes clusters directly from Prometheus and put this data into the right context. The system administrator thus recognizes, for example, when problems in the IT infrastructure affect a container that a developer is currently using. No manual steps or switching between the individual tools are necessary. At the same time, the integration saves resources because Checkmk can retrieve the monitoring data for a system from Prometheus and does not have to pull the data again from the monitored system.
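The idea of putting container data into infrastructure context can be illustrated with a small sketch. It is not Checkmk's actual implementation: it assumes Prometheus samples that carry `pod` and `node` labels (modeled on the `kube_pod_info` metric from kube-state-metrics) and a set of host names that infrastructure monitoring has flagged as having problems. All names below are made up.

```python
from collections import defaultdict


def pods_by_node(pod_info_samples: list) -> dict:
    """Group pod names by the node they are scheduled on."""
    mapping = defaultdict(list)
    for sample in pod_info_samples:
        labels = sample["metric"]
        mapping[labels["node"]].append(labels["pod"])
    return dict(mapping)


def affected_pods(pod_info_samples: list, failing_hosts: set) -> dict:
    """Return {host: [pods]} for every failing host that runs pods."""
    by_node = pods_by_node(pod_info_samples)
    return {
        host: sorted(by_node[host])
        for host in sorted(failing_hosts)
        if host in by_node
    }


# Invented samples, in the shape of a Prometheus instant-vector result.
samples = [
    {"metric": {"pod": "checkout-7d9f", "node": "k8s-node-1"}, "value": [0, "1"]},
    {"metric": {"pod": "cart-5c2b", "node": "k8s-node-1"}, "value": [0, "1"]},
    {"metric": {"pod": "search-91aa", "node": "k8s-node-2"}, "value": [0, "1"]},
]

# Hosts flagged by infrastructure monitoring (hypothetical).
failing = {"k8s-node-1", "db-server-3"}

print(affected_pods(samples, failing))
```

Because the pod-to-node mapping comes from data Prometheus already holds, nothing has to be collected from the cluster a second time.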

Prometheus and Checkmk are both open source solutions, and each is under continuous development. This minimizes the risk of vendor lock-in. Currently, more than 2,000 monitoring integrations are available for Checkmk, allowing it to monitor a large range of systems from diverse manufacturers.

Both Prometheus and Checkmk can serve as alert sources for iLert, which allows you to conveniently manage alerts from both tools in one place. iLert not only informs on-call staff of potential incidents, but also helps different teams keep a shared view of progress. In addition, you can keep track of incident management procedures and ensure that your staff is resolving issues efficiently.
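To make the idea of managing alerts from both tools in one place more concrete, the following sketch normalizes alerts from two sources into a common shape and groups them by host. It does not use the iLert API. The payload shapes are simplified assumptions (the Prometheus one is loosely modeled on an Alertmanager webhook alert, the Checkmk one on a notification context), and every host and field value is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common alert shape used in this sketch (hypothetical)."""
    source: str
    host: str
    summary: str


def from_prometheus(alert: dict) -> Alert:
    """Normalize a simplified Alertmanager-style alert."""
    return Alert("prometheus", alert["labels"]["instance"],
                 alert["annotations"]["summary"])


def from_checkmk(context: dict) -> Alert:
    """Normalize a simplified Checkmk-style notification context."""
    summary = f'{context["SERVICEDESC"]} is {context["SERVICESTATE"]}'
    return Alert("checkmk", context["HOSTNAME"], summary)


def group_by_host(alerts: list) -> dict:
    """Collect alerts for the same host so both teams see one incident."""
    incidents = {}
    for alert in alerts:
        incidents.setdefault(alert.host, []).append(alert)
    return incidents


alerts = [
    from_prometheus({
        "labels": {"instance": "k8s-node-1", "alertname": "PodCrashLooping"},
        "annotations": {"summary": "Pod checkout-7d9f is restarting"},
    }),
    from_checkmk({
        "HOSTNAME": "k8s-node-1",
        "SERVICEDESC": "Interface 3",
        "SERVICESTATE": "CRITICAL",
    }),
]

incidents = group_by_host(alerts)
```

Here the developer's container alert and the administrator's interface alert end up in the same incident for `k8s-node-1`, which is the kind of shared context that shortens handovers between teams.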
