How we built agentic incident response

July 2, 2025

AI is already transforming how we detect, respond to, and resolve outages. Traditional workflows often force responders to switch between dashboards, sift through logs, and coordinate across fragmented channels under stress. This reactive, manual approach leads to slower resolution, higher operational costs, and burnout, especially as IT systems grow more complex.

At ilert, we are not just discussing the future of incident management – we are actively building it. We have brought agentic incident response into production, enabling operational excellence while reducing manual toil and cognitive load for on-call teams. Here is how we made this vision a reality.

Read our blog post on agentic incident response and the introduction of ilert Responder.

Building the foundation: Hive and the ilert AI voice agent

Our journey into agentic incident response began with architectural decisions prioritising flexibility, scalability, and intelligent action across all stages of the incident lifecycle.

Hive: Our LLM orchestration layer

Hive is our proprietary proxy and orchestration layer for large language models (LLMs). It powers intelligent incident summaries, contextual recommendations, and advanced workflows across ilert, enabling us to manage multiple model providers, optimise workload routing, and ensure a secure, consistent, and high-performance AI backbone for all use cases.

Hive allows us to seamlessly integrate new LLMs as they emerge, control cost efficiency by routing tasks to the best-fit model, and maintain data privacy while delivering highly contextual intelligence in real time.
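
To make the best-fit routing idea concrete, here is a minimal sketch in Python. It is not Hive's actual API: the model names, prices, quality tiers, and the `route` function are all hypothetical, and only illustrate how a proxy layer might pick the cheapest model that still meets a task's requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                  # hypothetical model identifier
    cost_per_1k_tokens: float  # illustrative price, not real pricing
    max_context_tokens: int
    quality_tier: int          # 1 = basic, 3 = strongest reasoning


# Hypothetical catalogue; a real proxy would load this from configuration.
MODELS = [
    Model("small-fast", 0.10, 16_000, 1),
    Model("medium", 0.50, 64_000, 2),
    Model("large-reasoning", 3.00, 200_000, 3),
]


def route(required_tier: int, prompt_tokens: int, models=MODELS) -> Model:
    """Return the cheapest model that satisfies quality and context needs."""
    candidates = [
        m for m in models
        if m.quality_tier >= required_tier
        and m.max_context_tokens >= prompt_tokens
    ]
    if not candidates:
        raise ValueError("no model satisfies the request")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)
```

Under these assumptions, a short incident summary (`route(1, 2_000)`) would go to the smallest model, while a root-cause analysis over a long log excerpt would be routed to a larger one. Adding a new provider then only means adding a catalogue entry.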

AI voice agent for seamless responder interaction

Communication is critical during incidents, especially when responders need to act without being tethered to keyboards. Our AI voice agent enables responders to gather updates or report incidents verbally, integrating into existing call flows as a natural part of the process. It transforms voice interactions into structured, actionable alerts while synthesising updates from diverse data sources, bridging human intuition with automated data-driven action.

What is MCP (Model Context Protocol)?

The Model Context Protocol (MCP) is an open protocol, introduced by Anthropic, for connecting AI applications to external tools and data sources. At ilert, it connects your data to the ilert Responder, providing the rich, structured context our agents need to act intelligently during incidents.

Why did we build on MCP?

Traditional integrations often leave systems disconnected, requiring manual correlation across telemetry, logs, and infrastructure state during incidents. We use MCP to eliminate these silos by automatically aggregating, structuring, and transmitting incident-relevant context in real time.

How does MCP work?

MCP gathers data from your monitoring systems, log aggregators, deployment pipelines, and infrastructure environments, processes it within a secure, EU-compliant, multi-tenant architecture, and delivers only the necessary data to our agentic responders. By doing so, MCP:

Ensures your agent has real-time, granular awareness of incidents;
Maintains strict data security, isolation, and compliance;
Reduces manual correlation and cognitive load during critical moments;
Enables low-latency, context-rich interactions with the ilert Responder.

Think of MCP as the nervous system that links your observability stack, code repositories, and infrastructure directly to our AI systems, so that decisions and suggestions are grounded in current context and stay accurate, actionable, and relevant.
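
The "only the necessary data" step described above can be sketched as a simple filter. This is an illustration under stated assumptions, not ilert's implementation: the `ContextItem` type, the `select_context` function, and the 30-minute window are hypothetical. The point is that tenant isolation and relevance are enforced before anything reaches the agent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ContextItem:
    tenant_id: str
    service: str
    kind: str            # e.g. "log", "metric", "deployment"
    timestamp: datetime
    payload: str


def select_context(items, tenant_id, service, alert_time,
                   window=timedelta(minutes=30)):
    """Keep only this tenant's items for the affected service near the alert."""
    start = alert_time - window
    selected = [
        item for item in items
        if item.tenant_id == tenant_id          # tenant isolation
        and item.service == service             # relevance to the alert
        and start <= item.timestamp <= alert_time
    ]
    # Oldest first, so the agent reads events in the order they happened.
    return sorted(selected, key=lambda item: item.timestamp)
```

Filtering by tenant first means data from another customer can never end up in the agent's prompt, even if a later relevance check is too permissive.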

The ilert Responder pipeline: From alert to agent-proposed actions

We designed an end-to-end pipeline that transforms monitoring signals into intelligent, actionable workflows to accelerate incident resolution.

Event Flow → Alert

ilert Event Flow ingests monitoring signals and applies your rules and thresholds to trigger alerts when specific conditions are met. This ensures the right teams are notified the moment an incident requires attention, without unnecessary noise.
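
As a rough illustration of rule-based alerting, the sketch below evaluates an incoming event against a list of threshold rules. The `Rule` type, the metric names, and the `evaluate` function are hypothetical and do not reflect Event Flow's actual configuration format.

```python
import operator
from dataclasses import dataclass

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Rule:
    metric: str
    op: str
    threshold: float
    team: str  # who to notify when the rule fires


def evaluate(event: dict, rules: list[Rule]) -> list[dict]:
    """Return one alert per rule whose condition the event satisfies."""
    alerts = []
    for rule in rules:
        value = event.get(rule.metric)
        if value is None:
            continue  # the event carries no data for this rule
        if OPERATORS[rule.op](value, rule.threshold):
            alerts.append(
                {"metric": rule.metric, "value": value, "team": rule.team}
            )
    return alerts
```

An event that matches no rule produces an empty list, which is how signals below the threshold stay out of responders' way.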

MCP (Model Context Protocol) comes into play

As soon as an alert is created, relevant telemetry data, logs, recent deployment changes, and infrastructure status are retrieved and structured via MCP and delivered securely to the ilert Responder. This gives the Responder comprehensive situational awareness and removes the manual task of gathering context during incidents. It is made possible by context-aware integrations with:

Observability tools: To pull telemetry and time-series data from Prometheus and Grafana;
Code repositories: To access commit history and deployment metadata from GitHub;
Infrastructure environments: To gain real-time status and configurations from Kubernetes.
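
In MCP, a client invokes a server-side tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds one such request per integration listed above. The envelope follows the MCP specification, but the tool names (`prometheus_query`, `github_list_commits`, `kubernetes_get_pods`) and their arguments are hypothetical, since the tools exposed by ilert's integrations are not described here.

```python
import itertools

_request_ids = itertools.count(1)


def tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request (a JSON-RPC 2.0 message)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def context_requests(service: str, namespace: str, repo: str) -> list[dict]:
    """One request per integration; tool names and arguments are hypothetical."""
    query = f'rate(http_requests_total{{service="{service}"}}[5m])'
    return [
        tool_call("prometheus_query", {"query": query}),
        tool_call("github_list_commits", {"repo": repo, "limit": 5}),
        tool_call("kubernetes_get_pods", {"namespace": namespace}),
    ]
```

Because every integration is addressed through the same request shape, adding another data source does not change how the agent asks for context.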

ilert Responder proposes actions

The ilert Responder ingests and correlates data in real time, becoming an intelligent participant in incident response rather than a passive notification system. Leveraging its deep, contextual understanding, the ilert Responder formulates actionable recommendations such as:

Root-cause suggestions,
Step-by-step remediation instructions,
Escalation paths and dependency insights.

These are presented within the ilert chat interface, allowing responders to review, approve, or modify actions for safe execution during live incidents. The interactive chat UI evolves into a command centre, enabling responders to:

Request deeper insights dynamically,
Perform direct actions like scaling Kubernetes pods,
Drill down into suggested root causes and metrics seamlessly.
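
The review, approve, or modify flow can be pictured as a small state machine around each proposed action. The following is a minimal sketch, not the ilert Responder's data model: `Proposal`, `Status`, and the helper functions are hypothetical names, and `runner` stands in for whatever actually performs the action.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    EXECUTED = "executed"


@dataclass(frozen=True)
class Proposal:
    action: str
    params: dict
    rationale: str
    status: Status = Status.PROPOSED


def modify(proposal: Proposal, **changes) -> Proposal:
    """Let a responder adjust parameters before approving."""
    if proposal.status is not Status.PROPOSED:
        raise ValueError("only pending proposals can be modified")
    return replace(proposal, params={**proposal.params, **changes})


def approve(proposal: Proposal) -> Proposal:
    return replace(proposal, status=Status.APPROVED)


def execute(proposal: Proposal, runner) -> Proposal:
    """Run the action, but only once a responder has approved it."""
    if proposal.status is not Status.APPROVED:
        raise PermissionError("proposal must be approved before execution")
    runner(proposal.action, proposal.params)
    return replace(proposal, status=Status.EXECUTED)
```

For example, a responder could take a proposal to scale a deployment to six replicas, call `modify(proposal, replicas=4)`, approve the result, and only then have it executed. Keeping proposals immutable means the original suggestion remains available for the incident timeline.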

Operational improvements

Agentic incident response at ilert is delivering tangible results for engineering and operations teams:

Real-time log correlation and root cause inference to pinpoint likely causes within moments;
Diagnostic summaries providing human-readable, actionable overviews of incidents;
Interactive natural language Q&A with the agent for fast data retrieval and contextual clarity;
Actionable remediation proposals with direct, safe execution workflows;
Automated post-mortems and timelines to reduce manual documentation effort post-incident.

By reducing manual toil and accelerating clarity, teams are spending less time managing incidents and more time focusing on delivering reliable services.

Key learnings and best practices

Building and operating agentic systems for mission-critical incident management at ilert has taught us:

Trust through transparency: Autonomous data collection, correlation, and safe, pre-approved actions happen without manual steps, ensuring speed and reducing cognitive load for responders. For actions with higher risk or business impact, teams can choose to add approval steps if desired. Full transparency into what the agent is doing and why builds trust, enabling responders to understand and oversee agentic actions without slowing down resolution.
Guarding against hallucinations: Rich, structured, and verified context via MCP ensures the agent works with coherent, reliable information, significantly reducing the risk of inaccurate suggestions or actions.
Performance tuning for low latency: Incident response is time-critical. Through speculative tool calls and optimised data pathways, we generate insights and actions in near real time, reducing MTTR when every second counts.
Continuous learning: Feedback loops integrated into workflows help our agent refine its recommendations and actions over time, improving accuracy and effectiveness with every incident.
Safe autonomous execution: By defining safe, controlled scopes for automated remediation, the agent can execute corrective actions independently where appropriate, accelerating resolution while retaining operational safety and rollback capabilities.
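
The speculative tool calls mentioned under performance tuning can be illustrated with a short asyncio sketch. How ilert implements them is not described in this post, so this is one common interpretation: start the tool calls the agent is likely to need while the model is still deciding, reuse the ones it actually asks for, and cancel the rest. The `fetch` stand-in and the tool names are hypothetical.

```python
import asyncio


async def fetch(tool: str, delay: float = 0.01) -> str:
    """Stand-in for a slow tool call (sleeps instead of doing real I/O)."""
    await asyncio.sleep(delay)
    return f"result of {tool}"


async def respond(likely_tools, decide):
    """Start likely tool calls early, then keep only what the model asks for."""
    # Speculatively launch calls while the model is still deciding.
    pending = {tool: asyncio.create_task(fetch(tool)) for tool in likely_tools}
    wanted = await decide()  # e.g. the LLM choosing which tools it needs

    results = {}
    for tool in wanted:
        task = pending.pop(tool, None)
        if task is None:
            # Not predicted in advance, so start it now.
            task = asyncio.create_task(fetch(tool))
        results[tool] = await task

    # Cancel speculation that turned out to be unnecessary.
    leftovers = list(pending.values())
    for task in leftovers:
        task.cancel()
    await asyncio.gather(*leftovers, return_exceptions=True)
    return results
```

When the prediction is right, the tool result is already available by the time the model requests it, so the tool's latency overlaps with the model's thinking time instead of adding to it. The cost is some wasted work on wrong guesses, which is why this only makes sense for cheap, read-only calls.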

Conclusion: Agentic incident response is already here

At ilert, we believe that the era of manual, reactive incident management is ending, and the benefits of agentic automation are too significant to delay. We are proud to bring these advanced capabilities into production, reducing toil, cutting MTTR, and empowering teams to focus on what matters most: reliability and innovation.

While ilert Responder already automates data gathering, analysis, and remediation suggestions, this release is just the first milestone. Our next goal is to let ilert Responder resolve well-understood, low-risk incidents – like flaky health checks or transient latency spikes – entirely on its own. Human responders stay in control, but much of the routine toil will fade away.

Want to see it in action? Explore the ilert Responder, join our beta program, or contact us for a personalised demo to bring agentic incident response into your on-call workflow.