AI-First Incident Management. With Privacy in Mind.

Designed to augment teams with intelligent agents while keeping humans in control.

Start for free

Book a demo

Benefits

AI-first technology for modern teams with fast response times

ilert is the AI-first incident management platform, designed from the ground up as a single application that covers the entire incident response lifecycle.

Reliable & actionable alerting

Reliable alerts via voice, SMS, and push notifications. Frictionless acknowledgement, with no need to log in anywhere.

On-call management

Always alert the right person and share on-call responsibility across your team with on-call schedules and automatic escalations.

Status pages

Build trust and communicate incidents in seconds with status pages that are connected with your infrastructure.

Incident communication

Effectively communicate IT incidents to stakeholders throughout the entire service chain in a matter of seconds.

Call routing

Directs incoming calls efficiently based on schedules and escalation paths, ensuring prompt incident response.

ChatOps

Integrates collaboration tools like Slack, streamlining incident communication and decision-making within chat channels.

ilert AI

AI-first. All-in-one incident management.

Intelligent agents for every stage of the incident lifecycle.

Discover all AI features

On-call schedule assistant

Share your scheduling needs in a simple, chat-like interface. Add team members, rotation rules, and timeframes — and get a ready-to-use on-call calendar everyone can access.

Let AI take the call

Introducing the ilert AI Voice Agent—your first responder for calls, gathering key details and informing your on-call engineers.

Status updates in no time

ilert AI analyzes your system and incidents, offering quick updates and managing communications for efficient issue resolution.

ilert Responder – your real-time incident advisor

ilert Responder is an intelligent agent that analyzes incidents in real time. It connects to your observability stack, investigates alerts across systems, and surfaces actionable insights, without taking control away from your team.

Features

Analyze logs, metrics, and recent changes autonomously
Identify root causes and similar past incidents
Suggest responders, rollback paths, or related services
Ask questions in natural language and get direct, evidence-backed answers

Integrations

Get started immediately using our integrations

ilert seamlessly connects with your tools using our pre-built integrations or via email. ilert integrates with monitoring, ticketing, chat, and collaboration tools.

View all integrations

Transform your Incident Response today – start free trial

Customers

See how industry leaders achieve 99.9% uptime with ilert

Organizations worldwide trust ilert to streamline incident management, enhance reliability, and minimize downtime. Read what our customers have to say about their experience with our platform.

“We are using ilert to fix our problems sooner than our customers are realizing them. ilert gives our engineering and operations teams the confidence that we will react in time.”

Dr. Robert Zores

Chief Technology Officer

“ilert has helped Ingka significantly reduce both MTTR & MTTA over the last 3 years, the collaboration with the team at ilert is what makes the difference.”

Karan Honavar

Engineering Manager at IKEA

“Other teams are now checking whether they would also use ilert.”

Thilo Maass

Manager at Adesso AG

Stay up to date

Expert insights from our blog

Engineering

How we built agentic incident response

Discover how ilert turned the vision of agentic incident response into a production reality, cutting toil, MTTR, and alert fatigue for modern on-call teams.

Tim Gühnemann

Jul 02, 2025 • 5 min read

AI already transforms how we detect, respond to, and resolve outages. Traditional workflows often force responders to switch between dashboards, sift through logs, and coordinate across fragmented channels under stress. This reactive, manual approach leads to slower resolution, higher operational costs, and burnout, especially as IT systems grow more complex.

At ilert, we are not just discussing the future of incident management – we are actively building it. We have brought agentic incident response into production, enabling operational excellence while reducing manual toil and cognitive load for on-call teams. Here is how we made this vision a reality.

Read our blog post on agentic incident response and the introduction of ilert Responder.

Building the foundation: Hive and the ilert AI voice agent

Our journey into agentic incident response began with architectural decisions prioritising flexibility, scalability, and intelligent action across all stages of the incident lifecycle.

Hive: Our LLM orchestration layer

Hive is our proprietary proxy and orchestration layer for large language models (LLMs). It powers intelligent incident summaries, contextual recommendations, and advanced workflows across ilert, enabling us to manage multiple model providers, optimise workload routing, and ensure a secure, consistent, and high-performance AI backbone for all use cases.

Hive allows us to seamlessly integrate new LLMs as they emerge, control cost efficiency by routing tasks to the best-fit model, and maintain data privacy while delivering highly contextual intelligence in real time.
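
To make the idea of routing tasks to a best-fit model concrete, here is a minimal sketch. It is not Hive's actual implementation: the model names, costs, task types, and the "cheapest capable model wins" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str                  # hypothetical model identifier
    cost_per_1k_tokens: float  # hypothetical relative cost
    task_types: frozenset      # task types this model is assumed to handle


# Illustrative catalogue; names and costs are made up for this sketch.
MODELS = [
    ModelOption("small-fast", 0.2, frozenset({"summary", "classification"})),
    ModelOption("large-reasoning", 3.0, frozenset({"summary", "root_cause"})),
]


def route_task(task_type: str, options: list) -> ModelOption:
    """Return the cheapest model that is configured for the task type."""
    candidates = [m for m in options if task_type in m.task_types]
    if not candidates:
        raise ValueError(f"no model configured for task type {task_type!r}")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


print(route_task("summary", MODELS).name)     # small-fast
print(route_task("root_cause", MODELS).name)  # large-reasoning
```

A real orchestration layer would likely also weigh latency, data residency, and provider availability; cost is used here only because it is the simplest criterion to show.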

AI voice agent for seamless responder interaction

Communication is critical during incidents, especially when responders need to act without being tethered to keyboards. Our AI voice agent enables responders to gather updates or report incidents verbally, integrating into existing call flows as a natural part of the process. It transforms voice interactions into structured, actionable alerts while synthesising updates from diverse data sources, bridging human intuition with automated data-driven action.

What is MCP (Model Context Protocol)?

The Model Context Protocol (MCP) is a dynamic, real-time protocol built by Anthropic that connects your data to the ilert Responder, providing the rich, structured context our agents need to act intelligently during incidents.

Why did we build on MCP?

Traditional integrations often leave systems disconnected, requiring manual correlation across telemetry, logs, and infrastructure state during incidents. MCP was designed to eliminate these silos by automatically aggregating, structuring, and transmitting incident-relevant context in real time.

How does MCP work?

MCP gathers data from your monitoring systems, log aggregators, deployment pipelines, and infrastructure environments, processes it within a secure, EU-compliant, multi-tenant architecture, and delivers only the necessary data to our agentic responders. By doing so, MCP:

Ensures your agent has real-time, granular awareness of incidents;
Maintains strict data security, isolation, and compliance;
Reduces manual correlation and cognitive load during critical moments;
Enables low-latency, context-rich interactions with the ilert Responder.

Think of MCP as the neural network that links your observability stack, code repositories, and infrastructure directly to our AI systems, ensuring that decisions and suggestions are always contextually accurate, actionable, and relevant.

The ilert Responder pipeline: From alert to agent-proposed actions

We designed an end-to-end pipeline that transforms monitoring signals into intelligent, actionable workflows to accelerate incident resolution.

Event Flow → Alert

ilert Event Flow ingests monitoring signals and applies your rules and thresholds to trigger alerts when specific conditions are met. This ensures the right teams are notified the moment an incident requires attention, without unnecessary noise.
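
As a rough illustration of rules and thresholds turning a monitoring signal into an alert, consider the sketch below. The `Rule` shape, the metric names, and the team names are hypothetical and do not reflect Event Flow's real configuration format.

```python
import operator
from dataclasses import dataclass

COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Rule:
    metric: str
    comparator: str   # one of the keys in COMPARATORS
    threshold: float
    team: str         # hypothetical team to notify


# Made-up rules for illustration only.
RULES = [
    Rule("error_rate", ">", 0.05, "payments-oncall"),
    Rule("p99_latency_ms", ">=", 1500, "platform-oncall"),
]


def evaluate(event: dict, rules: list) -> list:
    """Return one alert per rule whose condition the event satisfies."""
    alerts = []
    for rule in rules:
        if event.get("metric") != rule.metric:
            continue
        if COMPARATORS[rule.comparator](event["value"], rule.threshold):
            alerts.append({
                "team": rule.team,
                "summary": f"{rule.metric} {rule.comparator} {rule.threshold}",
                "value": event["value"],
            })
    return alerts


print(evaluate({"metric": "error_rate", "value": 0.07}, RULES))
print(evaluate({"metric": "error_rate", "value": 0.01}, RULES))  # []
```

Events that match no rule produce no alert, which is the "without unnecessary noise" part of the description.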

MCP (Model Context Protocol) comes into play

Immediately upon alert generation, MCP retrieves and structures relevant telemetry data, logs, recent deployment changes, and infrastructure status, delivering it securely to the ilert Responder. This ensures the Responder has comprehensive situational awareness, eliminating the manual task of gathering context during incidents. This is possible through context-aware integrations with:

Observability tools: To pull telemetry and time-series data from Prometheus and Grafana;
Code repositories: To access commit history and deployment metadata from GitHub;
Infrastructure environments: To gain real-time status and configurations from Kubernetes.
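
The context-gathering step above could be sketched as follows. This is an illustrative assumption, not ilert's code: the collectors are stand-in functions rather than real Prometheus, GitHub, or Kubernetes clients, and the bundle layout is invented for the example.

```python
def collect_context(alert: dict, collectors: dict) -> dict:
    """Call every collector and bundle the results for the agent.

    A failing source is recorded under "errors" instead of aborting,
    so one slow integration does not hide the others.
    """
    context = {"alert_id": alert["id"], "sources": {}, "errors": {}}
    for name, collect in collectors.items():
        try:
            context["sources"][name] = collect(alert)
        except Exception as exc:
            context["errors"][name] = str(exc)
    return context


# Stand-in collectors (hypothetical); real ones would query external systems.
def fake_metrics(alert):
    return {"error_rate": 0.07}


def fake_commits(alert):
    return ["a1b2c3 bump connection pool size"]


def fake_cluster(alert):
    raise TimeoutError("cluster API timed out")


bundle = collect_context(
    {"id": "alert-1"},
    {"metrics": fake_metrics, "commits": fake_commits, "cluster": fake_cluster},
)
print(sorted(bundle["sources"]))  # ['commits', 'metrics']
print(bundle["errors"])           # {'cluster': 'cluster API timed out'}
```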

ilert Responder proposes actions

The ilert Responder ingests and correlates data in real time, becoming an intelligent participant in incident response rather than a passive notification system. Leveraging its deep, contextual understanding, the ilert Responder formulates actionable recommendations such as:

Root-cause suggestions,
Step-by-step remediation instructions,
Escalation paths and dependency insights.

These are presented within the ilert chat interface, allowing responders to review, approve, or modify actions for safe execution during live incidents. The interactive chat UI evolves into a command centre, enabling responders to:

Request deeper insights dynamically,
Perform direct actions like scaling Kubernetes pods,
Drill down into suggested root causes and metrics seamlessly.

Operational improvements

Agentic incident response at ilert is delivering tangible results for engineering and operations teams:

Real-time log correlation and root cause inference to pinpoint likely causes within moments;
Diagnostic summaries providing human-readable, actionable overviews of incidents;
Interactive natural language Q&A with the agent for fast data retrieval and contextual clarity;
Actionable remediation proposals with direct, safe execution workflows;
Automated post-mortems and timelines to reduce manual documentation effort post-incident.

By reducing manual toil and accelerating clarity, teams are spending less time managing incidents and more time focusing on delivering reliable services.

Key learnings and best practices

Building and operating agentic systems for mission-critical incident management at ilert has taught us:

Trust through transparency: Autonomous data collection, correlation, and safe, pre-approved actions happen without manual steps, ensuring speed and reducing cognitive load for responders. For actions with higher risk or business impact, teams can choose to add approval steps if desired. Full transparency into what the agent is doing and why builds trust, enabling responders to understand and oversee agentic actions without slowing down resolution.
Guarding against hallucinations: Rich, structured, and verified context via MCP ensures the agent works with coherent, reliable information, significantly reducing the risk of inaccurate suggestions or actions.
Performance tuning for low latency: Incident response is time-critical. Through speculative tool calls and optimised data pathways, we ensure that insights and actions are generated in near real-time, reducing MTTR when every second counts.
Continuous learning: Feedback loops integrated into workflows help our agent refine its recommendations and actions over time, improving accuracy and effectiveness with every incident.
Safe autonomous execution: By defining safe, controlled scopes for automated remediation, the agent can execute corrective actions independently where appropriate, accelerating resolution while retaining operational safety and rollback capabilities.
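
One way to picture pre-approved scopes, optional approval steps, and transparency together is the sketch below. The action kinds, the two scope sets, and the audit-log format are assumptions for illustration and are not taken from the ilert product.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    REJECT = "reject"


@dataclass(frozen=True)
class ProposedAction:
    kind: str     # e.g. "restart_service" (hypothetical action name)
    target: str
    reason: str   # the agent's explanation, kept for transparency


# Hypothetical scopes: low-risk actions run directly, riskier ones wait.
AUTO_APPROVED = {"restart_service"}
APPROVAL_REQUIRED = {"rollback_deployment", "scale_pods"}

AUDIT_LOG = []  # every decision is recorded, whatever the outcome


def decide(action: ProposedAction, approved_by=None) -> Decision:
    """Gate an action by scope and record what was decided and why."""
    if action.kind in AUTO_APPROVED:
        decision = Decision.EXECUTE
    elif action.kind in APPROVAL_REQUIRED:
        decision = Decision.EXECUTE if approved_by else Decision.NEEDS_APPROVAL
    else:
        decision = Decision.REJECT  # anything outside the defined scopes
    AUDIT_LOG.append((action.kind, action.target, action.reason,
                      decision.value, approved_by))
    return decision


restart = ProposedAction("restart_service", "checkout-api", "health check flapping")
rollback = ProposedAction("rollback_deployment", "checkout-api", "errors began after deploy")

print(decide(restart))                       # Decision.EXECUTE
print(decide(rollback))                      # Decision.NEEDS_APPROVAL
print(decide(rollback, approved_by="dana"))  # Decision.EXECUTE
```

Rejecting unknown action kinds by default keeps the automated scope explicit: nothing runs unless someone has deliberately listed it.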

Conclusion: Agentic incident response is already here

At ilert, we believe that the era of manual, reactive incident management is ending, and the benefits of agentic automation are too significant to delay. We are proud to bring these advanced capabilities into production, reducing toil, cutting MTTR, and empowering teams to focus on what matters most: reliability and innovation.

While ilert Responder already automates data gathering, analysis, and remediation suggestions, this release is just the first milestone. Our next goal is to let ilert Responder resolve well-understood, low-risk incidents – like flaky health checks or transient latency spikes – entirely on its own. Human responders stay in control, but much of the routine toil will fade away.

Want to see it in action? Explore the ilert Responder, join our beta program, or contact us for a personalised demo to bring agentic incident response into your on-call workflow.

Insights

On-call compensation for IT engineers in 2025

2025 guide to on-call pay: TVöD benchmark, global stipends, pay models, standby laws, and well-being best practices.

Daniel Weiß

Jun 30, 2025 • 5 min read

Imagine it’s 2 AM and a critical system flatlines without warning. A bleary-eyed on-call engineer scrambles to restore service, shielding customers from a major outage that could torpedo your next Service Level Objective (SLO) review. Yet when daylight returns, debates over fair on-call compensation start all over again: What’s “just” pay for sleepless nights, unpredictable pings, and rapid-fire incident responses?

What counts as on-call?

On-call is a special working hour arrangement under employment law. It comes into effect when the employee is obliged to be contactable at least by phone, so they can start work in an emergency. On-call duty is generally counted as time specifically meant for work purposes.

In practice, this means that employees are normally not allowed to work while on call. However, there may be exceptions. For example, on-call employees may also work from home if they can be reached through their work device.

What's the difference between on-call and stand-by service?

There’s a time-and-location gap between the two models:

On-call – employees stay reachable (phone, pager, or on-call management app) and can log in from anywhere when an alert fires.
Stand-by – staff must be physically present on site and ready to act immediately. German labour law labels this Bereitschaftsdienst as working time and treats it accordingly.

In IT operations, remote on-call service is usually preferred because most incidents (code rollbacks, config tweaks) can be resolved over VPN. Stand-by still matters for latency-critical environments, for example, trading platforms or industrial control systems, where a technician must monitor hardware and intervene within seconds to meet strict service-level agreements.

Are on-call hours the same as work hours?

Whether on-call duty counts as working hours isn’t as clear-cut as it looks. Under most labour-law frameworks – including Occupational Safety and Health guidance and the U.S. FLSA Fact Sheet #22 – passive on-call time is treated as rest time as long as no alert comes in. The moment you’re paged and start troubleshooting, those minutes flip to active working time. In borderline cases, courts (e.g., Germany’s BAG, Oct 2023 ruling 6 AZR 210/22) decide which periods qualify, so definitions often vary by jurisdiction and company policy.

There’s also no universal rule on pay. Many employers treat on-call duty as billable work and compensate it accordingly; others classify passive standby as unpaid availability. If your firm uses the latter model, remember you won’t be reimbursed for simply being reachable.

Bottom line: on-call time isn’t always the same as working time – it hinges on the organisation’s compensation policy. Some U.S. big-tech companies (Airbnb, Apple, Netflix) don’t pay for passive standby, while many European tech firms do.

On-call duty times

On-call scheduling is usually confined to specific nights or weekends agreed in advance and written into the employment contract. Because fewer staff are on site during those hours, reliable night and weekend coverage is essential.

In Germany, the ICT trade group Bitkom recommends capping on-call assignments at 56 days per calendar year and guaranteeing at least 8 consecutive hours of rest per shift – Bitkom’s guideline on Rufbereitschaft im IT-Betrieb. On-call duty is generally classified as non-working time, so the usual 11-hour rest break required by §5 (1) of the Arbeitszeitgesetz does not apply until the engineer has actively worked on an incident.
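
A quick way to check the 56-day cap mentioned above against a planned rota is sketched here. The function name and input shape are assumptions; the sketch counts distinct on-call days per calendar year and does not model the 8-hour rest recommendation.

```python
from collections import Counter
from datetime import date, timedelta

MAX_ON_CALL_DAYS_PER_YEAR = 56  # cap quoted in the text above


def days_over_cap(on_call_days: list, cap: int = MAX_ON_CALL_DAYS_PER_YEAR) -> dict:
    """Map each year that exceeds the cap to the number of excess days."""
    per_year = Counter(d.year for d in set(on_call_days))  # set() drops duplicates
    return {year: n - cap for year, n in per_year.items() if n > cap}


# Hypothetical engineer with 57 consecutive on-call days starting 1 Jan 2025.
planned = [date(2025, 1, 1) + timedelta(days=i) for i in range(57)]

print(days_over_cap(planned))       # {2025: 1}
print(days_over_cap(planned[:56]))  # {}
```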

Need an easy way to keep those limits visible? ilert’s on-call scheduling shows every planned rotation and actual shifts at a glance, so teams stay compliant without spreadsheets.

How is payment settled for on-call service in IT companies?

In IT companies, on-call hours are usually considered working time and are paid as such. As mentioned above, be sure to clarify this with your employer in advance to check what is stated in your contract.

For large corporations like Airbnb or Apple, which do not pay for on-call time, the argument is that their employees are already among the top earners. This means that their employees still earn much more than they would at most companies that pay on-call time in addition to their salary.

In Germany, there is no specific law regarding how on-call hours should be paid, so this is left to the employer’s discretion. Most commonly, however, on-call duty is treated as paid working time, i.e., the employee receives payment for the time he or she is on call. This can be structured in different ways.

In practice, on-call time is often compensated either on top of the standard hourly wage or with time off. In many companies, on-call time is also counted as working time and is paid for accordingly. However, this is only possible if the employee is actually working rather than merely being reachable by phone. As already mentioned, this would be the case while working from home.

In most tech organisations, hours spent on-call count as paid working time, yet the formula changes from company to company. Before you join a rota, double-check your contract or the internal on-call compensation policy.

In practice, you’ll see two common models:

Hourly uplift

A percentage on top of the standard rate for every scheduled standby hour.

Time-off swap

Eight hours on-call earn four hours of paid leave.
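
The arithmetic behind the two models is simple enough to sketch. The 25% uplift and the €40 hourly rate below are made-up inputs, since the text gives no specific percentage; the 0.5 swap ratio follows from the "eight hours earn four hours" example.

```python
def hourly_uplift_pay(standby_hours: float, hourly_rate: float,
                      uplift_pct: float) -> float:
    """Extra pay for standby hours at a percentage of the standard rate."""
    return standby_hours * hourly_rate * uplift_pct / 100


def time_off_earned(on_call_hours: float, ratio: float = 0.5) -> float:
    """Paid leave earned; the default ratio matches 8 h on call -> 4 h leave."""
    return on_call_hours * ratio


# Hypothetical inputs: 10 standby hours, 40.0 per hour, 25% uplift.
print(hourly_uplift_pay(10, 40.0, 25))  # 100.0
print(time_off_earned(8))               # 4.0
```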

Remember, only the minutes you actively work are universally classed as working time; simply being reachable may stay unpaid unless your company’s policy says otherwise.

How are on-call services paid in IT companies?

Pay still varies by company size, sector, and risk profile. The federal collective agreement for public employees (TVöD) specifies the following allowances in § 8 Abs. 3:

Stand-by shifts of 12 hours or longer

Weekdays (Mon–Fri): paid at 2× the hourly rate for the entire day.

Weekends and public holidays: paid at 4× the hourly rate for the entire day.

Shorter stand-by windows (under 12 h)

Earn an additional 12.5 % of your hourly rate for each hour on call.
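
The allowances above translate into a small calculation. This sketch assumes that "2× / 4× the hourly rate for the entire day" means a flat daily amount rather than a per-hour multiplier; the €30 hourly rate is a made-up input.

```python
def standby_allowance(hours: float, hourly_rate: float,
                      weekend_or_holiday: bool) -> float:
    """Allowance for one stand-by shift, following the rules listed above."""
    if hours >= 12:
        # Flat daily amount: 2x on weekdays, 4x on weekends and holidays.
        multiplier = 4 if weekend_or_holiday else 2
        return multiplier * hourly_rate
    # Shorter windows: 12.5% of the hourly rate for each hour on call.
    return hours * hourly_rate * 0.125


# Hypothetical hourly rate of 30.0.
print(standby_allowance(12, 30.0, weekend_or_holiday=False))  # 60.0
print(standby_allowance(24, 30.0, weekend_or_holiday=True))   # 120.0
print(standby_allowance(8, 30.0, weekend_or_holiday=False))   # 30.0
```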

At a large corporation or a successful start-up, you can expect on-call pay of about €1,000 per week. At Zalando, the on-call compensation is roughly €1,050; at the start-up HelloFresh, €1,000; and at Amazon Germany, about €800. Several companies in the financial sector offer comparable rates, although exact amounts vary. The Pragmatic Engineer blog reports the following figures:

SumUp (Germany): €1,050 per week
N26 (Germany): €880 per week
Klarna (Europe): €500 per week
Mastercard (UK): £470 per week
PayPal (Germany): $350 per week
Wise (UK): £300 per week

Recent engineer forums and community posts add further reference points:

Google – Tier-1 SRE rota (five-minute response): paid for 40 minutes of every on-call hour outside office hours (66% of the base hourly rate). Tier-2 (30-minute response): 20 minutes per hour (33%).
AWS (EU Tier-0 services) – 25% of base pay for each out-of-hours on-call hour, plus a half-day of paid time off for every Saturday or night-time page.
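
The "paid minutes per on-call hour" model can be worked through in a few lines. The tier values come from the forum figures quoted above; the 15 hours and the base rate of 60.0 are made-up inputs for illustration.

```python
# Paid minutes per out-of-hours on-call hour, as quoted above.
PAID_MINUTES_PER_HOUR = {"tier1": 40, "tier2": 20}


def on_call_pay(out_of_hours: float, base_hourly_rate: float, tier: str) -> float:
    """Pay for out-of-hours on-call time under the paid-minutes model."""
    return out_of_hours * PAID_MINUTES_PER_HOUR[tier] * base_hourly_rate / 60


# Hypothetical: 15 out-of-hours on-call hours at a base rate of 60.0 per hour.
print(on_call_pay(15, 60.0, "tier1"))  # 600.0
print(on_call_pay(15, 60.0, "tier2"))  # 300.0
```

Note that 40 of 60 minutes is two thirds of the base rate, which the quoted 66% rounds down.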

Beyond payment: safeguarding on-call well-being

Pay isn’t the only lever that matters. On-call duty disrupts normal sleep patterns and life outside work, so protecting responders’ well-being is critical. Your team will cope far better if you follow these five practices:

Set crystal-clear expectations for response windows and escalation paths.
Rotate shifts fairly with primary + secondary roles, and use an automated on-call schedule so the rota is transparent.
Watch the workload: track pages per engineer and cap consecutive overnights with on-call reports.
Leverage tooling: alert deduplication and smart escalations in ilert’s on-call management cut noise and shorten time-to-sleep.
Provide regular training and support: run quarterly fire drills or game days so responders stay confident under fire.
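
For the fair-rotation practice, a simple round-robin with primary and secondary roles might look like the sketch below. It is only an illustration with made-up names, not how ilert's scheduling works; a real schedule would also account for time zones, holidays, and swaps.

```python
def build_rotation(engineers: list, shifts: int) -> list:
    """Return (primary, secondary) pairs, rotating through the team in order.

    The secondary for one shift becomes the primary for the next, so each
    person gets a shift as backup before carrying the pager.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    n = len(engineers)
    return [(engineers[i % n], engineers[(i + 1) % n]) for i in range(shifts)]


for primary, secondary in build_rotation(["ana", "ben", "cy"], 4):
    print(primary, secondary)
# ana ben
# ben cy
# cy ana
# ana ben
```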

Quick summary

On-call duty in IT means being reachable outside normal hours to respond to incidents, usually remotely. It differs from stand-by service, which requires physical presence and is always counted as working time. Legally, on-call time isn’t always paid; typically only active incident response counts as working time. Compensation varies: some companies offer hourly uplifts or time-off swaps, while others, like Apple or Airbnb, don’t pay for passive standby. In Germany, Bitkom recommends no more than 56 on-call days per year and at least 8 consecutive hours of rest per shift. Weekly stipends range from €800 to €1,050 at firms like Amazon Germany, Zalando, HelloFresh, and SumUp. To protect engineers, best practices include fair rotations, clear escalation paths, tooling to reduce alert noise, and regular training.

Announcements

ilert introduces Agentic Incident Response: Entering the AI-first era

Meet ilert Responder: your AI incident co-pilot that investigates alerts, accelerates incident resolution, and empowers your SRE team without taking control away.

Birol Yildiz

Jun 16, 2025 • 5 min read

Imagine incidents resolved through insights, not manual investigations.

Picture an incident management future where you're never alone during critical alerts. Imagine your best engineer always available, tirelessly investigating issues, analyzing logs, correlating metrics, checking recent code changes, and delivering actionable insights, instantly. Today, ilert is stepping boldly into this future with our first intelligent agent: ilert Responder.

Why AI-first?

Incident management is evolving rapidly. Systems grow complex, alert volumes surge, and pressure on teams intensifies. SREs often find themselves overwhelmed by noise, urgently navigating logs, metrics, and dashboards to uncover root causes.

At ilert, we've been pioneering AI in incident management for over three years, launching intelligent alert grouping, automated post-mortem creation, and more. ilert Responder is not a beginning, but a leap forward, building on years of experience, foundational work, and customer feedback.

We're laser-focused on helping companies significantly reduce Mean Time to Resolution (MTTR). Every decision, every feature we develop revolves around one question: How does this contribute to lowering MTTR? With GenAI and agentic systems, we see transformative potential to contribute to this goal. We’re betting on a future where you're only paged about an incident if AI can't autonomously resolve it first. Imagine no more waking up at 3 a.m. just to restart a service or roll back a deployment.

That’s why we’re committed to becoming an AI-first platform, embedding artificial intelligence at the heart of everything we do. This isn't just adding AI as a feature; it's fundamentally reimagining incident response for the better.

Meet ilert Responder: Your 24/7 incident co-pilot

ilert Responder is your trusted teammate and is built directly into the ilert platform. It:

Connects directly with your observability stack, your cloud infrastructure, and code repository.
Analyzes incidents in real-time using various data sources, pinpointing root causes.
Provides clear, prioritized recommendations for remediation.

Interact seamlessly via a chat-based interface: ask questions, share context, and receive guidance precisely when you need it. Every insight from ilert Responder is clear, actionable, and backed by supporting data, even under pressure.

Under the hood, we’re using the MCP (Model Context Protocol) to connect the ilert Responder agent with your tools and infrastructure. MCP is to AI what HTTP is to the web – a standardized protocol for connecting LLM-based agents to the systems where real data lives. It solves two key challenges: the limited, outdated knowledge and lack of context of LLMs, and the complexity of maintaining custom integration layers between AI apps and external data sources.

With MCP, ilert Responder can securely and contextually interact with tools like Grafana, Prometheus, GitHub and Kubernetes – fetching logs, metrics, deployment data, code changes, and more in real time. We've built a scalable, multi-tenant architecture around MCP that allows us to easily add new data sources (MCP servers), continuously expanding Responder’s investigative capabilities with every integration.
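
For readers who have not seen MCP on the wire: the protocol exchanges JSON-RPC 2.0 messages, and a client asks a server to run a tool with a `tools/call` request. The sketch below builds such a request. The tool name `query_metrics` and its arguments are hypothetical, since each MCP server defines its own tools, and nothing here reflects ilert's internal implementation.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def tool_call_request(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


# Hypothetical tool exposed by a metrics-oriented MCP server.
print(tool_call_request("query_metrics", {"query": "up", "window": "15m"}))
```

How the message is transported, and which tools exist, depends on the MCP server in question.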

See the ilert Responder in action

Introducing Agentic Incident Management

ilert Responder marks the start of what we call Agentic Incident Management. Here, intelligent agents:

Reason and investigate like seasoned engineers.
Learn from each interaction, growing continuously smarter.
Work alongside humans transparently, always with clear oversight.

By default, the new ilert Responder operates in read-only mode and provides you with recommendations for faster resolution. It doesn't replace your on-call team but augments it.

Join the ilert Agentic Incident Response beta program

We're inviting innovative teams to join our Beta program, granting early access to ilert Responder. Beta testers will:

Directly shape future capabilities.
Enjoy early benefits and competitive advantages.
Lead their industries into the future of AI-powered incident management.

Interested? Email us at support@ilert.com and become a pioneer in the AI-first incident response revolution.

AI-first incident management – for everyone

AI features have been an integrated part of ilert for a few years now. With this next step, AI features are no longer reserved for premium plans or add-ons; they're foundational.

We have already introduced some significant changes in our pricing. While in the Beta phase, ilert AI Responder is available at no additional cost. But that's not all. You will notice that we’re discontinuing our AIOps add-on and making AI features such as intelligent alert grouping available in the Scale plan. Even Free ilert customers now have access to ilert AI features.

Soon, all ilert customers will have flexible AI credits to unlock advanced capabilities – from ilert Responder to ilert postmortem creator. Details on our transparent, credit-based pricing model will follow shortly. Stay tuned to our blog and newsletter.

Privacy-first, always

Privacy and security of your data remain the highest priority for us. We champion AI-first to accelerate incident resolution, embedding Privacy First with data sovereignty, end-to-end encryption, region-specific AI-processing, and GDPR compliance. From day one, we’re building intelligent systems that are as protective of confidentiality as they are innovative. ilert AI Responder is built on this basis.

For ilert AI, we use foundational models hosted by AWS, Microsoft Azure, or OpenAI, depending on your location. For EU customers, all AI processing happens within Europe using AWS or Microsoft infrastructure – no data leaves the EU, and no personal or sensitive information is sent to OpenAI’s global endpoints. Customers outside the EU may use OpenAI or AWS, always under strict access controls and encryption at rest and in transit.
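
The region rule described above can be expressed as a small lookup. This is a sketch of the stated policy only; the region keys, provider labels, and function name are assumptions, and the actual selection logic inside ilert is not public in this post.

```python
# Providers permitted per customer region, as stated in the paragraph above.
ALLOWED_PROVIDERS = {
    "eu": ("aws", "azure"),       # processing stays within Europe
    "non_eu": ("openai", "aws"),
}


def pick_provider(region: str, preferred=None) -> str:
    """Return a provider permitted for the region, rejecting any other."""
    allowed = ALLOWED_PROVIDERS[region]
    if preferred is None:
        return allowed[0]
    if preferred not in allowed:
        raise ValueError(f"{preferred!r} is not permitted for region {region!r}")
    return preferred


print(pick_provider("eu"))                # aws
print(pick_provider("non_eu", "openai"))  # openai
```

Asking for `pick_provider("eu", "openai")` raises a `ValueError`, mirroring the statement that EU data is never sent to OpenAI's global endpoints.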

Moreover, we only use alert- and incident-related data and don’t share personal or user-level performance data with external AI models. We also don't use your data to train models and have opted out of data training with all of our LLM providers.

Learn more about our security and privacy commitment in the Q&A section.

We're entering a new era in incident response, one where AI doesn't replace SREs, it elevates them. ilert Responder is just the beginning. The future is collaborative, intelligent, and human-centered. Let's build it together.

The solution for operation teams.