AI-first technology for modern teams with fast response times
ilert is the AI-first incident management platform designed from the ground up as a single application and covers the entire incident response lifecycle.
Share your scheduling needs in a simple, chat-like interface. Add team members, rotation rules, and timeframes — and get a ready-to-use on-call calendar everyone can access.
Let AI take the call
Introducing the ilert AI Voice Agent—your first responder for calls, gathering key details and informing your on-call engineers.
Status updates in no time
ilert AI analyzes your system and incidents, offering quick updates and managing communications for efficient issue resolution.
ilert Responder – your real-time incident advisor
ilert Responder is an intelligent agent that analyzes incidents in real time. It connects to your observability stack, investigates alerts across systems, and surfaces actionable insights, without taking control away from your team.
Features
Analyze logs, metrics, and recent changes autonomously
Identify root causes and similar past incidents
Suggest responders, rollback paths, or related services
Ask questions in natural language and get direct, evidence-backed answers
Integrations
Get started immediately using our integrations
ilert seamlessly connects with your tools using our pre-built integrations or via email. ilert integrates with monitoring, ticketing, chat, and collaboration tools.
See how industry leaders achieve 99.9% uptime with ilert
Organizations worldwide trust ilert to streamline incident management, enhance reliability, and minimize downtime. Read what our customers have to say about their experience with our platform.
AI already transforms how we detect, respond to, and resolve outages. Traditional workflows often force responders to switch between dashboards, sift through logs, and coordinate across fragmented channels under stress. This reactive, manual approach leads to slower resolution, higher operational costs, and burnout, especially as IT systems grow more complex.
At ilert, we are not just discussing the future of incident management – we are actively building it. We have brought agentic incident response into production, enabling operational excellence while reducing manual toil and cognitive load for on-call teams. Here is how we made this vision a reality.
Building the foundation: Hive and the ilert AI voice agent
Our journey into agentic incident response began with architectural decisions prioritising flexibility, scalability, and intelligent action across all stages of the incident lifecycle.
Hive: Our LLM orchestration layer
Hive is our proprietary proxy and orchestration layer for large language models (LLMs). It powers intelligent incident summaries, contextual recommendations, and advanced workflows across ilert, enabling us to manage multiple model providers, optimise workload routing, and ensure a secure, consistent, and high-performance AI backbone for all use cases.
Hive allows us to seamlessly integrate new LLMs as they emerge, control cost efficiency by routing tasks to the best-fit model, and maintain data privacy while delivering highly contextual intelligence in real time.
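To make the idea of routing tasks to a best-fit model concrete, here is a minimal sketch. It is not Hive's actual implementation: the routing table, provider labels, model labels, and cost figures are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRoute:
    provider: str       # hypothetical provider label
    model: str          # hypothetical model label
    cost_per_1k: float  # illustrative cost figure, not real pricing


# Hypothetical routing table: task type -> candidates, cheapest suitable first.
ROUTES = {
    "summary": [
        ModelRoute("provider-a", "small", 0.2),
        ModelRoute("provider-b", "large", 2.0),
    ],
    "root_cause": [ModelRoute("provider-b", "large", 2.0)],
}


def pick_route(task, unavailable=frozenset()):
    """Return the first candidate whose provider is currently available."""
    for route in ROUTES.get(task, []):
        if route.provider not in unavailable:
            return route
    raise LookupError(f"no available model for task {task!r}")
```

Ordering candidates by cost means a cheap model handles routine summaries, while an outage at one provider falls through to the next candidate instead of failing the request.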
AI voice agent for seamless responder interaction
Communication is critical during incidents, especially when responders need to act without being tethered to keyboards. Our AI voice agent enables responders to gather updates or report incidents verbally, integrating into existing call flows as a natural part of the process. It transforms voice interactions into structured, actionable alerts while synthesising updates from diverse data sources, bridging human intuition with automated data-driven action.
What is MCP (Model Context Protocol)?
The Model Context Protocol (MCP) is an open protocol introduced by Anthropic. At ilert, it connects your data to the ilert Responder, providing the rich, structured context our agents need to act intelligently during incidents.
Why did we adopt MCP?
Traditional integrations often leave systems disconnected, requiring manual correlation across telemetry, logs, and infrastructure state during incidents. MCP was designed to eliminate these silos by automatically aggregating, structuring, and transmitting incident-relevant context in real time.
How does MCP work?
MCP gathers data from your monitoring systems, log aggregators, deployment pipelines, and infrastructure environments, processes it within a secure, EU-compliant, multi-tenant architecture, and delivers only the necessary data to our agentic responders. By doing so, MCP:
Ensures your agent has real-time, granular awareness of incidents;
Maintains strict data security, isolation, and compliance;
Reduces manual correlation and cognitive load during critical moments;
Enables low-latency, context-rich interactions with the ilert Responder.
Think of MCP as the neural network that links your observability stack, code repositories, and infrastructure directly to our AI systems, ensuring that decisions and suggestions are always contextually accurate, actionable, and relevant.
The ilert Responder pipeline: From alert to agent-proposed actions
We designed an end-to-end pipeline that transforms monitoring signals into intelligent, actionable workflows to accelerate incident resolution.
Event Flow → Alert
ilert Event Flow ingests monitoring signals and applies your rules and thresholds to trigger alerts when specific conditions are met. This ensures the right teams are notified the moment an incident requires attention, without unnecessary noise.
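As a rough illustration of how a threshold rule can fire without reacting to every momentary spike, consider the sketch below. The `Rule` fields and the "N consecutive samples" condition are assumptions made for this example, not the actual Event Flow rule model.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    threshold: float
    for_samples: int  # consecutive breaching samples required (assumed >= 1)


def should_alert(rule, samples):
    """True if the most recent `for_samples` values all exceed the threshold."""
    if len(samples) < rule.for_samples:
        return False
    recent = samples[-rule.for_samples:]
    return all(value > rule.threshold for value in recent)
```

Requiring several consecutive breaches is one simple way to cut noise: a single spike does not page anyone, but a sustained breach does.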
MCP (Model Context Protocol) comes into play
Immediately upon alert generation, MCP retrieves and structures relevant telemetry data, logs, recent deployment changes, and infrastructure status, delivering it securely to the ilert Responder. This ensures the Responder has comprehensive situational awareness, eliminating the manual task of gathering context during incidents. This is possible through context-aware integrations with:
Observability tools: To pull telemetry and time-series data from Prometheus and Grafana;
Code repositories: To access commit history and deployment metadata from GitHub;
Infrastructure environments: To gain real-time status and configurations from Kubernetes.
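The context-gathering step described above can be sketched as a concurrent fan-out to several sources. The fetcher functions here are stubs with hypothetical names and return shapes; a real setup would call the respective tools through MCP servers.

```python
import asyncio


async def fetch_metrics(service):
    # Stub standing in for a Prometheus/Grafana query.
    return {"source": "prometheus", "service": service, "series": []}


async def fetch_commits(service):
    # Stub standing in for a GitHub commit and deployment lookup.
    return {"source": "github", "service": service, "commits": []}


async def fetch_cluster_state(service):
    # Stub that simulates an unreachable Kubernetes API.
    raise TimeoutError("cluster API did not answer")


async def build_context(alert):
    """Query all sources concurrently; record failures instead of aborting."""
    service = alert["service"]
    fetchers = {
        "metrics": fetch_metrics,
        "changes": fetch_commits,
        "infrastructure": fetch_cluster_state,
    }
    results = await asyncio.gather(
        *(fetch(service) for fetch in fetchers.values()),
        return_exceptions=True,
    )
    context = {"alert": alert, "missing": []}
    for name, result in zip(fetchers, results):
        if isinstance(result, Exception):
            context["missing"].append(name)
        else:
            context[name] = result
    return context
```

Tracking which sources failed matters during an incident, because the system being investigated may be the same one that is unreachable. The agent can then state what it could not see rather than silently reasoning from partial data.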
ilert Responder proposes actions
The ilert Responder ingests and correlates data in real time, becoming an intelligent participant in incident response rather than a passive notification system. Leveraging its deep, contextual understanding, the ilert Responder formulates actionable recommendations such as:
Root-cause suggestions,
Step-by-step remediation instructions,
Escalation paths and dependency insights.
These are presented within the ilert chat interface, allowing responders to review, approve, or modify actions for safe execution during live incidents. The interactive chat UI evolves into a command centre, enabling responders to:
Request deeper insights dynamically,
Perform direct actions like scaling Kubernetes pods,
Drill down into suggested root causes and metrics seamlessly.
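The review, approve, or modify flow described above could look roughly like the following. This is a sketch under assumptions: the class, status names, and the `runner` callable are hypothetical and do not reflect ilert's internal API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    EXECUTED = "executed"


@dataclass
class ProposedAction:
    description: str
    params: dict
    status: Status = Status.PROPOSED
    audit: list = field(default_factory=list)

    def approve(self, responder, **overrides):
        """A human approves the action, optionally modifying parameters."""
        self.params.update(overrides)
        self.status = Status.APPROVED
        self.audit.append(f"approved by {responder}")

    def execute(self, runner):
        """Run the action only after approval; `runner` performs the change."""
        if self.status is not Status.APPROVED:
            raise PermissionError("action must be approved before execution")
        result = runner(**self.params)
        self.status = Status.EXECUTED
        self.audit.append("executed")
        return result
```

For example, an agent might propose scaling a deployment to 6 replicas, and the responder approves it with `approve("alice", replicas=4)`. The audit list then records who changed what before anything ran.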
Operational improvements
Agentic incident response at ilert is delivering tangible results for engineering and operations teams:
Real-time log correlation and root cause inference to pinpoint likely causes within moments;
Diagnostic summaries providing human-readable, actionable overviews of incidents;
Interactive natural language Q&A with the agent for fast data retrieval and contextual clarity;
Actionable remediation proposals with direct, safe execution workflows;
Automated post-mortems and timelines to reduce manual documentation effort post-incident.
By reducing manual toil and accelerating clarity, teams are spending less time managing incidents and more time focusing on delivering reliable services.
Key learnings and best practices
Building and operating agentic systems for mission-critical incident management at ilert has taught us:
Trust through transparency: Autonomous data collection, correlation, and safe, pre-approved actions happen without manual steps, ensuring speed and reducing cognitive load for responders. For actions with higher risk or business impact, teams can choose to add approval steps if desired. Full transparency into what the agent is doing and why builds trust, enabling responders to understand and oversee agentic actions without slowing down resolution.
Guarding against hallucinations: Rich, structured, and verified context via MCP ensures the agent works with coherent, reliable information, significantly reducing the risk of inaccurate suggestions or actions.
Performance tuning for low latency: Incident response is time-critical. Through speculative tool calls and optimised data pathways, we ensure that insights and actions are generated in near real-time, reducing MTTR when every second counts.
Continuous learning: Feedback loops integrated into workflows help our agent refine its recommendations and actions over time, improving accuracy and effectiveness with every incident.
Safe autonomous execution: By defining safe, controlled scopes for automated remediation, the agent can execute corrective actions independently where appropriate, accelerating resolution while retaining operational safety and rollback capabilities.
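One possible reading of the speculative tool calls mentioned above is to start likely read-only lookups before the model has decided which one it needs, then reuse the result if the guess was right. The sketch below illustrates that interpretation only; the function names are hypothetical, and `asyncio.sleep` stands in for network latency.

```python
import asyncio


async def call_tool(name, calls):
    calls.append(name)         # record invocations, for illustration only
    await asyncio.sleep(0.01)  # stands in for network latency
    return f"{name}-result"


async def run_with_speculation(likely_tools, requested_tool, calls):
    """Start likely tool calls early; reuse one if the model asks for it."""
    speculative = {
        tool: asyncio.create_task(call_tool(tool, calls))
        for tool in likely_tools
    }
    try:
        if requested_tool in speculative:
            # Hit: the call is already in flight, so we only wait for the rest.
            return await speculative[requested_tool]
        # Miss: fall back to a normal call.
        return await call_tool(requested_tool, calls)
    finally:
        # Drop speculative work that turned out to be unnecessary.
        for tool, task in speculative.items():
            if tool != requested_tool:
                task.cancel()
```

This pattern is only appropriate for side-effect-free tools such as log or metric queries. Anything that changes state should never be started speculatively.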
Conclusion: Agentic incident response is already here
At ilert, we believe that the era of manual, reactive incident management is ending, and the benefits of agentic automation are too significant to delay. We are proud to bring these advanced capabilities into production, reducing toil, cutting MTTR, and empowering teams to focus on what matters most: reliability and innovation.
While ilert Responder already automates data gathering, analysis, and remediation suggestions, this release is just the first milestone. Our next goal is to let ilert Responder resolve well-understood, low-risk incidents – like flaky health checks or transient latency spikes – entirely on its own. Human responders stay in control, but much of the routine toil will fade away.
Imagine it’s 2 AM and a critical system flatlines without warning. A bleary-eyed on-call engineer scrambles to restore service, shielding customers from a major outage that could torpedo your next Service Level Objective (SLO) review. Yet when daylight returns, debates over fair on-call compensation start all over again: What’s “just” pay for sleepless nights, unpredictable pings, and rapid-fire incident responses?
What counts as on-call?
On-call is a special working hour arrangement under employment law. It comes into effect when the employee is obliged to be contactable at least by phone, so they can start work in an emergency. On-call duty is generally counted as time specifically meant for work purposes.
In practice, this means that employees are normally not allowed to work while on call. However, there may be exceptions. For example, on-call employees may also work from home if they can be reached through their work device.
What's the difference between on-call and stand-by service?
There’s a time-and-location gap between the two models:
On-call – employees stay reachable (phone, pager, or on-call management app) and can log in from anywhere when an alert fires.
Stand-by – staff must be physically present on site and ready to act immediately. German labour law labels this Bereitschaftsdienst as working time and treats it accordingly.
In IT operations, remote on-call service is usually preferred because most incidents (code rollbacks, config tweaks) can be resolved over VPN. Stand-by still matters for latency-critical environments, for example, trading platforms or industrial control systems, where a technician must monitor hardware and intervene within seconds to meet strict service-level agreements.
Are on-call hours the same as work hours?
Whether on-call duty counts as working hours isn’t as clear-cut as it looks. Under most labour-law frameworks – including Occupational Safety and Health guidance and the U.S. FLSA Fact Sheet #22 – passive on-call time is treated as rest time as long as no alert comes in. The moment you’re paged and start troubleshooting, those minutes flip to active working time. In borderline cases, courts (e.g., Germany’s BAG, Oct 2023 ruling 6 AZR 210/22) decide which periods qualify, so definitions often vary by jurisdiction and company policy.
There’s also no universal rule on pay. Many employers treat on-call duty as billable work and compensate it accordingly; others classify passive standby as unpaid availability. If your firm uses the latter model, remember you won’t be reimbursed for simply being reachable.
Bottom line: on-call time isn’t always the same as working time – it hinges on the organisation’s compensation policy. Some U.S. big-tech companies (Airbnb, Apple, Netflix) don’t pay for passive standby, while many European tech firms do.
On-call duty times
On-call scheduling is usually confined to specific nights or weekends agreed in advance and written into the employment contract. Because fewer staff are on site during those hours, reliable night and weekend coverage is essential.
In Germany, the ICT trade group Bitkom recommends capping on-call assignments at 56 days per calendar year and guaranteeing at least 8 consecutive hours of rest per shift (see Bitkom’s guideline on Rufbereitschaft im IT-Betrieb). On-call duty is generally classified as non-working time, so the usual 11-hour rest break required by §5 (1) of the Arbeitszeitgesetz does not apply until the engineer has actively worked on an incident.
Need an easy way to keep those limits visible? ilert’s on-call scheduling shows every planned rotation and actual shifts at a glance, so teams stay compliant without spreadsheets.
How is payment settled for on-call service in IT companies?
In IT companies, on-call hours are usually considered working time and are paid as such. As mentioned above, be sure to clarify this with your employer in advance to check what is stated in your contract.
For large corporations like Airbnb or Apple, which do not pay for on-call time, the argument is that their employees are already among the top earners. This means that their employees still earn much more than they would at most companies that pay on-call time in addition to their salary.
In Germany, there is no specific law regarding how on-call hours should be paid. This is, therefore, left up to the employer’s discretion. Most commonly, however, on-call duty is treated as paid working time, i.e., the employee receives payment for the time they are on call. This can be structured in different ways.
In practice, on-call time is often compensated either on top of the standard hourly wage or with time off. In many companies, on-call time is also counted as working time and is paid for accordingly. However, this is only possible if the employee is working rather than being only available by phone. As already mentioned, this would be the case while working from home.
In most tech organisations, hours spent on-call count as paid working time, yet the formula changes from company to company. Before you join a rota, double-check your contract or the internal on-call compensation policy.
In practice, you’ll see two common models:
Hourly uplift
A percentage on top of the standard rate for every scheduled standby hour.
Time-off swap
Eight hours on-call earn four hours of paid leave.
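The two models above reduce to simple arithmetic. The sketch below assumes that the uplift is the only payment for standby hours (it computes the uplift portion alone) and uses the 8-to-4 ratio from the example; the rate and percentage in the usage note are made-up inputs.

```python
def hourly_uplift_pay(standby_hours, hourly_rate, uplift_pct):
    """Uplift earned for scheduled standby hours (uplift portion only)."""
    return standby_hours * hourly_rate * uplift_pct / 100


def time_off_earned(on_call_hours, ratio=0.5):
    """Paid leave earned; the default ratio matches 8 h on call -> 4 h leave."""
    return on_call_hours * ratio
```

With a hypothetical €40 hourly rate and a 25% uplift, ten standby hours yield `hourly_uplift_pay(10, 40.0, 25)`, i.e. €100, while `time_off_earned(8)` gives four hours of leave.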
Remember, only the minutes you actively work are universally classed as working time; simply being reachable may stay unpaid unless your company’s policy says otherwise.
How are on-call services paid in IT companies?
Pay still varies by company size, sector, and risk profile. The federal collective agreement for public employees (TVöD) specifies the following allowances in § 8 Abs. 3:
Stand-by shifts of 12 hours or longer
Weekdays (Mon–Fri): paid at 2× the hourly rate for the entire day.
Weekends and public holidays: paid at 4× the hourly rate for the entire day.
Shorter stand-by windows (under 12 h)
Earn an additional 12.5% of your hourly rate for each hour on call.
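As a worked example, the allowance rules quoted above can be written as a small function. It uses only the figures stated in this article; the function name and the €30 hourly rate in the usage note are illustrative assumptions, and the actual agreement text should be consulted for real payroll calculations.

```python
def standby_allowance(hours, hourly_rate, weekend_or_holiday):
    """Allowance for one stand-by shift, per the figures quoted above."""
    if hours >= 12:
        # Flat daily allowance: 2x the hourly rate on weekdays, 4x otherwise.
        multiplier = 4 if weekend_or_holiday else 2
        return multiplier * hourly_rate
    # Shorter windows: 12.5% of the hourly rate per stand-by hour.
    return hours * hourly_rate * 0.125
```

At a hypothetical €30 hourly rate, a 12-hour weekday shift pays €60, a 24-hour Sunday shift pays €120, and an 8-hour window pays €30.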
For work in a large corporation or a successful start-up, you can expect to earn about €1,000 per week. At Zalando, the on-call compensation is roughly €1,050; at the start-up HelloFresh, €1,000; and at Amazon Germany, about €800. Several companies in the financial sector offer comparable rates, although exact amounts vary. Here are the stats provided by the Pragmatic Engineer blog:
SumUp (Germany): €1,050 per week
N26 (Germany): €880 per week
Klarna (Europe): €500 per week
Mastercard (UK): £470 per week
PayPal (Germany): $350 per week
Wise (UK): £300 per week
Recent engineer forums and community posts add further reference points:
Google – Tier-1 SRE rota (five-minute response): paid for 40 minutes of every on-call hour outside office hours (66% of the base hourly rate). Tier-2 (30-minute response): 20 minutes per hour (33%).
AWS (EU Tier-0 services) – 25% of base pay for each out-of-hours on-call hour, plus a half-day of paid time off for every Saturday or night-time page.
Beyond payment: safeguarding on-call well-being
Pay isn’t the only lever that matters. On-call duty disrupts normal sleep patterns and life outside work, so protecting responders’ well-being is critical. Your team will cope far better if you follow these five practices:
Set crystal-clear expectations for response windows and escalation paths.
Rotate shifts fairly with primary + secondary roles, and use an automated on-call schedule so the rota is transparent.
Watch the workload: track pages per engineer and cap consecutive overnights with on-call reports.
Leverage tooling: alert deduplication and smart escalations in ilert’s on-call management cut noise and shorten time-to-sleep.
Provide regular training and support: run quarterly fire-drills or gamedays so responders stay confident under fire.
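The workload metrics mentioned in the list above, pages per engineer and consecutive overnights, are easy to compute from exported shift data. The sketch below assumes a simple input shape (tuples of engineer and page ID, and a list of dates); it is not tied to any particular report format.

```python
from collections import Counter
from datetime import timedelta


def pages_per_engineer(pages):
    """Count pages per engineer from (engineer, page_id) tuples."""
    return Counter(engineer for engineer, _ in pages)


def longest_overnight_streak(nights):
    """Longest run of consecutive calendar dates in `nights`."""
    longest = current = 0
    previous = None
    for night in sorted(set(nights)):
        if previous is not None and night - previous == timedelta(days=1):
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = night
    return longest
```

Comparing the streak against a team-defined cap, for instance three nights in a row, gives a simple signal for when to swap someone out of the rota.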
Quick summary
On-call duty in IT means being reachable outside normal hours to respond to incidents, usually remotely. It differs from standby service, which requires physical presence and is always counted as working time. Legally, on-call time isn’t always paid; typically, only active incident response counts as working time. Compensation varies: some companies offer hourly uplifts or time-off swaps, while others, like Apple or Airbnb, don’t pay for passive standby. In Germany, Bitkom recommends no more than 56 on-call days per year, with at least 8 consecutive hours of rest per shift. Weekly stipends range from €800 to €1,050 at firms like Zalando, HelloFresh, and SumUp. To protect engineers, best practices include fair rotations, clear escalation paths, tooling to reduce alert noise, and regular training.
Imagine incidents resolved through insights, not manual investigations.
Picture an incident management future where you're never alone during critical alerts. Imagine your best engineer always available, tirelessly investigating issues, analyzing logs, correlating metrics, checking recent code changes, and delivering actionable insights, instantly. Today, ilert is stepping boldly into this future with our first intelligent agent: ilert Responder.
Why AI-first?
Incident management is evolving rapidly. Systems grow complex, alert volumes surge, and pressure on teams intensifies. SREs often find themselves overwhelmed by noise, urgently navigating logs, metrics, and dashboards to uncover root causes.
At ilert, we've been pioneering AI in incident management for over three years, launching intelligent alert grouping, automated post-mortem creation, and more. ilert Responder is not a beginning, but a leap forward, building on years of experience, foundational work, and customer feedback.
We're laser-focused on helping companies significantly reduce Mean Time to Resolution (MTTR). Every decision, every feature we develop revolves around one question: How does this contribute to lowering MTTR? With GenAI and agentic systems, we see transformative potential to contribute to this goal. We’re betting on a future where you're only paged about an incident if AI can't autonomously resolve it first. Imagine no more waking up at 3 a.m. just to restart a service or roll back a deployment.
That’s why we’re committed to becoming an AI-first platform, embedding artificial intelligence at the heart of everything we do. This isn't just adding AI as a feature; it's fundamentally reimagining incident response for the better.
Meet ilert Responder: Your 24/7 incident co-pilot
ilert Responder is your trusted teammate and is built directly into the ilert platform. It:
Connects directly with your observability stack, your cloud infrastructure, and code repository.
Analyzes incidents in real-time using various data sources, pinpointing root causes.
Provides clear, prioritized recommendations for remediation.
Interact seamlessly via a chat-based interface: ask questions, share context, and receive guidance precisely when you need it. Every insight from ilert Responder is clear, actionable, and backed by supporting data, even under pressure.
Under the hood, we’re using the MCP (Model Context Protocol) to connect the ilert Responder agent with your tools and infrastructure. MCP is to AI what HTTP is to the web – a standardized protocol for connecting LLM-based agents to the systems where real data lives. It solves two key challenges: the limited, outdated knowledge and lack of context of LLMs, and the complexity of maintaining custom integration layers between AI apps and external data sources.
With MCP, ilert Responder can securely and contextually interact with tools like Grafana, Prometheus, GitHub and Kubernetes – fetching logs, metrics, deployment data, code changes, and more in real time. We've built a scalable, multi-tenant architecture around MCP that allows us to easily add new data sources (MCP servers), continuously expanding Responder’s investigative capabilities with every integration.
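For readers who have not seen MCP on the wire: it is built on JSON-RPC 2.0, and a client asks a server to run a tool with a `tools/call` request. The sketch below builds such a message. The tool name `query_metrics` and its arguments are hypothetical, since each MCP server defines its own tools.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call_request(tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request as used by MCP clients."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool exposed by a metrics-oriented MCP server.
message = tool_call_request("query_metrics", {"query": "up", "range": "15m"})
wire_payload = json.dumps(message)
```

Because every data source speaks the same request shape, adding a new integration means registering another server rather than writing a new client.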
Introducing Agentic Incident Management
ilert Responder marks the start of what we call Agentic Incident Management. Here, intelligent agents:
Reason and investigate like seasoned engineers.
Learn from each interaction, growing continuously smarter.
Work alongside humans transparently, always with clear oversight.
By default, the new ilert Responder operates in read-only mode and provides you with recommendations for faster resolution. It doesn't replace your on-call team but augments it.
Join the ilert Agentic Incident Response beta program
We're inviting innovative teams to join our Beta program, granting early access to ilert Responder. Beta testers will:
Directly shape future capabilities.
Enjoy early benefits and competitive advantages.
Lead their industries into the future of AI-powered incident management.
Interested? Email us at support@ilert.com and become a pioneer in the AI-first incident response revolution.
AI-first incident management – for everyone
AI features have been an integrated part of ilert for a few years now. With this next step, AI features are no longer reserved for premium plans or add-ons; they're foundational.
We have already introduced some significant changes in our pricing. While in the Beta phase, ilert AI Responder is available at no additional cost. But that's not all. You will notice that we’re discontinuing our AIOps add-on and making AI features such as intelligent alert grouping available in the Scale plan. Even Free ilert customers now have access to ilert AI features.
Soon, all ilert customers will have flexible AI credits to unlock advanced capabilities – from ilert Responder to ilert postmortem creator. Details on our transparent, credit-based pricing model will follow shortly. Stay tuned to our blog and newsletter.
Privacy-first, always
Privacy and security of your data remain the highest priority for us. We champion AI-first to accelerate incident resolution, embedding Privacy First with data sovereignty, end-to-end encryption, region-specific AI-processing, and GDPR compliance. From day one, we’re building intelligent systems that are as protective of confidentiality as they are innovative. ilert AI Responder is built on this basis.
For ilert AI, we use foundational models hosted by AWS, Microsoft Azure, or OpenAI, depending on your location. For EU customers, all AI processing happens within Europe using AWS or Microsoft infrastructure – no data leaves the EU, and no personal or sensitive information is sent to OpenAI’s global endpoints. Customers outside the EU may use OpenAI or AWS, always under strict access controls and encryption at rest and in transit.
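The region rule above can be expressed as a small policy check. This is a sketch, not ilert's implementation: the policy table mirrors the providers named in the paragraph, while the function name and the `"provider-default"` label for non-EU processing are assumptions.

```python
# Providers permitted per customer group, as described in the paragraph above.
ALLOWED_PROVIDERS = {
    "eu": {"aws", "azure"},
    "non_eu": {"openai", "aws"},
}


def resolve_processing(customer_region, provider):
    """Validate a provider choice and return where processing may happen."""
    policy = "eu" if customer_region.strip().upper() == "EU" else "non_eu"
    if provider not in ALLOWED_PROVIDERS[policy]:
        raise ValueError(f"{provider!r} is not permitted for {policy} customers")
    # EU data stays in Europe; the non-EU label is a placeholder assumption.
    region = "eu" if policy == "eu" else "provider-default"
    return {"provider": provider, "processing_region": region}
```

Failing closed with an exception, rather than silently picking another provider, keeps a misconfiguration from sending EU data to an endpoint outside Europe.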
Moreover, we only use alert- and incident-related data and don’t share personal or user-level performance data with external AI models. We also don't use your data to train models and have opted out of data training with all of our LLM providers.
We're entering a new era in incident response, one where AI doesn't replace SREs, it elevates them. ilert Responder is just the beginning. The future is collaborative, intelligent, and human-centered. Let's build it together.