
One platform for alerting, on-call management and status pages.

Manage on-call, respond to incidents and communicate them via status pages using a single application.

Trusted by leading companies

Highlights

The features you need to operate always-on services

Every feature in ilert is built to help you respond to incidents faster and increase uptime.


Harness the power of generative AI

Enhance incident communication and streamline post-mortem creation with ilert AI. ilert AI helps your business respond faster to incidents.

Integrations

Get started immediately using our integrations

ilert seamlessly connects with your tools using our pre-built integrations or via email. ilert integrates with monitoring, ticketing, chat, and collaboration tools.

Customers

What customers are saying about us

We have transformed our incident management process with ilert. Our platform is intuitive, reliable, and has greatly improved our team's response time.

ilert has helped Ingka significantly reduce both MTTR & MTTA over the last 3 years, the collaboration with the team at ilert is what makes the difference. ilert has been top notch to address even the smallest needs from Ingka and have consistently delivered on the product roadmap. This has inspired the confidence of our consumers making us a 'go to' for all on call management & status pages.

Karan Honavar
Engineering Manager at IKEA

ilert is a low maintenance solution, it simply delivers [...] as a result, the mental load has gone.

Tim Dauer
VP Tech

We even recommend ilert to our own customers.

Maximilian Krieg
Leader Of Managed Network & Security

We are using ilert to fix our problems sooner than our customers are realizing them. ilert gives our engineering and operations teams the confidence that we will react in time.

Dr. Robert Zores
Chief Technology Officer

ilert has proven to be a reliable and stable solution. Support for the very minor issues that occurred within seven years has been outstanding and more than 7,000 incidents have been handled via ilert.

Stefan Hierlmeier
Service Delivery Manager

The overall experience is actually absolutely great and I'm very happy that we decided to use this product and your services.

Timo Manuel Junge
Head Of Microsoft Systems & Services

The easy integration of alert sources and the reliability of the alerts convinced us. The app offers our employees an easy way to respond to incidents.

Stephan Mund
ASP Manager
Stay up to date

New from our blog

Insights

Leveraging AI for Efficient On-call Scheduling

This article introduces the use cases of GenAI across the stages of the incident management process, beginning with the preparation stage. It explains how AI can be leveraged for efficient, effective, and accurate on-call scheduling, including examples from ilertAI.

Sirine Karray
Jul 26, 2024 • 5 min read

Introduction

Regardless of industry, creating and maintaining a highly functional incident management process is crucial for organizations of all sizes. Generative AI can significantly enhance the efficiency, accuracy, and speed of incident detection, analysis, and resolution. GenAI can be utilized across all stages of the incident management process, including preparation, response, communication, and learning.

In this article, we start with the preparation stage.

Prepare: Using AI Assistants for On-call Scheduling

Creating an on-call schedule that balances team needs and ensures coverage is crucial for incident management. AI Assistants can streamline this process. By employing AI Assistants, complex scheduling requirements, such as follow-the-sun rotations, become manageable. An intuitive chat interface powered by an LLM can guide users through setting up their schedules, asking relevant questions to understand specific requirements and preferences. This AI-assisted approach simplifies scheduling, making it less time-consuming and more tailored to the unique dynamics of each team.

The AI Assistant engages the user in a conversation to gather necessary details for the schedule. This involves asking about involved team members, rotation types, and on-call coverage. The Assistant's ability to parse natural language enables it to understand and categorize user responses into structured data that can be used in the next steps. The process begins with understanding user inputs and then executing functions to generate the schedule.

Steps for Creating an On-call Schedule

1. Understanding User Inputs:

The Assistant initiates the process by engaging the user in a conversation to gather all the necessary details for creating the schedule. This involves asking about the team members, types of rotations, and on-call coverage. Thanks to its natural language processing abilities, the Assistant can understand and organize the user's responses into structured data for the next steps. The instructions for this conversation are provided to the Assistant.

2. Executing Functions to Generate the Schedule:

After processing and organizing the input data, the Assistant uses the function calling feature to run a custom function specifically designed for schedule creation. This function takes the prepared data and designs the on-call schedule, ensuring all requirements and constraints are satisfied. The end result is a JSON document that represents the finalized on-call schedule.

This use of OpenAI's function calling feature highlights the Assistant's capability to connect conversational input with programmatic output, allowing for complex task automation like schedule creation within a conversational interface.
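The second step can be pictured with a small sketch. It is not ilert's implementation: the function name `build_follow_the_sun_schedule`, the input keys (`name`, `members`, `start_hour_utc`), and the shape of the resulting JSON are all assumptions made for illustration. It only shows the kind of custom function an Assistant could invoke through function calling once the conversation has produced structured data.

```python
import json
from datetime import datetime, timedelta, timezone


def build_follow_the_sun_schedule(regions, start, days):
    """Build a follow-the-sun schedule as a JSON-serializable dict.

    regions: list of dicts with the hypothetical keys "name", "members",
             and "start_hour_utc" (hour at which that region's shift begins).
    start:   timezone-aware datetime at 00:00 UTC of the first day.
    days:    number of days to cover.
    """
    # Order regions by shift start so each shift hands over to the next one.
    ordered = sorted(regions, key=lambda r: r["start_hour_utc"])
    shifts = []
    for day in range(days):
        day_start = start + timedelta(days=day)
        for i, region in enumerate(ordered):
            begin = day_start + timedelta(hours=region["start_hour_utc"])
            nxt = ordered[(i + 1) % len(ordered)]
            # Hours until the next region takes over (wraps past midnight).
            length = (nxt["start_hour_utc"] - region["start_hour_utc"]) % 24 or 24
            # Rotate through the region's members, one person per day.
            user = region["members"][day % len(region["members"])]
            shifts.append({
                "user": user,
                "region": region["name"],
                "start": begin.isoformat(),
                "end": (begin + timedelta(hours=length)).isoformat(),
            })
    return {"name": "Follow-the-sun", "timezone": "UTC", "shifts": shifts}


# Example input, as it might look after the conversation has been parsed.
example_regions = [
    {"name": "APAC", "members": ["chen"], "start_hour_utc": 0},
    {"name": "EMEA", "members": ["alice", "bob"], "start_hour_utc": 8},
    {"name": "AMER", "members": ["dana"], "start_hour_utc": 16},
]
example = build_follow_the_sun_schedule(
    example_regions, datetime(2024, 7, 1, tzinfo=timezone.utc), days=2
)
print(json.dumps(example, indent=2))
```

Keeping the schedule logic in an ordinary function, rather than asking the model to write the JSON itself, means the constraints (continuous coverage, fair rotation) are enforced by code that can be tested.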

Below is a sample conversation with ilert AI to generate a follow-the-sun schedule:

Besides AI-assisted on-call scheduling, LLMs can be leveraged to respond to incidents by reducing noise through intelligent alert grouping, enhancing incident communications, and creating thorough postmortem analyses.

Product

How to Improve Your Service Reliability with ilert Status Pages

Four reasons why ilert status pages are the best alternative.

Daria Yankevich
Jun 27, 2024 • 5 min read

According to the Uptime Institute, over the last year the number of IT incidents slowly declined while the average cost of each incident grew. As dependency on digital services increases, two-thirds of all outages cost more than $100,000. Stakes are rising, and more and more companies are investing in proactive incident management.

Although incidents are unavoidable, organizations can still reduce recovery time, maintain operational stability, and build resilience against future disruptions. Proactive incident management, including the implementation of status pages, ensures that businesses can promptly address issues, provide transparent communication, and mitigate the impact on users.

At ilert, we are consistently enhancing our suite of incident communications features, with status pages being one of the key components. In this blog post, you will learn how to maximize the potential of your ilert status page and significantly enhance this aspect of your incident management.

What is a Status Page?

In case this is your first time here: a status page is a trust-building tool. It provides real-time information about the operational status of a service or system as a whole, serving as a transparent communication channel between a company and its users during service disruptions.

If you are still determining whether your company needs a status page, here are the main reasons ilert customers like IKEA have already implemented one.

  1. Transparency with customers. The historical data on your status page helps your potential clients evaluate your product and service. At the same time, keeping customers informed during incidents can mitigate the negative impact of service disruptions on the company’s reputation and customer satisfaction.
  2. Reduced support load. During an outage, a status page can significantly lower the number of support inquiries, as customers can get real-time updates without contacting support teams.
  3. Historical data and reporting. Status pages include a history of past incidents and resolutions, which can help you analyze patterns, improve service reliability, and report to stakeholders.
  4. Compliance. Many industries require businesses to maintain logs of service availability and incident reports for compliance purposes. For example, ISO/IEC 27001, a standard for information security management systems, requires incident management processes. A status page can help meet these requirements by providing a clear communication channel during incidents.
  5. Accountability. Many businesses have SLAs that mandate timely communication about service status and incidents. A status page is a practical tool to fulfill these contractual obligations.

Automate Everything

Unlike other solutions that require manual updates or operate in isolation, ilert's status pages are built into your incident management platform. This tight integration allows teams to address issues swiftly and communicate effectively with stakeholders, thereby maintaining trust and reducing the impact of outages.

Update status pages automatically. Use alert actions to display a new status immediately when your monitoring tool sends an alert. You can trigger this action either when the alert reaches the platform or once the on-call engineer has accepted it. Below are step-by-step instructions on how to enable this feature.

Delegate incident communication to AI. During an incident response, staying focused on resolving the problem is crucial. Crafting a clear and detailed update for the status page can be difficult, particularly in high-pressure situations. This is where ilert's AI-assisted incident communication proves invaluable. Use ilert AI to write clear, polite, and informative messaging for your status pages. ilert AI can automatically identify which services are affected, so you don't have to update them manually. Find more details in this article.

Easily notify about planned maintenance. Maintenance windows are scheduled periods when systems or services are offline for updates, upgrades, or repairs. Effective communication about these periods is vital to manage user expectations and minimize disruption. With ilert status pages, maintenance is automatically reflected, ensuring users are promptly informed about planned downtime. Here are the instructions on how to enable maintenance.

Status page that is always at hand. You don't have to manually send the status page link to your users and stakeholders. Embed ilert's floating status widget or status badge instead. The status page widget appears only during ongoing incidents or scheduled maintenance and stays hidden when all services function normally. The status badge, by contrast, is always visible.

Align Your Status Page with Your Brand

With ilert, you can customize the layout of your status page to align with your brand guidelines. You can add your logo and favicon, and create service groups to organize related services, making it easier for users to see the overall health of your system. Additionally, you can select your desired layout, such as single or responsive columns. The platform supports custom domains, ensuring your status page fits seamlessly with your web presence. Here is the guide for adjusting the status page according to your brand.

Configure Status Page Visibility

As you probably noticed on the ilert pricing page, there are several options for your status page. Depending on your needs and goals, you can make the page accessible for everyone or limit access by one of the following parameters:

  • users with ilert accounts (including users with a stakeholder role)
  • selected IP addresses or IP address ranges
  • specific emails and email domains

Public status pages can be part of your SLA and provide insights into the service stability for all users on the internet. Private status pages are ideal for organizations that need to communicate service status updates to a specific group of users, such as internal teams or select customers. Private setup is beneficial for maintaining security and confidentiality, especially when dealing with sensitive operational data or managing communication for a high-value client base. 
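To make the IP and email restrictions concrete, here is a minimal sketch of how such an access check could work. It is an illustration only, not ilert's implementation: the `is_allowed` function and the policy keys (`ip_ranges`, `emails`, `domains`) are hypothetical, and the account-based option is left out because it depends on an authentication system.

```python
import ipaddress


def is_allowed(policy, *, ip=None, email=None):
    """Return True if a visitor may view a private status page.

    policy is a hypothetical dict with optional keys:
      "ip_ranges": list of IPs or CIDR ranges, e.g. "203.0.113.0/24"
      "emails":    list of exact email addresses
      "domains":   list of email domains, e.g. "example.com"
    """
    if ip is not None:
        addr = ipaddress.ip_address(ip)
        for cidr in policy.get("ip_ranges", []):
            # strict=False accepts single addresses and ranges with host bits set.
            if addr in ipaddress.ip_network(cidr, strict=False):
                return True
    if email is not None:
        email = email.strip().lower()
        if email in {e.lower() for e in policy.get("emails", [])}:
            return True
        # Only treat the part after the last "@" as a domain.
        _, sep, domain = email.rpartition("@")
        if sep and domain in {d.lower() for d in policy.get("domains", [])}:
            return True
    return False


policy = {
    "ip_ranges": ["203.0.113.0/24"],
    "emails": ["vip@customer.test"],
    "domains": ["example.com"],
}
print(is_allowed(policy, ip="203.0.113.42"))
print(is_allowed(policy, email="someone@other.test"))
```

A visitor passes if any one rule matches, which mirrors the idea of limiting access "by one of the following parameters".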

Subscription Flexibility

ilert status page and the subscription options

The status page widget is helpful, but users and stakeholders can also choose to receive notifications proactively. All ilert status pages, except those on the Free plan, provide various options to keep everyone in the loop. Users can select between email, webhook, and RSS subscriptions by clicking the Subscribe button at the top right corner of the status page. Additionally, to comply with GDPR, ilert automatically sends reminders to those who have chosen email notifications but have not yet confirmed via the double opt-in link. Finally, to reduce complexity and provide only relevant updates, there is an option to subscribe to specific services only. You can manage your subscriber list in the status page settings.

Insights

6 Steps to Create Actionable Postmortems

Best practices for creating effective postmortems, ensuring that your incident analysis won't be forgotten as soon as the danger has passed

Daria Yankevich
Jun 17, 2024 • 5 min read

In DevOps and IT operations, conducting a thorough postmortem after an incident is crucial for continuous improvement. This article explores best practices for creating effective postmortems, ensuring that your incident analysis won't be forgotten as soon as the danger has passed but will be comprehensive and actionable.

What is a Postmortem?

A postmortem in DevOps is a structured process conducted after an incident or failure to analyze what happened, identify the root cause, and implement corrective actions to prevent future occurrences. It involves a detailed examination of the timeline, impact assessment, and lessons learned, fostering a culture of continuous improvement and transparency without assigning blame. The postmortem document is the final output of this process, encapsulating all the gathered information, analyses, and planned actions to be shared with relevant stakeholders.

Benefits of Conducting Postmortems

By fostering a culture focused on learning and improvement through postmortems, organizations can strengthen their infrastructure and incident response processes, making them better prepared for future incidents. The benefits of conducting postmortems include:

  • Improved recovery times.
  • Enhanced team learning and knowledge sharing.
  • Prevention of future incidents.
  • Building a culture of continuous improvement.
ilert interface: create postmortem right from the incident

Postmortem Key Steps

As recommended in ilert's Incident Management Guide, once a major incident is resolved, the incident response lead promptly designates one of the responders to manage the postmortem process.

Step 1: Assign a Postmortem Owner

While creating the postmortem is a collaborative task, assigning a specific owner is essential for ensuring it is completed effectively. The postmortem owner is entrusted with several responsibilities, including:

  • Scheduling the postmortem meeting
  • Investigating the incident (drawing in the necessary expertise from other teams as required)
  • Updating the postmortem document
  • Creating follow-up action items to prevent a similar occurrence in the future.

Step 2: Schedule a Meeting

It's crucial to invite people with relevant experience and expertise, so we highly recommend checking that the following people are on the invite list:

  • The Incident Response Lead
  • Owners of the services involved in the incident
  • Key engineers/responders who were involved in resolving the incident
  • Engineering and Product Managers for the impacted systems

Step 3: Build a Timeline

ilert incident timeline

Document the sequence of events objectively, without interpreting or judging the causes of the incident. The timeline should begin before the incident starts and continue until it is resolved, noting significant changes in status or impact and key actions taken by responders.

Examine the incident log in Slack or Microsoft Teams for critical decisions and actions. Also, include information that the team lacked during the incident but would have been helpful in hindsight. This information can be found in the monitoring data, logs, and deployments of the affected services.
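As a rough illustration of this step, the sketch below merges events from several sources (chat log, deployments, monitoring) into one chronological timeline. The event shape (`time`, `source`, `text`) and both function names are assumptions for the example. In practice the events would come from exports of your chat and monitoring tools.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists from several sources into one chronological list.

    Each event is a hypothetical dict:
    {"time": ISO 8601 string with UTC offset, "source": str, "text": str}
    """
    events = [event for source in sources for event in source]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(events, key=lambda e: datetime.fromisoformat(e["time"]))


def format_timeline(events):
    """Render one line per event: 'HH:MM  [source] text'."""
    return "\n".join(
        f"{datetime.fromisoformat(e['time']):%H:%M}  [{e['source']}] {e['text']}"
        for e in events
    )


chat = [
    {"time": "2024-06-17T10:05:00+00:00", "source": "chat", "text": "Incident declared"},
    {"time": "2024-06-17T10:40:00+00:00", "source": "chat", "text": "Rollback completed"},
]
deployments = [
    {"time": "2024-06-17T09:58:00+00:00", "source": "deploy", "text": "Release deployed"},
]
monitoring = [
    {"time": "2024-06-17T10:02:00+00:00", "source": "monitoring", "text": "API error rate alert"},
]

print(format_timeline(build_timeline(chat, deployments, monitoring)))
```

Including the deployment that happened before the first alert reflects the advice to start the timeline before the incident itself.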

Step 4: Document the Impact

Capture the incident impact from various angles. Note the duration of the observable impact, the total number of affected customers, how many reported the issue, and the severity of the functional disruption. Measure the impact using a business metric relevant to your product, such as the increase in API errors, performance slowdowns, or delays in notification delivery. If applicable, compile a list of all affected customers and share it with your support team for follow-up actions. Including any customer feedback or complaints received during the incident would also be helpful and provide context on user experience.
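The figures mentioned above can be collected in a small summary structure. The following sketch is one possible way to do that. The function name, its parameters, and the choice of API error rate as the business metric are assumptions for illustration, and the numbers in the example are made up.

```python
from datetime import datetime


def summarize_impact(started, resolved, affected_customers,
                     reporting_customers, errors, requests):
    """Summarize incident impact from a few raw inputs.

    started, resolved: ISO 8601 timestamps of the observable impact.
    affected_customers, reporting_customers: iterables of customer IDs.
    errors, requests: API error and request counts during the incident.
    """
    duration = datetime.fromisoformat(resolved) - datetime.fromisoformat(started)
    affected = set(affected_customers)
    # Count only reporters that are also in the affected set.
    reporters = set(reporting_customers) & affected
    return {
        "duration_minutes": round(duration.total_seconds() / 60),
        "affected_customers": len(affected),
        "reporting_customers": len(reporters),
        # Share of affected customers who actually reported the issue.
        "reported_share": len(reporters) / len(affected) if affected else 0.0,
        "error_rate": errors / requests if requests else 0.0,
    }


summary = summarize_impact(
    started="2024-06-17T10:00:00+00:00",
    resolved="2024-06-17T10:45:00+00:00",
    affected_customers=["acme", "globex", "initech", "umbrella"],
    reporting_customers=["globex"],
    errors=50,
    requests=1000,
)
print(summary)
```

The sorted list of affected customer IDs is also what you would hand to the support team for follow-up.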

Step 5: Root Cause Analysis

After thoroughly understanding the incident's timeline and impact, proceed to the Root Cause Analysis (RCA) to explore the contributing factors, recognizing that complex systems often fail due to a combination of interacting elements rather than a single cause.

Begin by reviewing the monitoring data of affected services, looking for irregularities such as sudden spikes or flatlining around the time of the incident. Include relevant queries, commands, graphs, or links from monitoring tools to illustrate the data collection process. If monitoring for this service is lacking, list the development of such monitoring as an action item in your postmortem.

Next, identify the underlying causes by examining why the system's design allowed the incident, investigating past design decisions, and determining if they were part of a larger trend or a specific issue. Evaluate the processes, considering if collaboration, communication, and work reviews contributed to the incident, and use this stage to improve the incident response process.

Summarize your findings in the postmortem, ensuring thorough documentation for a productive discussion during the postmortem meeting while remaining open to additional insights that may emerge.

Generating postmortem using ilert AI

Step 6: Prepare Action Items

Now, it's crucial to determine steps to prevent similar issues in the future. While it may not always be feasible to completely eliminate the possibility of such incidents, focus on improving detection and mitigation measures for future events. This involves enhancing monitoring and alerting systems and developing strategies to reduce the severity or duration of incidents.

Create tickets for all proposed actions in your task management tool, ensuring each ticket includes sufficient context and a proposed direction. This will help the product owner prioritize the task and enable the assignee to carry it out efficiently. Each action item should be specific and actionable.

If any proposed actions require further discussion, add them to the postmortem meeting agenda. These could be proposals needing team validation or clarification. Discussing these items in the meeting will help determine the best course of action.
