Generative AI & Incident Management Guide

Learn how to leverage GenAI to enhance and upgrade your Incident Management Processes. Explore concrete examples across the Incident Response Cycle, from Preparation all the way to Resolution.

Download

Leveraging Generative AI to Transform Incident Management

Incident management and incident response is essential for companies in every industry. The introduction of Generative AI into this realm promises to enhance the efficiency, accuracy, and speed of detecting, analyzing, and resolving incidents. This guide explores how Large Language Models (LLMs) and Generative AI can improve the incident management processes. We will highlight practical examples by showcasing the GenAI capabilities of ilert and describe how we’re building AI features.

‍

This guide is organized around the incident response lifecycle's four crucial phases, illustrating examples how GenAI can be leveraged at each step:

Prepare

AI Assistants have changed how on-call schedules are made, making it easier to manage complex schedules and meet team needs. This is a big step towards smart, assisted scheduling tailored for on-call teams.
‍

Respond

Text embedding models revolutionize alert management by minimizing unnecessary alerts, helping teams focus on real issues. They deeply analyze alert meanings, efficiently sorting and identifying duplicates for a more streamlined process.

Communicate
‍
AI now automatically creates clear and brief updates during incidents. This improvement helps maintain consistent communication, frees up engineers to fix problems more efficiently, and improves the experience for everyone involved.

Learn

AI excels after incidents occur, helping create in-depth and precise incident post-mortems. This automation speeds up the process for organizations to learn and enhance their response strategies.

‍

AI for Incident Response

AI Assistant for On-call Scheduling

Creating an on-call schedule that balances team needs and ensures coverage is crucial for incident management. AI Assistants can streamline this process. By employing AI Assistants, complex scheduling requirements, such as follow-the-sun rotations, become manageable. An intuitive chat interface powered by an LLM can guide users through setting up their schedules, asking relevant questions to understand specific requirements and preferences. This AI-assisted approach simplifies scheduling, making it less time-consuming and more tailored to the unique dynamics of each team.

‍

We’re leveraging OpenAI's function calling feature within the Assistant, which blends conversational AI capabilities and programmatic function execution. This feature allows the Assistant to not only understand and process user inputs through natural language but also to execute functions based on this input, generating structured outputs such as JSON documents. Here's a breakdown of how we have utilized OpenAI Assistants API together with function calling in the context of creating an on-call schedule:

‍

Step 1: Understanding User Inputs

The process begins with the Assistant engaging the user in a conversation to gather all necessary details for the schedule. This involves asking about involved team members, rotation types, and on-call coverage. The Assistant's ability to parse natural language enables it to understand and categorize user responses into structured data that can be used in the next steps. The prompt for this conversation is provided in the Assistant’s instructions:

‍


As an AI assistant, your primary function is to aid users in generating schedules for ilert using the `create_recurring_schedule` function.

Include the function only if all the requirements are met, otherwise follow the steps below if requirements are missing.

### Basic Requirements
// list of requirements

### Handling Missing Requirements
// instructions on how to handle missing requirements

### JSON Output
// instructions how to output the JSON document

### Additional Guidelines
// additional instructions (e.g. language, how to deal with prompts beyond this scope, etc)

‍

Step 2: Executing Functions to Generate the Schedule

Once the input data is processed and organized, the Assistant leverages the function calling feature to execute a custom function designed for schedule creation. This function uses the prepared data to compute the on-call schedule, ensuring that all requirements and constraints are met. The culmination of this process is the generation of a JSON document that represents the finalized on-call schedule.

‍

	
  	{
        "name": "create_recurring_schedule",
        "description": "create a recurring on-call",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "the name of the schedule"
                },
                "timezone": {
                    "type": "string",
                    "description": "the ISO timezone e.g. Europe/Berlin"
                },
                "type": {
                    "type": "string",
                    "enum": [
                        "RECURRING"
                    ],
                    "description": "the type of the schedule"
                },
                "scheduleLayers": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {
                                "type": "string",
                                "description": "name of this layer"
                            },
                            "startsOn": {
                                "type": "string",
                                "description": "start of the layer"
                            },
                            "endsOn": {
                                "type": "string",
                                "description": "end of the layer, null if the layer has no end"
                            },
                            "users": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "id": {
                                            "type": "number",
                                            "description": "the identifier of the user"
                                        }
                                    },
                                    "required": [
                                        "id"
                                    ]
                                }
                            },
                            "rotation": {
                                "type": "string",
                                "description": "ISO 8601 period, e.g. P1D"
                            },
                            "restrictionType": {
                                "type": "string",
                                "enum": [
                                    "TIMES_OF_WEEK"
                                ]
                            },
                            "restrictions": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "description": "Restrictions for each day",
                                    "properties": {
                                        "from": {
                                            "type": "object",
                                            "properties": {
                                                "dayOfWeek": {
                                                    "type": "string",
                                                    "enum": [
                                                        "MONDAY",
                                                        "TUESDAY",
                                                        "WEDNESDAY",
                                                        "THURSDAY",
                                                        "FRIDAY",
                                                        "SATURDAY",
                                                        "SUNDAY"
                                                    ]
                                                },
                                                "time": {
                                                    "type": "string",
                                                    "description": "start of the restriction e.g. 08:15 "
                                                }
                                            }
                                        },
                                        "to": {
                                            "type": "object",
                                            "properties": {
                                                "dayOfWeek": {
                                                    "type": "string",
                                                    "enum": [
                                                        "MONDAY",
                                                        "TUESDAY",
                                                        "WEDNESDAY",
                                                        "THURSDAY",
                                                        "FRIDAY",
                                                        "SATURDAY",
                                                        "SUNDAY"
                                                    ]
                                                },
                                                "time": {
                                                    "type": "string",
                                                    "description": "end of the parent restriction e.g. 17:15"
                                                }
                                            }
                                        }
                                    },
                                    "required": [
                                        "from",
                                        "to"
                                    ]
                                }
                            }
                        },
                        "required": [
                            "startsOn",
                            "users",
                            "rotation"
                        ]
                    }
                },
            },
            "required": [
                "name",
                "timezone",
                "type",
                "scheduleLayers"
            ]
        }
    }

This application of OpenAI's function calling feature showcases the Assistant's ability to bridge conversational input with programmatic output, enabling complex task automation like schedule creation directly within a conversational interface.

‍

Below is a sample conversation with ilert AI to generate a follow-the-sun schedule:

‍

Reduce Noise With Alert Deduplication

In this section, we’ll cover how text embedding models can be leveraged for automated alert deduplication to reduce alert noise in incident response.

What is alert deduplication?

Alert deduplication is the process of identifying multiple alerts that refer to the same underlying issue and consolidating them into a single alert to avoid redundancy. The primary goal of alert deduplication is to reduce noise and prevent the overwhelming of incident response teams with multiple notifications for the same issue.

Using Embeddings Similarity Search for Alert Deduplication

‍

There are many methods to implement alert deduplication. These methods range from simple rule-based systems to more complex machine learning models, each with its own set of advantages and applications. However, traditional machine learning techniques like clustering and classification often require a solid understanding of data science principles and involve a more hands-on approach by data scientists. This section introduces an approach based on vector embeddings and the use of pre-trained models, which makes it more accessible for individuals without deep data science expertise.

‍

To begin, we'll explore the necessary concepts for this method.

Vector embeddings are a mathematical representation of data in a high-dimensional space, where each point (or vector) represents a specific piece of data, such as a word, sentence, or an entire document. These embeddings capture the semantic relationships between data points, meaning that similar items are placed closer together in the vector space.

‍

This technique is widely used in natural language processing (NLP) and machine learning to enable computers to understand and process human language by converting text into a form that algorithms can work with.

‍

When you use ChatGPT, for example, your prompts are transformed into a series of numbers first (a vector). Similarly, we will transform alerts into vectors using an embedding model.

An embedding model is a type of machine learning model that learns to represent complex data, such as words, sentences, images, or graphs, as dense vectors of real numbers in a lower-dimensional space.

‍

The key idea behind embedding models is to capture the semantic relationships and features of the data in a way that positions similar items closer together in the embedding space.

‍

This transformation enables algorithms to perform mathematical operations on these embeddings, facilitating tasks like similarity comparison, clustering, and classification more effectively.

‍

// Input
"A sentence like this will be transformed into a series of (thousands) number"

// Output
[
	-0.006929283495992422,
  -0.005336422007530928,
  -4.547132266452536e-05,
  -0.024047505110502243,
  ...
	// thousands more numbers
]

‍

OK, but how can we use this for alert deduplication?

‍

We will transform alerts into vector embeddings using OpenAI’s text embedding model. By comparing these vectors, we identify and deduplicate alerts that are semantically similar, even if they do not match exactly on a textual level.

‍

The following sections details the steps involved in the process:

‍

Steps of alert processing into vector embeddings

‍

Step 1: Preprocessing Alerts

Normalization: Standardize the format of incoming alerts to ensure consistency. If you’re using an alerting system like ilert, which sits on top of multiple alert sources and observability tools, alerts are already normalized into a common format.
Cleaning:
- Remove irrelevant information or noise from alerts, such as timestamps (which might be unique to each alert but irrelevant for deduplication) or alert IDs.
- Use plain text and avoid markdown or JSON. This will not only reduce the number of tokens used, but will also avoid that the format will account for deduplication.

‍

Step 2: Vectorization / Generating Text Embeddings

Text Embeddings Model Selection: Choose an appropriate text embeddings model that can convert alert messages into vectors. Models like BERT, OpenAI’s text embeddings, or Sentence-BERT (specially designed for sentence embeddings) can be suitable.

Vectorization: Each incoming alert is transformed into a vector using the selected model and stored in a vector database. Models trained on large datasets, including natural language text, can capture a wide range of semantic meanings, making them suitable for encoding the information contained in alerts.

‍

Step 3: Deduplication Logic

Similarity Measurement: Use a similarity measure to compare the vectorized alerts. The similarity between embeddings is measured using metrics such as cosine similarity or Euclidean distance. These metrics quantify how close two embeddings are in the vector space, with closer embeddings indicating how similar they are in terms of their semantic content. OpenAI recommends using cosine similarity.

Threshold Setting: A threshold is set to determine when two alerts are considered duplicates. If the similarity score between an incoming alert and any existing alert exceeds this threshold, the alerts are considered duplicates. This threshold can be tuned based on the precision and recall requirements of your use case.

Deduplication and Clustering: When two alerts are identified as duplicates, they are consolidated into a single alert record, with a counter to indicate the number of duplicate alerts received.

Optional Summary Generation: Use a GenAI model to generate concise summaries for clusters of duplicate alerts. This step can aggregate the key information from multiple alerts into a single, easily digestible notification.

‍

Step 4: Feedback Loop

Implement a feedback mechanism where operators can mark false positives or missed duplicates. Use this feedback to fine-tune the similarity threshold. The screenshot below shows how you can enable intelligent alert grouping in the alert source settings.

Group alerts by content similarity or time frame in ilert

Advantages

The advantages of using embeddings for alert deduplication include:

‍

Semantic Understanding:

Unlike exact text matching, embeddings can capture the meaning of alerts, allowing for the deduplication of alerts that are semantically similar but not textually identical.

Flexibility:

This method can handle variations in alert wording or structure, making it robust against changes in alert formats or sources.

Scalability:

Embeddings and similarity searches can be efficiently implemented using vector databases and libraries, making this approach scalable to handle large volumes of alerts.

‍

Challenges and Considerations

Model Selection:

‍The effectiveness of embeddings for deduplication depends on the quality of the embedding model. Domain-specific models or fine-tuned models may offer better performance by capturing relevant nuances.

Threshold Tuning:

‍Determining the optimal threshold for deduplication requires balancing between false positives (incorrectly merging distinct alerts) and false negatives (failing to identify duplicates). This may require empirical testing and adjustment.

Continuous Learning:

‍Over time, the nature of alerts may evolve, necessitating updates to the model or reevaluation of the similarity threshold to maintain deduplication effectiveness.

AI-assisted Incident Communication

Leveraging LLMs for AI-assisted incident communication can automate updates, combining efficiency with a human touch. This approach ensures clear, understandable updates on incident status and resolutions, enhancing user experience while freeing engineers to focus on resolution efforts. Automated communications can adapt to context and audience, ensuring that updates are relevant and accessible, thereby streamlining the communication process during incident management.

‍

Below is a sample incident generation prompt.

You are a member of the incident response team of a company. Our platform offers various services, which may occasionally experience issues. When problems arise, it’s crucial to inform our users in a straightforward, reassuring manner.

### Your Task
Based on the brief notes provided, generate an incident report in JSON format. This report should include:- **Incident Status**: Use one of the following: Investigating, Identified, Monitoring, or Resolved.

- **Summary**: Craft a concise, one-sentence summary of the incident (no more than 250 characters). This should provide a clear snapshot of the issue at hand.
- **Description**: Write a detailed explanation in simple, non-technical language. Aim to reassure our users by explaining what happened and what we're doing about it.
- **Affected Services**: List any services that are not fully operational, along with their current state (Operational, Under Maintenance, Degraded, Minor Outage, Major Outage). Choose from the following services: {{#each services}} {{name}}{{#unless @last}},{{/unless}}{{else}}No services{{/each}}.

Limit this list to a maximum of 6 services.

### Guidelines

1. **Incident Status Explained**:   
 - **Investigating**: The issue's cause is currently unknown.   
 - **Identified**: The cause of the issue has been discovered.   
 - **Monitoring**: The issue has been addressed, and we are now closely monitoring the situation to ensure stability.   
 - **Resolved**: All services are back to normal operation.
 
 2. Ensure the incident report is clear and avoids technical jargon. Remember, our audience may not have a technical background.
 
 3. Present the information in a JSON format, adhering to the structure and content guidelines provided.### Notes for the Incident```{{input}}```

‍

We have fully integrated generating new incidents and incident updates into ilert, as shown below:

The example shows how a simple prompt payment apis down is automatically turned into a complete incident description, including generating a summary and message, setting incident status and selecting the affected services from the prompt and the services available in the service catalog.

‍

Benefits of AI-assisted Communication include

Consistency and Clarity:

AI ensures that all communications are consistent in style and tone, reducing confusion and maintaining professionalism.

Efficiency:

‍Freeing up engineers from writing updates allows them to concentrate on resolving the incident, speeding up the overall response time.

AI-assisted Postmortem Analysis

In incident management, the post-mortem process is crucial for teams to learn from past events and reduce future risks. Traditionally, creating these post-mortem reports has been a detailed and time-consuming task, requiring a lot of effort to create a clear story from different sources of data. However, the introduction of AI and Large Language Models (LLMs) brings a new approach to incident analysis, making it easier and more efficient to develop post-mortem documents.

‍

Leveraging AI for Efficient Analysis

AI-assisted post-mortems make analyzing incidents straightforward and powerful. They leverage chat conversations, alerts, and reports, turning them into a detailed event narrative. This data, showing every decision, observation, and action, becomes easy to navigate with AI, making what used to be a complex task simple and automatic.

‍

AI systems can quickly read and understand large amounts of text from conversations and logs. They focus on finding the main events, choices made, and actions agreed upon. Using advanced language models, AI can pinpoint important details like how the incident affects the business, what caused it, and the steps taken to fix it.

‍

Crafting a Coherent Narrative

LLMs excel in gathering and arranging information into a well-structured, brief report. This report efficiently covers the entire event, including what led to it, the sequence of events, and the efforts made to manage it. The AI provides a straightforward and clear narrative that lays a solid groundwork for further examination and conversation.

‍

Moreover, AI-assisted post-mortems can highlight potential areas for improvement. By analyzing the incident in the context of past events and known issues, AI can identify patterns and suggest actionable insights. These might include recommendations for process adjustments, training needs, or system enhancements to prevent future occurrences.

‍

Benefits Beyond Efficiency

The advantages of integrating AI into the post-mortem process extend beyond saving time. By ensuring a high level of accuracy and comprehensiveness, AI-generated reports offer several benefits:

Consistency:

‍AI ensures that every post-mortem report adheres to a consistent format and level of detail, facilitating easier comparison and trend analysis across incidents.

‍Objectivity:

By relying on data and analytics, AI minimizes the potential for bias or oversight, offering an objective account of events.

‍Depth of Insight:

With the ability to process vast datasets, AI can uncover insights that might be overlooked in manual analysis, providing deeper understanding of underlying issues.

Actionable Recommendations:

‍AI models can correlate incident data with historical trends to offer pragmatic suggestions for preventing future incidents.

‍

AI driven post mortem generation using ilert

‍

Below is a sample post mortem report that was generated entirely by ilert AI.

‍

Example post mortem report created by ilert AI

Developing AI Features

Ensuring Data Security with LLMs

To ensure the security and integrity of our systems while utilizing Large Language Models (LLMs), we adhere to stringent data handling and operational protocols. Below is an outline of our key practices:

All data is processed in data centers within the EU. We’re using the Microsoft Azure OpenAI service. The Azure OpenAI Service is fully controlled and operated by Microsoft; Microsoft hosts the OpenAI models in Microsoft’s Azure environment and the Service does NOT interact with any services operated by OpenAI.
We allow our customers to opt-out from using any of our AI services.

To protect our systems from common LLM vulnerabilities and risks we follow the following practices:
- All operations that are executed as a result of an interaction with an LLM are non-destructive and can be undone.
- The data we share with LLMs is never influenced by the output of the LLM. The data we share is fixed and part of the first prompt. Subsequent interactions with the LLM don’t change the amount of data we share. LLMs are not directly connected to any of our services or databases.
- We limit the data that is shared with the LLM to an absolute minimum.

Building Incident Response with LLMs

Our journey into integrating Large Language Models into our incident response platform has been a revelation in both capability and complexity. LLMs, by their nature, are nondeterministic black boxes, offering powerful capabilities while presenting unique challenges. One of the most profound lessons we've learned is that the real-world application of LLMs unfolds in ways that are impossible to fully anticipate during the development phase. Users engage with LLM-based applications with an unpredictability that demands adaptability and insight.

‍

In response to this, ilert has embraced a philosophy where real-world usage data becomes the cornerstone of our AI feature development and refinement process. Recognizing that user interactions provide the richest insights for improvement, we’re incorporating user feedback and have implemented an intermediate observability layer that collects telemetry data for every interaction with an LLM:

‍

1. User Feedback Collection: Simple yet effective, we solicit direct feedback from our users in the form of a thumbs up or down response. This immediate gauge of user satisfaction allows us to quickly identify and address areas needing refinement.

‍

2. Intermediate Observability Layer: To deepen our understanding and enhance the responsiveness of our AI features, we've established an intermediate layer that captures a telemetry data, including:

User Inputs:
‍
‍What queries or commands users are submitting to the system.

LLM Outputs:

‍The responses generated by the LLM, which are crucial for assessing the appropriateness and accuracy of the model's outputs.

‍Error Logging:

‍Beyond mere system failures, we track instances where the LLM's output, although generated successfully, leads to errors downstream due to being contextually off-target or otherwise inappropriate.

Token Usage Metrics:

‍Monitoring the total number of input and output tokens used helps us optimize our models for efficiency and cost-effectiveness.

LLM Response time:

‍We track and monitor the response times of LLMs. The most advanced models usually have a longer response time.

Prompt version and LLM model:
‍
For every interaction, we store which model and which version of our prompt was used.

Feedback Integration:
‍
‍The direct feedback from users is linked with specific interactions, allowing us to pinpoint and prioritize enhancements.

Model Selection Strategy

Our approach to model selection emphasizes starting with high-performance models to ensure the best possible results, prioritizing outcome quality over costs and response time. Initially, this strategy allows us to confirm the effectiveness of an AI feature.

‍

Subsequently, we consider transitioning to more cost-efficient models after validating the feature's success and gathering sufficient real-world usage data. This ensures that any move to a less powerful model does not compromise user experience.

‍

This comprehensive observability framework ensures that our AI features do not just exist in a vacuum but evolve in a symbiotic relationship with our user base. It acknowledges the dynamic nature of LLM applications and the necessity of an iterative development process informed by real-world application. At ilert, we believe that the key to building reliable, user-centric AI-driven systems lies in embracing the unpredictability of user interaction, leveraging it as a rich source of feedback and innovation.

‍

Optimizing LLM App Development

In the development and refinement of our LLM based features, we regularly update our prompts and experiment with different models and their parameters. To test changes systematically, we utilize a few best practices:

Leveraging JSON for Structured Testing

‍Where possible, we use OpenAI's JSON mode for verifying outputs. This structured approach allows us to bypass the less reliable practices of string comparisons or output string checks.

Utilizing a library for writing tests for LLMs
‍
‍We use promptfoo - a library that facilitates writing tests for LLMs - to establish a comprehensive suite of tests. These tests are crafted with carefully selected sample prompts alongside their expected outcomes. This method not only streamlines the testing process but also ensures that our tests remain robust and relevant over time.

Continuous Improvement and Experimentation
‍
‍Our commitment to excellence compels us to continually refine our prompts. By experimenting with various models and adjusting their parameters, we aim to achieve optimal performance. Systematic testing plays a crucial role in this ongoing process, allowing us to objectively assess and enhance our LLMs.

Insights and Outlook on GenAI Integration

The integration of Generative AI into incident management represents not just an evolution, but a revolution in how organizations prepare for, respond to, communicate during, and learn from incidents. The future of incident management is undeniably intertwined with the continuous innovation and application of AI technologies, guiding us from preparation through to resolution with heightened precision and efficiency.

‍

Using LLMs and GenAI Across The Incident Response Lifecycle

We looked at practical use cases for GenAI across the incident response life cycle:

‍

ilert's Integration Journey

ilert's journey of embedding Large Language Models (LLMs) and Generative AI into its platform underscores the critical importance of real-world application feedback. By focusing on user feedback, adding an intermediate observability layer, and fine-tuning AI models based on actual usage, ilert sets a benchmark in developing AI features that genuinely resonate with user needs and expectations.

‍

Conclusion

The integration of Generative AI technologies by ilert marks a revolutionary step in incident management. Through the entire incident response lifecycle, the capabilities of GenAI have been vividly demonstrated, showcasing a future where every phase is enhanced by Generative AI's efficiency and scalability.

Download the guide

Get a pdf version of the guide.

‍