Kubernetes is hard, but let's make monitoring and alerting for Kubernetes simple!
At iLert we build architectures composed of microservices and serverless functions that scale massively and seamlessly to guarantee our customers uninterrupted access to our services. Like many others in the industry, we rely on Kubernetes for the orchestration of our services. Responding fast to any kind of incident is an absolute must for our uptime platform, as every second counts, and we like to reduce complexity wherever possible. iLert’s Kubernetes agent and incident actions help us respond to issues extremely fast.
As of today there are many solutions for collecting Kubernetes cluster metrics, such as Prometheus, and turning them into graphs and alerts with the help of Grafana. These are great solutions with lots of features and contributions, but some problems go hand in hand with them:
First, the complexity of such monitoring systems means that our goal (to get an alert if a problem occurs) cannot be achieved if one of the monitoring components itself fails. Let’s take a closer look. Suppose I want to be notified when my service or my Kubernetes node has reached its memory limit. What do I need for this? At a minimum: Prometheus, Grafana, kube-state-metrics and node exporter running in my cluster. Then I need to create a Grafana dashboard with the metrics I am interested in and individual alerting conditions for each service, because the memory used by each one is different. This is far from easy and adds a whole layer of services that themselves require additional monitoring. And if I want to be notified when one of my monitoring components fails, I need an HA solution such as Cortex, which again adds a whole additional level of complexity to my system.
Besides the added complexity and maintenance, the resources used by monitoring systems often exceed the resources consumed by the business services themselves (especially in smaller companies or start-ups). This might be acceptable if we want transparency into the resource usage and performance of our services, or if we need to visualize business metrics. But if we just want to get an alert about a standard problem in our cluster, e.g. a pod going down, it is clearly too much hassle.
The open source community is very active and we are very grateful for it, but this also means we get several updates a month, and of course we want to use them, so we install them on our cluster over and over again. Unfortunately these updates don’t always go smoothly, and we have to take the time to deal with their consequences.
Finally, if a problem occurs, we need to act very quickly so that customers are not impacted. This is not always easy, especially when you have been asleep for a couple of hours in the middle of the night and need to react to an incoming alert as soon as possible; you really want to keep distraction and friction to a minimum. Besides having to log in and search for the issue, another problem is the way the monitoring system collects metrics and how alerts based on aggregated metrics work. For example, suppose I want to be notified when my service has been terminated (e.g. a bug that causes an NPE, or our favorite, OOMKilled). Prometheus collects metrics at a certain frequency, usually once every 15 seconds, so in the worst case we lose those 15 seconds if the problem occurs right after Prometheus has collected its data. After that, Grafana typically evaluates the data every minute and looks at a window of the last 5 minutes to decide whether or not to raise an alarm. That means in the best case we get an alarm about a minute after our critical service has been terminated, and in some cases only after 5 or 10 minutes. Why waste so much time when we can get the information directly from Kubernetes in the same second and react much faster?
So let’s see how we can make our lives easier and separate alerts from metrics with iLert Kubernetes Agent.
You will need an iLert account, a running Kubernetes cluster, and, for the incident action part of this example, an AWS account.
Before we start, let’s first understand how it works. In the following example, I want to be notified about problems in my cluster via SMS, Voice or push notifications and respond as quickly as possible to fix the problem.
ilert-kube-agent is a service that listens to the Kubernetes API server and generates incidents about the health state of the pods and the nodes. The agent detects the most common problems in a Kubernetes cluster and does not require additional configuration. However, it is possible to pass a simple configuration if, for example, you don’t want to receive a specific type of alert. As soon as a problem is detected, the agent creates an incident about it in iLert. Next comes the reaction to the incident, in this case using a Lambda function, in order to quickly fix the potential problem without requiring another context switch.
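To make the idea concrete, here is a minimal sketch (not the agent’s actual implementation) of how a watch on the Kubernetes API surfaces problems the moment they happen. It uses client-go, assumes it runs inside the cluster with a service account that may list and watch pods, and simply logs OOMKilled containers where the real agent would create an iLert incident:

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Use the in-cluster service account, as an agent deployed in the cluster would.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Open a watch on pods in all namespaces; events are pushed by the API server
	// as soon as they occur, with no scrape interval in between.
	watcher, err := clientset.CoreV1().Pods(metav1.NamespaceAll).
		Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cs := range pod.Status.ContainerStatuses {
			term := cs.LastTerminationState.Terminated
			if term != nil && term.Reason == "OOMKilled" {
				// This is where an agent would create an incident via the iLert API
				// instead of just logging.
				fmt.Printf("container %s in pod %s/%s was OOMKilled\n",
					cs.Name, pod.Namespace, pod.Name)
			}
		}
	}
}
```

Because the events are pushed from the API server, the delay between a container being killed and the event arriving is typically well under a second, which is exactly what makes this approach faster than waiting for scraped metrics to be evaluated.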
In this example I will use the helm setup. If you’re considering an installation with a plain yaml manifest, Terraform or Serverless (Lambda), you can refer to our instructions here.
Finished! Now Kubernetes events will create incidents in iLert.
For the sake of this demonstration we are using AWS Lambda; you may use any other externally triggerable resource.
In order to respond to an incident with incident actions in our Kubernetes cluster, we need to create a Lambda connector and an incident action for our Kubernetes alert source in iLert. The first thing to do is to create a Lambda function in AWS. For our demonstration purposes, I have created this repository, which contains a simple Lambda function with utilities to scale a deployment or statefulset.
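The repository linked above contains the actual function; purely as an illustration of the idea, here is a minimal sketch in Go of a Lambda handler that scales a deployment. The payload fields, the kubeconfig path and the error handling are assumptions made for this example, not the repository’s real interface:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/aws/aws-lambda-go/lambda"
	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleRequest is a hypothetical payload; the fields your incident action
// actually sends may be named differently.
type scaleRequest struct {
	Namespace  string `json:"namespace"`
	Deployment string `json:"deployment"`
	Replicas   int32  `json:"replicas"`
}

func handler(ctx context.Context, raw json.RawMessage) (string, error) {
	var req scaleRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return "", err
	}

	// Assumes a kubeconfig packaged with the function (or injected from a secret
	// store) that grants access to the target cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/var/task/kubeconfig")
	if err != nil {
		return "", err
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return "", err
	}

	// Update the deployment's scale subresource to the requested replica count.
	scale := &autoscalingv1.Scale{
		ObjectMeta: metav1.ObjectMeta{Name: req.Deployment, Namespace: req.Namespace},
		Spec:       autoscalingv1.ScaleSpec{Replicas: req.Replicas},
	}
	if _, err := clientset.AppsV1().Deployments(req.Namespace).
		UpdateScale(ctx, req.Deployment, scale, metav1.UpdateOptions{}); err != nil {
		return "", err
	}
	return fmt.Sprintf("scaled %s/%s to %d replicas",
		req.Namespace, req.Deployment, req.Replicas), nil
}

func main() {
	lambda.Start(handler)
}
```

In iLert, the Lambda connector is then attached to the Kubernetes alert source as an incident action, so the function can be triggered with a single tap from the incident itself.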
We are now able to react to each Kubernetes incident with a quick incident action from within iLert.
Now let’s see our setup in practice. In this example we get an alert that our service uses an increased amount of memory, probably because it has to process more traffic than usual. I want to spread the load across more replicas of this service as soon as I get a notification about the problem, and analyse the issue afterwards.
As you can see I used the incident action that we created before to solve the problem, right from my smartphone.
Checking the status of our service in Kubernetes, we see that it has more replicas now.
Of course not all problems are solved by scaling or rolling back, but response time is always critical to a successful business.
Our experience with clusters and monitoring systems of different sizes shows that you should not always rely on a single solution, even a very popular and well-established one. Solving a day-to-day problem in the shortest and most efficient way can make your life a lot easier.
A better way to be on-call for your team.