Applications can be fickle-minded beasts. They can keep running against all odds, but they can also crash and burn when you least expect them to.

It is not for nothing that software developers dread the infamous late-night support calls.

Applications always go down when you don’t want them to. And when developers muster the courage to log in to their systems and investigate what went wrong, they often find it was something as trivial as a connection issue.

All that had to be done was restart the application.

Of course, none of it matters. You’ve already been woken up from a good sleep or forced to quit whatever interesting stuff you were doing in your spare time.

In an ideal world, you want your deployments to stay up and running automatically.

You want your deployments to remain healthy at all times.

And above all else, you want all of this to happen without any manual intervention.

Developers often wish they had a magic wand to accomplish this!

Incidentally, Kubernetes is that magic wand.

Kubernetes provides facilities to perform regular health checks on the pods that are running in our cluster and take appropriate action.

In this post, I will explain how to set up a Kubernetes Liveness Probe step by step so that you have a better chance of avoiding those late-night support calls.

In case you are new to Kubernetes, do check out this detailed post about Kubernetes Pods.

1 – How does Kubernetes perform Health Checks?

When we create an unmanaged pod, a cluster node is selected to run it.

Running a pod means running the containers associated with that pod.

Kubernetes then monitors these containers and automatically restarts them if they fail for some reason. Just like magic!

INFO

Unmanaged pods are pods that don’t have a replication controller, replica set, or deployment backing them. More on them in a later post.

This magic is performed by the kubelet installed on the node.

It is the kubelet that is responsible for running the containers and keeping them running as long as the pod exists. If the container’s main process crashes, the kubelet will restart the container.

This self-healing is performed solely by the kubelet on the node hosting the pod. The Kubernetes Control Plane and its components running on the master nodes have no part in this process.

Of course, there is a catch!

If a particular node fails, the kubelet running on that node dies with it. And if the kubelet is gone, the unmanaged pods on that node are lost and won’t be replaced with new ones, because the Kubernetes Control Plane has no information about these pods.

This is an issue but solving it is a matter for another post.

2 – Default health check is not perfect!

Before we get to that, there is another issue with the current setup.

If our application has a bug that causes it to crash every once in a while, the kubelet will restart it automatically. Even if we don’t do anything special in the app itself, running the app in a Kubernetes pod grants it miraculous self-healing abilities.

But application bugs aren’t the only issues that can happen.

Sometimes, apps can stop working without their process crashing. For example, a Java app with a memory leak will start throwing OutOfMemoryErrors, but the JVM process will continue running. Kubernetes will continue to think that the pod is working admirably and won’t try to restart it.

But developers are smart. It might occur to you that you could simply catch these types of errors within the application and exit the process when they occur. Kubernetes will then see that the container has failed and will restart it.

While this may work for some cases, it is still not fool-proof.

What about those situations when our application stops responding because it has fallen into an infinite loop or a deadlock?

If the process has deadlocked and is unable to serve requests, a simple process health check will continue to believe that your application is healthy. After all, the process is still running.

As you can see, a simple process check is insufficient to ensure a water-tight health check.

To really make sure that applications are restarted when they are no longer working as expected, you must check the application’s health from the outside and not depend on the app doing it internally.

In other words, if the app is not doing what it is built to do, it is not healthy. No loose ends and technicalities.

3 – What is the Kubernetes Liveness Probe?

To solve the problems discussed in the previous section, Kubernetes introduced health checks for determining application liveness.

The liveness health checks run application-specific logic (such as loading a web page) to verify whether the application is actually running in a healthy manner.

Just a running application is not enough. It should also function as expected.

Since liveness health checks are specific to the application, developers have to define them in the pod manifest.

To support the concept of liveness in different types of applications, Kubernetes can probe a container using one of three different mechanisms:

  • The HTTP GET probe performs an HTTP GET request on the container’s IP address, at the port and path we specify. This type of probe is ideal for web applications and REST APIs.
  • The TCP Socket probe tries to open a TCP connection to the specified port of the container. This type of probe is suitable for database pods.
  • Lastly, the Exec probe executes an arbitrary command inside the container and checks the command’s exit status code. If the status code is 0, the probe is considered successful. The exec probe is often useful for custom validation logic that doesn’t fit neatly into a simple HTTP call. The TCP and exec variants are sketched right after this list.
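
For reference, here are two alternative snippets showing roughly how the TCP socket and exec variants can be declared. The port and command below are placeholders rather than values from our demo app:

# TCP Socket probe: succeeds if a TCP connection can be opened to the port
livenessProbe:
  tcpSocket:
    port: 5432

# Exec probe: succeeds if the command exits with status code 0
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy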

4 – Configuring a Kubernetes Liveness Probe

Let us now configure a Kubernetes liveness probe for a pod and see how it works.

As a first step, we need to create the pod manifest file that also contains the details about the liveness probe.

Check out the below manifest file (basic-pod-demo.yaml) for creating such a pod.

apiVersion: v1
kind: Pod
metadata:
  name: pod-demo-health-check
spec:
  containers:
  - image: progressivecoder/nodejs-demo
    imagePullPolicy: Never   # use the locally built image instead of pulling from a registry
    name: hello-service
    livenessProbe:
      httpGet:
        path: /
        port: 3000

Under the containers section, we have a dedicated section for livenessProbe. For this example, we have used the httpGet probe. We are also asking Kubernetes to probe the root path (/) on port 3000.

In order to simulate the scenario of a failing application, we will deploy some special logic in our demo NodeJS app.

Check the code below:

const express = require('express');

const app = express();

let requestCount = 0;

app.get('/', (req, res) => {
    requestCount++;
    if (requestCount > 2) {
        // After the first two requests, start failing so the liveness probe sees an error
        return res.status(500).send("The app is not well. Please restart!");
    }
    res.send("Hello World from our Kubernetes Pod Demo");
})

app.listen(3000, () => {
    console.log("Listening to requests on Port 3000")
})

Basically, the idea is that after the first two requests, the application starts returning an HTTP 500 status code.

Any HTTP status code outside the 200–399 range is considered a failure. It means something is wrong with your application. This will cause the liveness probe to fail, and Kubernetes will destroy and restart our container.

Once the above preparations are done, we can build a Docker image of our NodeJS app and create the pod with kubectl apply -f basic-pod-demo.yaml.

Soon after the pod starts up, you will find that its container starts failing.

If you describe the pod (kubectl describe pod pod-demo-health-check) right after a failure occurs, you should see output like the one below:

Name:         pod-demo-health-check
Namespace:    default
Priority:     0
Node:         docker-desktop/192.168.65.4
Start Time:   Mon, 31 Oct 2022 12:54:24 +0530
Labels:       <none>
Annotations:  <none>
Status:       Running
IP:           10.1.0.19
IPs:
  IP:  10.1.0.19
Containers:
  hello-service:
    Container ID:   docker://313647f86192db6432637eaf3dc4a72f47c81c6a879e6f2695eea78de4e3606b
    Image:          progressivecoder/nodejs-demo
    Image ID:       docker://sha256:22ee8c17f674c9c5814839234b3a30f830da8a7c9eb1e2bad98fb19d1dba0a25
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 31 Oct 2022 12:58:35 +0530
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 31 Oct 2022 12:57:45 +0530
      Finished:     Mon, 31 Oct 2022 12:58:35 +0530
    Ready:          True
    Restart Count:  5
    Liveness:       http-get http://:3000/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f46ql (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-f46ql:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  4m21s                 default-scheduler  Successfully assigned default/pod-demo-health-check to docker-desktop
  Normal   Killing    111s (x3 over 3m31s)  kubelet            Container hello-service failed liveness probe, will be restarted
  Normal   Pulled     110s (x4 over 4m21s)  kubelet            Container image "progressivecoder/nodejs-demo" already present on machine
  Normal   Created    110s (x4 over 4m20s)  kubelet            Created container hello-service
  Normal   Started    110s (x4 over 4m20s)  kubelet            Started container hello-service
  Warning  Unhealthy  81s (x10 over 3m51s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500

Pay special attention to the Containers section, where the Last State is shown as Terminated. Also, the Events section at the bottom captures the various activities that have occurred during this pod’s lifecycle.

When a container is killed by the kubelet, a completely new container is created. However, since our application is so fragile, the new container will also become unhealthy pretty quickly and kubelet will be forced to kill it.

Eventually, the pod will enter into the CrashLoopBackOff state. In this state, Kubernetes starts increasing the time between subsequent restarts hoping for someone to pay attention to this erratic container that keeps on failing.

5 – Kubernetes Liveness Probe Properties

Apart from the ones we specified in the pod manifest, there are also some other properties we can configure for our liveness probe to make it more useful.

See below example:

initialDelaySeconds: 5
timeoutSeconds: 1
periodSeconds: 10
failureThreshold: 3
  • The initialDelaySeconds configures an initial delay between the container starting and the first probe.
  • timeoutSeconds configures how long the prober waits for a response from the health endpoint before treating the probe as failed.
  • The periodSeconds sets the time between subsequent probes.
  • Lastly, the failureThreshold configures how many consecutive probe failures are needed before Kubernetes gives up and restarts the container.

If we don’t set a proper value for initialDelaySeconds, the prober will get to work immediately and will most probably fail, because the app isn’t ready to start receiving requests yet.

If the number of consecutive failures then crosses the failure threshold, Kubernetes will restart the container before it even gets a chance to start responding to requests properly. This can continue in a loop, leaving you wondering why your application is acting so weird.
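
Putting these properties together with the httpGet probe from our earlier manifest, the livenessProbe section would look something like this (the values here are illustrative rather than recommendations):

livenessProbe:
  httpGet:
    path: /
    port: 3000
  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 3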

6 – What makes an ideal liveness probe?

Though you can set the liveness probe to check any endpoint and it should mostly be fine, an ideal probe should do a little bit more.

  • Ideally, you should configure the probe to perform requests on a specific URL path such as /health and have the app perform an internal status check of all the critical components running inside it. Think of it as a health check-up where a doctor can’t pronounce a patient healthy or unhealthy just by checking the pulse. In one of the projects I worked on, we routinely checked whether the application had a working database connection before marking the probe as successful. A sketch of such a probe configuration follows after this list.
  • Make sure the special health endpoint does not require authentication. Otherwise, the probe will always fail and your container will keep suffering in the void of endless death and revival.
  • Liveness probes should not use too many computational resources and also, should not take too long to complete. The probe’s CPU time is counted as part of the container’s CPU time quota. Therefore, a heavy-weight liveness probe can take precious CPU time from the main application functionality.
  • Don’t bother implementing retry loops in your health endpoints. You can configure the retries using failureThreshold.
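
As a sketch of the first point above, the probe from our demo manifest could target a dedicated /health path instead of the root path, assuming the app actually exposes such an endpoint and serves it without authentication:

livenessProbe:
  httpGet:
    path: /health   # aggregated internal status check, served without authentication
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10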

Conclusion

Is there any developer who doesn’t want a healthy lifestyle?

I bet no one.

But unless your deployment systems have some sort of self-healing capability, chances are that incessant support calls won’t let you lead a healthy lifestyle.

To get there, think about leveraging the power of Kubernetes liveness probes for application health checks. They are remarkably easy to implement and highly configurable.

Even making your liveness probe simply check the root path of your application can work wonders in reducing unnecessary issues. Moreover, it can also result in less downtime and a better user experience.

Want to learn more Kubernetes concepts? Check out this post on Kubernetes Replication Controller.


