A sneaky way to deploy stateful apps on Kubernetes
Hello, I’m Sanskar and I work in the engineering team at Fyle. We use AWS managed kubernetes to deploy our applications, which are primarily stateless workloads, i.e. they do not require persistent disk storage. BUT, recently I had to deploy a stateful application in our cluster, and this is the tale of how I ended up learning a bit more about kubernetes, against my will.
If you use kubernetes or are aware of its basic concepts like deployments and pods - this may be a good read for you. More so, if you use a managed kubernetes service provider like AWS EKS.
Building some context
Kubernetes deployments have ephemeral storage by default, which is wiped whenever the pod using it terminates or crashes. That’s why it’s relatively simpler to deploy stateless applications - like nginx, a python server, or a golang queue worker. That’s what God intended when They created kubernetes. But we live in an evolving world and human greed is unbound. So, eventually, the kubernetes API evolved to support deployment of stateful workloads - like databases and message brokers.
At Fyle, we’ve a couple of stateful applications which we deploy in our kubernetes cluster, and I’ll talk about the hiccups we had while trying to sneak them in with our largely stateless workloads.
For the sake of this blog, I’ll go ahead with an example stateful application which simply creates a single file in a directory on its file system.
The boring bit : Provisioning storage for deployments
Alright, if you want to provision permanent disk storage for an application, kubernetes provides two resources called Persistent Volume (PV) and Persistent Volume Claim (PVC). A PV is the actual piece of disk storage provisioned to a cluster and a PVC is a “claim” to reserve a part of that disk storage.
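In managed clusters like ours, PVs are usually provisioned dynamically: the PVC references a StorageClass (or falls back to the cluster’s default one) and the storage driver creates the backing disk on demand. If you’re curious which StorageClass your claims would land on, a quick check (on our cluster this is the gp2 class that shows up in error messages later):

# list StorageClasses; the default one is marked "(default)"
kubectl get storageclass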
Let’s get hands on with it and deploy a simple stateful application to a cluster.
For our convenience, we already have a simple 4-node AWS EKS cluster running kubernetes version 1.24, with disk storage for PVs provisioned using AWS EBS (block storage). To reserve and use a piece of that storage, we define a PVC and use it in our deployment like below
Creating a PersistentVolumeClaim :
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-stateful-app-pvc
  labels:
    app: my-stateful-app
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Using the PVC in our deployment (unneeded details skipped using “…”):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-stateful-app
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: my-stateful-app
          image: busybox
          command: [ "/bin/sh", "-c", "--" ]
          args: [ 'touch /data/file1.txt ; while true; do echo "Hellu"; sleep 2; done;' ]
          volumeMounts:
            - name: app-data-vol
              mountPath: /data
      volumes:
        - name: app-data-vol
          persistentVolumeClaim:
            claimName: my-stateful-app-pvc
      ...
You can check out the complete yaml files here in this gist.
We’re doing two things in our deployment here:
1. We’re creating file1.txt in the /data directory when our container starts.
2. We’ve created a volume called app-data-vol which will store all contents of the /data directory in a persistent volume - i.e. all contents of /data will survive pod crashes/terminations/restarts.
Going ahead and deploying this application on our cluster, we see it is live and running without any issues and we also see that file1.txt is created inside /data.
BUILD-MACHINE:~/sanskar $ kubectl apply -f my-stateful-app.yaml
persistentvolumeclaim/my-stateful-app-pvc unchanged
deployment.apps/my-stateful-app configured
BUILD-MACHINE:~/sanskar $ kubectl get po -w
NAME READY STATUS RESTARTS AGE
my-stateful-app-c578c7fcb-bq8bz 1/1 Running 0 2m33s
BUILD-MACHINE:~/sanskar $ kubectl exec -it my-stateful-app-c578c7fcb-bq8bz -- ls /data
file1.txt lost+found
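As a side note, we can also peek at the claim and the volume that was provisioned behind it - roughly like this (the claim should show up as Bound, assuming dynamic provisioning worked as it did on our cluster):

kubectl get pvc my-stateful-app-pvc   # STATUS should be Bound, with the generated volume name
kubectl get pv                        # the dynamically provisioned EBS-backed volume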
Since the /data directory is mounted to a persistent volume, its contents will remain even if we delete the pod or the entire deployment and bring up a new one with the same persistent volume mounted. Let’s quickly verify that by deleting our deployment and redeploying it with one minor change - creating the file with the name file2.txt this time. The change/diff in yaml is simply as below
...
          command: [ "/bin/sh", "-c", "--" ]
-         args: [ 'touch /data/file1.txt; while true; do echo "Hellu"; sleep 2; done;' ]
+         args: [ 'touch /data/file2.txt; while true; do echo "Hellu"; sleep 2; done;' ]
          resources:
            requests:
...
Redeploying this change, the new pod comes up. On checking the contents of the /data directory, we see that both the old and the new file are present
BUILD-MACHINE:~/sanskar $ kubectl exec -it my-stateful-app-c578c7fcb-bq8bz -- ls /data
file1.txt file2.txt lost+found
So this is how we simply deploy stateful applications on kubernetes using a persistent volume. It just works. Now that we’ve set some good context, let’s delve into some nuances and discuss some potential issues with our setup. Masala begins here.
The spicy bit : Issues with the boring bit
Although we’ve deployed a stateful application on our kubernetes cluster, there are some shortcomings in our plain setup - which I figured out a couple of weeks after deploying actual stateful workloads to production, of course.
Continuing with our example of my-stateful-app, let’s update some deployment configuration and re-deploy it a couple of times - to simulate the real world scenario of deploying new changes. You can also try scaling the deployment to multiple replicas instead - all we want to do is spin up new pods again and again.
BUILD-MACHINE:~/sanskar $ kubectl set env deployment/my-stateful-app DUMMY=<some-random-value>
BUILD-MACHINE:~/sanskar $ kubectl rollout restart deployment my-stateful-app
...
When we do this a few times, the universe works its magic, and at some point you’ll see that the new pod does not come up as expected, but is stuck in the ContainerCreating state. When we describe the pod, we see an error in the pod events.
BUILD-MACHINE:~/sanskar $ kubectl get po -w
NAME READY STATUS RESTARTS AGE
my-stateful-app-5c895f588-9zxxj 1/1 Running 0 4m28s
my-stateful-app-5cd855bdb-nfn48 0/1 ContainerCreating 0 4m8s
BUILD-MACHINE:~/sanskar $ kubectl describe po my-stateful-app-5cd855bdb-nfn48
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22s default-scheduler Successfully assigned default/my-stateful-app-5cd855bdb-nfn48 to ip-192-170-68-146.ap-south-1.compute.internal
Warning FailedAttachVolume 22s attachdetach-controller Multi-Attach error for volume "pvc-e7bc9a21-d97e-47a2-9a50-4bd40d631571" Volume is already used by pod(s) my-stateful-app-5c895f588-9zxxj, my-stateful-app-7557d975b-mlh64
This error basically says that the new pod was unable to attach the volume because an older pod is still using it. This is weird at first look, right? We’ve seen the same deployment strategy working for us previously - usually the new pod comes up in 1/1 Running state and then the old pod is terminated. So why did this error pop up randomly in one of the rollouts?
The culprit(s)
Reading up about this “Multi-Attach error”, I understood that our persistent volume can only be attached to one node at a time, since we created the volume claim (i.e. the PVC) with access mode ReadWriteOnce earlier. But what’s the connection between a node and a volume? Didn’t we link our volume to our container / “Pod”?
Now is a good time to recall the fact that a kubernetes cluster is an abstraction over a bunch of compute instances (Nodes) which deploys your containerised application (Pods) on those compute instances, without you having to actually care about hows and whys of it.
When we attach a volume to a container, it does not magically link an actual storage volume to our container. What’s happening underneath is: a storage volume is bound to the “Node” where our pod is deployed, and the processes of our “Pod” are given read/write access to that volume. Essentially - a volume can be bound to one or multiple nodes underneath (depending on the access mode specified in the volume claim, i.e. the PVC).
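If you want to see this node-level binding with your own eyes, the attachment itself is a kubernetes object (at least with the EBS CSI driver our cluster uses) - a quick, hedged check:

# each VolumeAttachment row maps a persistent volume to the node it is attached to
kubectl get volumeattachments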
Going back to our original issue: since our cluster has 4 nodes, kubernetes can choose to schedule the pods on any of them. When we check where our application’s pods are scheduled, we see that the old and new pods are on different nodes.
BUILD-MACHINE:~/sanskar $ kubectl describe po my-stateful-app-5c895f588-9zxxj | grep Node:
Node: ip-192-170-66-154.ap-south-1.compute.internal/192.170.66.154
BUILD-MACHINE:~/sanskar $ kubectl describe po my-stateful-app-5cd855bdb-nfn48 | grep Node:
Node: ip-192-170-68-146.ap-south-1.compute.internal/192.170.68.146
This is what leads to the Multi-Attach error we saw.
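(A quicker way to check pod placement, instead of describing each pod, is the -o wide flag - the label selector below assumes the app: my-stateful-app label from our deployment:)

kubectl get po -l app=my-stateful-app -o wide   # the NODE column shows where each pod landed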
The fix … and more issues
Now that we understand how volumes, PVCs and nodes are related, we can modify our PVC to use access mode ReadWriteMany to make it work. But access mode is immutable once a PVC is created. So let’s create a new PVC with access mode ReadWriteMany and use that in our deployment. Here’s the diff to our original yaml
diff --git a/my-stateful-app.yaml b/my-stateful-app.yaml
index 774bc7e2..831bbc8e 100644
--- a/my-stateful-app.yaml
+++ b/my-stateful-app.yaml
@@ -1,12 +1,12 @@
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
-  name: my-stateful-app-pvc
+  name: my-stateful-app-pvc-2
   labels:
     app: my-stateful-app
 spec:
   accessModes:
-    - ReadWriteOnce
+    - ReadWriteMany
   resources:
     requests:
       storage: 1Gi
@@ -49,6 +49,6 @@ spec:
       volumes:
         - name: app-data-vol
           persistentVolumeClaim:
-            claimName: my-stateful-app-pvc
+            claimName: my-stateful-app-pvc-2
       imagePullSecrets:
When we apply this modified yaml, we see that the new pod won’t come up and is stuck in the Pending state.
BUILD-MACHINE:~/sanskar $ kubectl get po -w
NAME READY STATUS RESTARTS AGE
my-stateful-app-5c895f588-9zxxj 1/1 Running 0 7m24s
my-stateful-app-689cd6bc7-tdv9l 0/1 Pending 0 21s
Doing the usual drill of describing resources, we see that the new PVC is itself stuck in the Pending state - the volume it asks for cannot be provisioned. The error displayed in the PVC events is as below
BUILD-MACHINE:~/sanskar $ kubectl describe pvc my-stateful-app-pvc-2
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning ProvisioningFailed 45s (x13 over 19m) ebs.csi.aws.com_ebs-csi-controller-5fd5966556-4hpqn_0556e150-7504-41a4-9fd2-3db5505b871b failed to provision volume with StorageClass "gp2": rpc error: code = InvalidArgument desc = Volume capabilities MULTI_NODE_MULTI_WRITER not supported. Only AccessModes[ReadWriteOnce] supported.
Our volume’s provisioning failed because AWS EBS - the type of block storage provisioned for our EKS cluster - does not support mounting a single volume to multiple nodes. So when we try to “claim” a volume using access mode ReadWriteMany, the provisioning simply fails, leaving our volume claim and pod stuck in the Pending state.
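You can confirm what’s backing the claims on your own cluster by looking at the StorageClass itself - on ours, gp2 points at the EBS CSI provisioner that shows up in the error above:

kubectl get storageclass gp2 -o jsonpath='{.provisioner}'   # e.g. ebs.csi.aws.com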
Now this is an issue fundamental to our cluster setup. Fixing the root issue here would require changing our kubernetes cluster’s storage backend. Since the majority of our application workloads at Fyle are stateless, we never had to think about cluster storage deeply. Now that we’re trying to squeeze in a stateful application - this becomes evident.
Changing the entire cluster’s storage provisioning for a couple of applications is not practical. So I started looking for ways to avoid it and still make stateful applications work smoothly on our cluster. This is when I stumbled upon the concepts of nodeAffinity and podAffinity in kubernetes.
PodAffinity to the rescue
The core problem statement now, as it evolved, was to figure out a way to schedule all pods of a deployment on a particular node - the node where our persistent volume is bound.
NodeAffinity is a pod scheduling construct which helps us constrain which nodes a pod can be scheduled onto. PodAffinity, similarly, provides even more control, as it can constrain the scheduling of pods relative to other pods in the cluster.
On first look, NodeAffinity seems to be the thing we need, BUT using it has a huge downside - it binds infra components and the application together. If we set up a new cluster or bring new nodes into our cluster, we would always have to keep this node-level spec updated - that’s a pain to maintain. Also, we would need to specify the same nodeAffinity for our PVC too.
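For contrast, here’s roughly what pinning the deployment with nodeAffinity would look like - note the hard-coded node hostname (a made-up value below), which is exactly the infra-application coupling we want to avoid:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - ip-10-0-0-42.ap-south-1.compute.internal  # hypothetical node name, hard-coded :(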
PodAffinity, on the other hand, is relatively flexible - we can specify where a pod should be scheduled relative to the placement of other pods. There is no mention of a node, so we’re not breaching the abstraction of infra from application. This makes it the more sensible choice, and we’re gonna go ahead and use it. We can instruct kubernetes to always schedule the pods of our deployment my-stateful-app on the same node where the previous pod was deployed.
diff --git a/my-stateful-app.yaml b/my-stateful-app.yaml
index 774bc7e2..219e5cfb 100644
--- a/my-stateful-app.yaml
+++ b/my-stateful-app.yaml
@@ -30,6 +30,16 @@ spec:
       labels:
         app: my-stateful-app
     spec:
+      affinity: # always schedule the new pod on the node where the old pod was scheduled (to prevent the volume attach error)
+        podAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            - topologyKey: kubernetes.io/hostname
+              labelSelector:
+                matchExpressions:
+                  - key: app
+                    operator: In
+                    values:
+                      - my-stateful-app
       containers:
         - name: my-stateful-app
           image: busybox
Here, we’re basically saying that the pod should always be scheduled on a node (note that the topological construct we specified is kubernetes.io/hostname, which is unique per node) where the kubernetes scheduler can find another pod with the label app: my-stateful-app. Since our deployment strategy is RollingUpdate, the old pod is only terminated after the new pod is up and running. This is analogous to a baton race, where the old pod passes the “node info” to the new pod, which then takes over its place.
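One thing this baton race quietly relies on is the rollout bringing the new pod up before tearing the old one down. With a single replica the RollingUpdate defaults already behave this way, but if you want to make it explicit, something like the following can be added to the deployment spec (a sketch - these exact values are my choice, not part of our original yaml):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # allow one extra pod during a rollout (the "new runner")
      maxUnavailable: 0  # never take the old pod down before the new one is Ready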
Once we delete the deployment and re-create it, linking the same old PVC (the one with access mode ReadWriteOnce) and the new podAffinity changes, the pod comes up running and our app works as expected.
BUILD-MACHINE:~/sanskar $ kubectl delete deployment my-stateful-app
deployment.apps "my-stateful-app" deleted
BUILD-MACHINE:~/sanskar $ kubectl apply -f my-stateful-app.yaml
persistentvolumeclaim/my-stateful-app-pvc unchanged
deployment.apps/my-stateful-app configured
BUILD-MACHINE:~/sanskar $ kubectl get po
NAME READY STATUS RESTARTS AGE
my-stateful-app-c578c7fcb-nm2x5 1/1 Running 0 47s
Also, we don’t need to worry about the initial deploy, i.e. when there is no “old pod”. When a PVC is created, it is not bound to any node until the first consumer of that PVC is deployed. Kubernetes is smart enough to schedule the pods (of the PVC’s first consumer) to a node first, and then bind the volume to that node.
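This “bind only when the first consumer shows up” behaviour comes from the StorageClass’s volumeBindingMode being WaitForFirstConsumer (which, as far as I can tell, the default gp2 class on EKS uses) - worth a quick check on your own cluster:

kubectl get storageclass gp2 -o jsonpath='{.volumeBindingMode}'   # expect WaitForFirstConsumer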
This largely solves our issue and we can deploy stateful workloads on kubernetes relatively safely. But keep in mind that if the node where your volume is bound ever runs low on memory, the new pod won’t be scheduled and will be stuck in the Pending state. This was not an issue for us, as we safely over-provision compute capacity in our cluster relative to what our applications require, and we also have alerts and monitoring in place for node memory usage.
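If you ever want a quick manual look at the headroom on the node holding your volume, something like this works (kubectl top needs metrics-server running in the cluster; the node name below is just a placeholder):

kubectl top node ip-10-0-0-42.ap-south-1.compute.internal
kubectl describe node ip-10-0-0-42.ap-south-1.compute.internal | grep -A 8 "Allocated resources"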
Summing up, we see that deploying stateful apps on kubernetes isn’t so bad at all - it’s just a little more dance, depending upon your cluster setup OR which managed kubernetes platform you’re using. Provisions in the base kubernetes API (like PodAffinity) can come in handy in dealing with the quirks and edge cases that come in the way of deploying various kinds of application workloads on different cloud providers.
Worst case, you can always go ahead and write your own operator 🙃 (Don’t)
Alright, thanks for reading this far - I hope you learned something new here, or at least that it was a worthwhile read for you. If you enjoyed reading this, or are generally intrigued by engineering at SaaS startups like Fyle, drop us a mail at careers@fylehq.com or, the less invasive option: tune in to the SaaS Engineering series by Siva. Sayonara :)