Kubernetes storage
Manage data persistence in Kubernetes
- K8s supports the CSI (Container Storage Interface) specification, which allows third parties to write "out-of-tree" plugins that provide external storage capabilities.
- For simplicity's sake, dynamic provisioning of volumes with `StorageClass` will be ignored and relevant values left at their defaults when needed.
- Though most use cases involve cloud-provided storage capabilities, this document will focus on self-managed storage systems instead.
- By construction, the file system exposed to a process by the `container-runtime` is ephemeral and disappears once the container exits.
- K8s provides specific objects to support data persistence for workloads and decouple the container file system from its lifecycle:
  - `PersistentVolume` is a cluster resource: it exposes an underlying storage resource to the cluster as a directory.
  - `PersistentVolumeClaim` is a workload resource: a request by a container to mount a directory into its own file system.
- This enforces a clear separation of concerns:
  - Provisioning of storage resources is done outside of K8s, by an administrator or a cloud storage provider.
  - Consumption of storage resources happens inside K8s, when containers access volumes mounted in their file system.
- A `Pod` transparently consumes storage resources the same way it consumes `Node` resources for process execution.
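As a minimal sketch (all names and paths below are hypothetical), a `PersistentVolume` backed by a node directory, a `PersistentVolumeClaim` requesting it, and a `Pod` mounting the claimed directory:

```yaml
# Cluster resource: exposes a node directory to the cluster as a volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data            # hypothetical directory on the node
---
# Workload resource: requests storage without knowing where it comes from.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
# The pod consumes the claimed storage like any other resource.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: app
      image: nginx             # hypothetical workload image
      volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-pvc
```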
The design proposal for persistent storage states that "Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider."
- `kube-controller-manager` runs a controller that continually scans the desired cluster state for new `PersistentVolumeClaim` objects.
- When a new claim is added, the controller tries to find a suitable `PersistentVolume` object to bind the claim to.
- If no suitable volume exists, a new volume is dynamically provisioned using the claim's `StorageClass`, if available.
- Once a suitable volume is available, it is bound to the claim using a `ClaimRef` and mounted into the containers of the pods that reference the claim, according to their `PodSpec`.
- Once those pods disappear (workload container exit, etc.), the volume is subjected to its reclaim policy:
  - `Retain`: default for manually created volumes; the `PersistentVolume` still exists and un-provisioning has to be done manually.
  - `Delete`: default for dynamically provisioned volumes; depends on CSI implementation support.
Note: the claim and the pods will remain `Pending`, and the workload containers won't start, until a suitable volume becomes available.
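Once bound, the volume's spec carries a `ClaimRef` pointing back to the claim. A sketch of the relevant fields on a bound volume (names hypothetical, other spec fields omitted):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  # Set by kube-controller-manager when the volume is bound.
  claimRef:
    kind: PersistentVolumeClaim
    namespace: default
    name: demo-pvc
  # Applied once the pods using the claim are gone.
  persistentVolumeReclaimPolicy: Retain
```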
Careful provisioning of storage resources and careful crafting of manifest files should make the following operations unnecessary in most cases; however, they remain available:
- Pre-bind claims to specific volumes, as shown in the sketch after this list (overrides `kube-controller-manager`'s matching of claims and volumes).
- Modify the size of a volume or a claim (requires CSI implementation support and write access).
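A pre-binding sketch from the claim side, assuming a volume named `reserved-pv` already exists; setting `volumeName` bypasses the controller's matching:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: reserved-pvc
spec:
  volumeName: reserved-pv   # pre-binds this claim to a specific volume
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

The volume side can symmetrically reserve itself for a claim by setting its `claimRef` field.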
- Volume `type` is not a field. Instead, the `PersistentVolumeSpec` object includes dedicated fields for all supported storage providers (see the sketch after the table below).
- Cloud-provided storage requires plugins that implement the CSI standard, so the volumes using them will always be of type `csi`.
The following `PersistentVolume` types are part of the core K8s API:

| type | storage source / provider |
| --- | --- |
| `hostPath` | Mount a local directory on a single node |
| `local` | Mount a local directory on a node, in a cluster-aware way (scheduling via `nodeAffinity`) |
| `nfs` | Mount a directory from an external NFS server |
| `csi` | Mount a directory from an "out-of-tree" volume plugin |
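As a sketch (values hypothetical), the same `PersistentVolume` kind becomes a `hostPath` or an `nfs` volume depending solely on which provider field its spec carries:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-hostpath
spec:
  capacity:
    storage: 1Gi
  accessModes: [ReadWriteOnce]
  hostPath:                  # this field makes it a hostPath volume
    path: /mnt/data
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs
spec:
  capacity:
    storage: 1Gi
  accessModes: [ReadWriteMany]
  nfs:                       # this field makes it an nfs volume
    server: 10.0.0.10        # hypothetical server address
    path: /exports/data
```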
- If a `Deployment` has a claim to a `local` volume, `kube-scheduler` will place the pods according to the volume's `nodeAffinity`.
- Thus, `nodeAffinity` has to be carefully configured in order to ensure that the directories needed for mounts are actually available on the chosen nodes.
- The `nodeAffinity` field is immutable once the volume has been written to the cluster state.
- An external static provisioner can be used to automate the provisioning and deletion of `local` volumes.
- `local` volumes also require a storage class for accurate scheduling of pods, as in the sketch below.
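A sketch of a `local` volume and its storage class, assuming a node named `worker-1` and a disk mounted at `/mnt/disks/ssd1`; the `no-provisioner` class with `WaitForFirstConsumer` delays binding until a pod is scheduled, so the volume's `nodeAffinity` can be honoured:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning
volumeBindingMode: WaitForFirstConsumer     # bind only once a pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: [ReadWriteOnce]
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1                   # hypothetical disk on the node
  nodeAffinity:                             # mandatory for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [worker-1]            # hypothetical node name
```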
- `nfs` volumes can be mounted from an NFS server running inside the cluster, as in the sketch below.
- The preferred approach to set up an NFS server is to use NFS kernel features (detailed walkthrough here).
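A sketch of an `nfs` volume pointing at an in-cluster server, assuming the server is exposed through a `Service` and reachable from the nodes (the mount is performed by the node itself, so a stable ClusterIP is safer than a cluster DNS name):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes: [ReadWriteMany]      # NFS supports concurrent writers
  nfs:
    server: 10.96.0.50              # hypothetical Service ClusterIP
    path: /exports/shared           # hypothetical exported directory
```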
- `csi` volumes offer additional storage options, whether self-hosted or cloud-provided, through CSI plugins.
- For instance, the Rook plugin uses Ceph (a distributed file system) as its underlying storage system.
- Many cloud vendors also provide plugins that integrate their block storage offering with K8s.
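A sketch of a statically provisioned `csi` volume; the `driver` name and `volumeHandle` are plugin-specific (the values below assume a Rook-Ceph RBD deployment and are illustrative only):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes: [ReadWriteOnce]
  csi:
    driver: rook-ceph.rbd.csi.ceph.com   # name registered by the CSI plugin
    volumeHandle: demo-image             # provider-specific volume identifier
```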
The "match" column indicates values that will be considered by `kube-controller-manager` when matching volumes and claims:

| attribute | match | usage |
| --- | --- | --- |
| `capacity` | Y | Storage capacity for the current volume |
| `accessModes` | Y | Available access modes for the current volume |
| `storageClassName` | Y | Storage class for the current volume |
| `nodeAffinity` | Y | Node affinity, mandatory for `local` volumes (see below) |
| `persistentVolumeReclaimPolicy` | | `Retain` (static volumes) or `Delete` (dynamic volumes) |
| `volumeMode` | | `Filesystem` or `Block` to mount as a block device |
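Putting those attributes together on one volume (values hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: matched-pv
spec:
  capacity:
    storage: 2Gi                          # matched against the claim's request
  accessModes:
    - ReadWriteOnce                       # matched against the claim's modes
  storageClassName: manual                # matched against the claim's class
  persistentVolumeReclaimPolicy: Retain   # applied after the claim goes away
  volumeMode: Filesystem                  # default; Block mounts a raw device
  nfs:
    server: 10.0.0.10
    path: /exports/data
```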
Indicates the permissions on a persistent volume once it is mounted:

| access mode | description |
| --- | --- |
| `ReadWriteOnce` | Read / write access for all pods on a specific node |
| `ReadOnlyMany` | Read access for all pods in the cluster |
| `ReadWriteMany` | Read / write access for all pods in the cluster |
| `ReadWriteOncePod` | Read / write access for a single specific pod |
Notes:
- The storage provider has to support the mode in which the volume is mounted.
- A volume can only be mounted in a single mode at a time, even if it supports multiple modes.
- K8s does not enforce write restrictions on mounted volumes, regardless of the mode.
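On the claim side, the requested mode goes in `accessModes`; a minimal sketch (note that `ReadWriteOncePod` is only supported by `csi` volumes):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-pvc
spec:
  accessModes:
    - ReadWriteOncePod   # exclusive access for one pod; csi volumes only
  resources:
    requests:
      storage: 1Gi
```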
- Worker nodes to which pods claiming a `local` volume may be scheduled have to be labeled to that effect.
- The volume's `nodeAffinity` will then be declared using a `NodeSelector` object.
- When querying the cluster for nodes that support a specific `local` volume, K8s will perform a logical OR (`||`) if multiple `NodeSelectorTerm` entries are present in the `NodeSelector` object (as opposed to label-selector based queries, which AND their requirements).
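For instance, a `nodeAffinity` fragment with two terms (hypothetical labels) matches nodes satisfying either term:

```yaml
# Fragment of a PersistentVolumeSpec.
nodeAffinity:
  required:
    nodeSelectorTerms:            # terms are ORed together
      - matchExpressions:         # expressions within one term are ANDed
          - key: kubernetes.io/hostname
            operator: In
            values: [worker-1]
      - matchExpressions:
          - key: disktype         # hypothetical custom node label
            operator: In
            values: [ssd]
```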