# Kubernetes storage

Manage data persistence in Kubernetes.

- K8s supports the CSI specification, which allows third parties to write "out-of-tree" plugins that provide external storage capabilities.
- For simplicity's sake, dynamic provisioning of volumes with `StorageClass` will be ignored and the relevant values set to defaults where needed.
- Though most use cases involve cloud-provided storage capabilities, this document will focus on self-managed storage systems instead.
- By construction, the file system exposed to a process by the container runtime is ephemeral and disappears once the container exits.
- K8s provides specific objects to support data persistence for workloads and decouple the container file system from its lifecycle:
  - `PersistentVolume` is a cluster resource: it exposes an underlying storage resource to the cluster as a directory.
  - `PersistentVolumeClaim` is a workload resource: a request by a container to mount a directory into its own file system.
- This enforces a clear separation of concerns:
  - Storage resource provisioning is done outside of K8s by an administrator or a cloud storage provider.
  - Storage resource consumption happens inside K8s when containers access volumes mounted in their file system.
- A `Pod` transparently consumes storage resources the same way it consumes `Node` resources for process execution (see the sketch below).
The design proposal for persistent storage states that "Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider."
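A minimal sketch of the two objects, assuming a directory on a node is used as the underlying storage (all names, paths, and sizes are illustrative):

```yaml
# Provisioning side: an administrator exposes storage to the cluster.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data             # illustrative directory on the node
---
# Consumption side: a workload requests storage from the cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  storageClassName: ""          # empty class: only bind statically provisioned volumes
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```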
- `kube-controller-manager` runs a controller that continually scans the desired cluster state for new `PersistentVolumeClaim` objects.
- When a new claim is added, the controller tries to find a suitable `PersistentVolume` object to bind the claim to.
- If no suitable volume exists, a new volume is dynamically provisioned using the claim's `StorageClass`, if available.
- Once a suitable volume is available, it is bound to the claim using a `ClaimRef` and mounted into the containers of the set of pods that initiated the claim, according to its `PodSpec` (see the sketch below).
- Once the set of pods disappears (workload container exit, etc.), the volume is subjected to its reclaim policy:
  - `Retain`: default for manually created volumes; the `PersistentVolume` still exists and un-provisioning has to be done manually.
  - `Delete`: default for dynamically created volumes; depends on CSI implementation support.
Note: the claim and the pods will remain `Pending` and the workload container won't start until a suitable volume becomes available.
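A sketch of the `PodSpec` side, referencing the illustrative claim created above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer            # illustrative name
spec:
  containers:
    - name: app
      image: nginx              # illustrative image
      volumeMounts:
        - name: data            # must match the volume entry below
          mountPath: /data      # where the directory appears in the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-demo     # the claim defined earlier
```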
Careful provisioning of storage resources and crafting of manifest files should make the following operations unnecessary in most cases; however, they remain available:

- Pre-bind claims to specific volumes (overrides `kube-controller-manager` matching of claims and volumes), as sketched below.
- Modify the size of a volume or a claim (requires CSI implementation support and write access).
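For the first operation, a claim can be pre-bound by naming the target volume directly; the controller then skips its own matching for this claim (a sketch, reusing the illustrative `pv-demo` volume):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-prebound
spec:
  volumeName: pv-demo           # pre-bind: target this specific volume
  storageClassName: ""
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```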
- Volume `type` is not a field. Instead, the `PersistentVolumeSpec` object includes a dedicated field for each supported storage provider.
- Cloud-provided storage requires plugins that implement the CSI standard, so the volumes using them will always be of type `csi`.
. -
The following
PersistentVolume
types are part of the core K8s API :type storage source provider hostPath
Mount a local directory on a single node local
Mount a cluster-wide highly available directory nfs
Mount a directory from an external nfs server csi
Mount a directory from an "out-of-tree" volume plugin
- If a `Deployment` has a claim to a `local` volume, `kube-scheduler` will place the pods according to the volume's `nodeAffinity`.
- Thus, it has to be carefully configured in order to ensure that the directories needed for mounts are actually available on the nodes.
- The `nodeAffinity` field is immutable once the volume has been written to the cluster state.
- An external static provisioner can be used to automate `local` volume provisioning and deletion.
- `local` volumes also require a storage class for accurate scheduling of pods (see the sketch below).
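A minimal sketch of a `local` volume and its storage class, assuming a disk is mounted at `/mnt/disks/ssd1` on a node named `node-1` (all names and paths are illustrative):

```yaml
# StorageClass with no provisioner: binding is delayed until a pod is
# scheduled, so the scheduler can account for the volume's node affinity.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1       # must exist on the matching node
  nodeAffinity:                 # mandatory for local volumes, immutable
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1
```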
- `nfs` volumes can be mounted from an NFS server running inside the cluster, as sketched below.
- The preferred approach to set up an NFS server is to use NFS kernel features (detailed walkthrough here).
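A minimal sketch of an `nfs` volume; the server address and export path are illustrative (e.g. the ClusterIP of an in-cluster NFS service):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany             # NFS supports concurrent writers across nodes
  nfs:
    server: 10.96.0.20          # illustrative NFS server address
    path: /exports/data         # illustrative export path
```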
- `csi` volumes offer additional options for storage, whether self-hosted or cloud-provided, through CSI plugins.
- For instance, the Rook plugin uses Ceph (a distributed file system) as its underlying storage system.
- Many cloud vendors also provide plugins that integrate their block storage service offering with K8s (a generic sketch follows).
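A sketch of a statically provisioned `csi` volume; the driver name and volume handle are hypothetical and depend entirely on the installed CSI plugin:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-csi
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: csi.example.com        # hypothetical driver name
    volumeHandle: vol-0123456789   # hypothetical ID of the pre-provisioned volume
    fsType: ext4
```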
The "match" column indicates values that will be considered by `kube-controller-manager` when matching volumes and claims (a combined sketch follows the table):

| attribute | match | usage |
|-----------|-------|-------|
| `capacity` | Y | Storage capacity for the current volume |
| `accessModes` | Y | Available access modes for the current volume |
| `storageClassName` | Y | Storage class for the current volume |
| `nodeAffinity` | Y | Node affinity, mandatory for `local` volumes (see below) |
| `persistentVolumeReclaimPolicy` | | `Retain` (default for static volumes) or `Delete` (default for dynamic volumes) |
| `volumeMode` | | `Filesystem` (default) or `Block` to mount as a block device |
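A sketch combining these attributes in a single `PersistentVolumeSpec` (all values are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-annotated
spec:
  capacity:
    storage: 10Gi                        # match: compared with the claim's request
  accessModes:
    - ReadWriteOnce                      # match: must include the claim's modes
  storageClassName: manual               # match: must equal the claim's class
  persistentVolumeReclaimPolicy: Retain  # behavior after release, not a matching criterion
  volumeMode: Filesystem                 # Filesystem (default) or Block
  hostPath:
    path: /mnt/data                      # illustrative storage source
```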
Indicates permissions on a persistent volume once it is mounted:

| access mode | description |
|-------------|-------------|
| `ReadWriteOnce` | Read / write access for all pods on a specific node |
| `ReadOnlyMany` | Read access for all pods in the cluster |
| `ReadWriteMany` | Read / write access for all pods in the cluster |
| `ReadWriteOncePod` | Read / write access for a single specific pod |

Notes:

- The storage provider has to support the mode in which the volume is mounted.
- A volume can only be mounted in a single mode at a time, even if it supports multiple modes (see the sketch below).
- K8s does not enforce write restrictions on mounted volumes, regardless of the mode.
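For instance, a claim selecting a single one of the modes a volume may advertise (a sketch, names illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-readonly
spec:
  storageClassName: ""
  accessModes:
    - ReadOnlyMany        # the volume will be mounted in this single mode
  resources:
    requests:
      storage: 5Gi
```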
- Worker nodes to which pods claiming a `local` volume will be scheduled have to be labeled to that effect.
- The volume's `nodeAffinity` will then be declared using a `NodeSelector` object.
- When querying the cluster for nodes that support a specific `local` volume, K8s will perform a logical OR (`||`) if multiple `NodeSelectorTerm` entries are present in the `NodeSelector` object (as opposed to label-selector-based queries), as sketched below.
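A sketch of the `nodeAffinity` stanza of a `PersistentVolume` spec with two `NodeSelectorTerm` entries: the volume is considered available on nodes matching either term (label keys and values are illustrative):

```yaml
nodeAffinity:
  required:
    nodeSelectorTerms:
      # Term 1: nodes explicitly labeled for this volume...
      - matchExpressions:
          - key: example.com/local-ssd    # hypothetical label key
            operator: In
            values:
              - "true"
      # ...OR term 2: a specific node selected by hostname.
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - node-2
```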