Kubernetes storage

Manage data persistence in Kubernetes


Table of contents

  1. Initial assessments
  2. Main concepts
  3. Persistent volume types
  4. Persistent volume attributes

Initial assessments

  • K8s supports the CSI (Container Storage Interface) specification, which allows third parties to write "out-of-tree" plugins that provide external storage capabilities.
  • For simplicity's sake, dynamic provisioning of volumes with StorageClass will be ignored and relevant values set to default if needed.
  • Though most use cases involve cloud-provided storage capabilities, this document will focus on self-managed storage systems instead.

Main concepts

Volumes and claims

  • By construction, the file system exposed to a process by the container runtime is ephemeral and disappears once the container exits.
  • K8s provides specific objects to support data persistence for workloads and decouple the container file system from its lifecycle :
    • PersistentVolume is a cluster resource : it exposes an underlying storage resource to the cluster as a directory.
    • PersistentVolumeClaim is a workload resource : a request by a container to mount a directory into its own file system.
  • This enforces a clear separation of concerns :
    • Storage resources provisioning is done outside of K8s by an administrator or a cloud storage provider.
    • Storage resources consumption happens inside K8s when containers access volumes mounted in their file system.
    • A Pod transparently consumes storage resources the same way it consumes Node resources for process execution (see the manifest sketch below).

The design proposal for persistent storage states that "Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider."
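
As an illustration of this separation, below is a minimal sketch of a manually provisioned PersistentVolume backed by an NFS export and a PersistentVolumeClaim requesting part of it. Names, sizes and the server address are placeholders, not values prescribed by this document :

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-example                  # provisioned by an administrator
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      nfs:                              # underlying storage resource (see "Persistent volume types")
        server: 192.168.1.10            # placeholder NFS server address
        path: /exports/data             # placeholder exported directory
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-example                 # requested on the workload side
    spec:
      storageClassName: ""              # empty class : bind to a statically provisioned volume
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi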

Volume lifecycle

  • kube-controller-manager runs a controller that continually scans the desired cluster state for new PersistentVolumeClaim objects.
  • When a new claim is added, the controller tries to find a suitable PersistentVolume object to bind the claim to.
  • If no suitable volume exists, a new volume is dynamically provisioned using the claim's StorageClass if available.
  • Once a suitable volume is available, it is bound to the claim through a claimRef and mounted into the containers of the pods that reference the claim, according to their PodSpec (see the sketch after the note below).
  • Once the claim is deleted (workload removed, etc), the released volume is subject to its reclaim policy :
    • Retain : default for manually created volumes, the PersistentVolume still exists and un-provisioning has to be done manually.
    • Delete : default for dynamically created volumes, subject to support by the CSI implementation.

Note : the claim and the pods will remain Pending and the workload container won't start until a suitable volume becomes available.
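
For reference, a claim is consumed from the PodSpec as sketched below : a volumes entry references the claim and a volumeMounts entry mounts it into the container file system. Names reuse the placeholder claim from the previous sketch :

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-example
    spec:
      containers:
        - name: app
          image: nginx                       # placeholder image
          volumeMounts:
            - name: data                     # must match the volume name declared below
              mountPath: /usr/share/nginx/html
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc-example           # the claim to bind and mount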

Additional operations on volumes

Careful provisioning of storage resources and crafting of manifest files should make the following operations unnecessary in most cases; however, they remain available :


Persistent volume types

Volume type

  • Volume type is not a field. Instead, the PersistentVolumeSpec object includes dedicated fields for all supported storage providers.

  • Cloud-provided storage requires plugins that implement the CSI standard, so the volumes using it will always be of type csi.

  • The following PersistentVolume types are part of the core K8s API :

    type      storage source / provider
    hostPath  Mount a local directory on a single node
    local     Mount a local storage device (disk, partition or directory) on a single node
    nfs       Mount a directory from an external NFS server
    csi       Mount a directory from an "out-of-tree" volume plugin
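
For example, a hostPath volume is declared by filling the hostPath field of the PersistentVolumeSpec. A minimal sketch, with placeholder name, size and directory :

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-hostpath-example
    spec:
      capacity:
        storage: 1Gi
      accessModes:
        - ReadWriteOnce
      hostPath:                    # dedicated field selecting the volume type
        path: /mnt/data            # placeholder directory on the node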

Local volumes

  • If a Deployment has a claim to a local volume, kube-scheduler will place the pods according to the volume's nodeAffinity.
  • The nodeAffinity thus has to be carefully configured to ensure that the directories needed for mounts are actually available on the selected nodes.
  • The nodeAffinity field is immutable once the volume has been written to the cluster state.
  • An external static provisioner can be used to automate local volumes provisioning and deletion.
  • local volumes also require a StorageClass with volumeBindingMode set to WaitForFirstConsumer, so that binding is delayed until a consuming pod is scheduled (see the sketch below).
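
A sketch of a local volume and its storage class, assuming a node labeled kubernetes.io/hostname=node-1 and a pre-created /mnt/disks/ssd1 directory (both placeholders) :

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-storage
    provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning for local volumes
    volumeBindingMode: WaitForFirstConsumer     # delay binding until a pod is scheduled
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-local-example
    spec:
      capacity:
        storage: 50Gi
      accessModes:
        - ReadWriteOnce
      storageClassName: local-storage
      local:
        path: /mnt/disks/ssd1                   # placeholder directory on the node
      nodeAffinity:                             # mandatory, immutable once written to the cluster state
        required:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - node-1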

CSI volumes

  • csi volumes offer additional storage options, whether self-hosted or cloud-provided, through CSI plugins.
  • For instance, the Rook plugin uses Ceph (distributed file system) as its underlying storage system.
  • Many cloud vendors also provide plugins that integrate their block storage offerings with K8s.
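
A csi volume references its plugin through the driver field. A minimal sketch, where the driver name and volumeHandle are placeholders for whatever the installed plugin documents :

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-csi-example
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      csi:
        driver: example.csi.vendor.io      # placeholder plugin name
        volumeHandle: volume-id-0001       # placeholder identifier on the storage backend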

Persistent volume attributes

  • The "match" column indicates values that will be considered by kube-controller-manager when matching volumes and claims :

    attribute                      match  usage
    capacity                       Y      Storage capacity for the current volume
    accessModes                    Y      Available access modes for the current volume
    storageClassName               Y      Storage class for the current volume
    nodeAffinity                   Y      Node affinity, mandatory for local volumes (see below)
    persistentVolumeReclaimPolicy  -      Retain (static volumes) or Delete (dynamic volumes)
    volumeMode                     -      Filesystem (default) or Block to mount as a raw block device
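
As an illustration of volumeMode, the sketch below claims a raw block volume and consumes it through volumeDevices instead of volumeMounts. Names and sizes are placeholders :

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-block-example
    spec:
      volumeMode: Block              # mount as a raw block device instead of a file system
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: app-block-example
    spec:
      containers:
        - name: app
          image: nginx               # placeholder image
          volumeDevices:             # block devices use volumeDevices, not volumeMounts
            - name: data
              devicePath: /dev/xvda  # device path exposed inside the container
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc-block-example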

Access modes

  • Indicates permissions on a persistent volume once it is mounted :

    access mode       description
    ReadWriteOnce     Read / write access for all pods on a specific node
    ReadOnlyMany      Read access for all pods in the cluster
    ReadWriteMany     Read / write access for all pods in the cluster
    ReadWriteOncePod  Read / write access for a single specific pod

Notes :

  • The storage provider has to support the mode in which the volume is mounted.
  • A volume can only be mounted in a single mode even if it supports multiple modes.
  • K8s does not enforce write restrictions on mounted volumes, regardless of the mode.

Node affinity

  • Worker nodes on which pods claiming a local volume may be scheduled have to be labeled accordingly.
  • The volume's nodeAffinity will then be declared using a NodeSelector object.
  • When querying the cluster for nodes that can host a specific local volume, K8s performs a logical OR (||) between the NodeSelectorTerm entries of the NodeSelector object (as opposed to label selector based queries).
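
A sketch of such a nodeAffinity stanza in a PersistentVolume spec : a node matching either term can host the volume, the label keys and values being placeholders :

    nodeAffinity:
      required:
        nodeSelectorTerms:                     # terms are ORed together
          - matchExpressions:                  # expressions inside a term are ANDed
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - zone-a
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - node-1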