Kubernetes storage

Manage data persistence in Kubernetes


Table of contents

  1. Initial assessments
  2. Main concepts
  3. Persistent volume types
  4. Persistent volume attributes

Initial assessments

  • K8s supports the CSI (Container Storage Interface) specification, which allows third parties to write "out-of-tree" plugins that provide external storage capabilities.
  • For simplicity's sake, dynamic provisioning of volumes with StorageClass will be ignored and relevant values set to default if needed.
  • Though most use cases involve cloud-provided storage capabilities, this document will focus on self-managed storage systems instead.

Main concepts

Volumes and claims

  • By construction, the file system exposed to a process by the container runtime is ephemeral and disappears once the container exits.
  • K8s provides specific objects to support data persistence for workloads and decouple the container file system from its lifecycle :
    • PersistentVolume is a cluster resource : it exposes an underlying storage resource to the cluster as a directory.
    • PersistentVolumeClaim is a workload resource : a request by a container to mount a directory into its own file system.
  • This enforces a clear separation of concerns :
    • Storage resources provisioning is done outside of K8s by an administrator or a cloud storage provider.
    • Storage resources consumption happens inside K8s when containers access volumes mounted in their file system.
    • A Pod transparently consumes storage resources the same way it consumes Node resources for process execution (see the manifest sketch below).

The design proposal for persistent storage states that "Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider."
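
As an illustration of this separation, below is a minimal sketch of a manually provisioned PersistentVolume backed by an NFS export and a PersistentVolumeClaim requesting part of it. Names, sizes and the server address are placeholders, not values prescribed by this document :

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-example                  # provisioned by an administrator
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      nfs:                              # underlying storage resource (see "Persistent volume types")
        server: 192.168.1.10            # placeholder NFS server address
        path: /exports/data             # placeholder exported directory
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-example                 # requested on the workload side
    spec:
      storageClassName: ""              # empty class : bind to a statically provisioned volume
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi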

Volume lifecycle

  • kube-controller-manager runs a controller that continually scans the desired cluster state for new PersistentVolumeClaim objects.
  • When a new claim is added, the controller tries to find a suitable PersistentVolume object to bind the claim to.
  • If no suitable volume exists, a new volume is dynamically provisioned using the claim's StorageClass if available.
  • Once a suitable volume is available, it is bound to the claim through a claimRef and mounted into the containers of the pods that reference the claim, according to their PodSpec (see the sketch after the note below).
  • Once the claim is deleted (workload removed, etc), the released volume is subject to its reclaim policy :
    • Retain : default for manually created volumes, the PersistentVolume still exists and un-provisioning has to be done manually.
    • Delete : default for dynamically created volumes, subject to support by the CSI implementation.

Note : the claim and the pods will remain Pending and the workload container won't start until a suitable volume becomes available.
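
For reference, a claim is consumed from the PodSpec as sketched below : a volumes entry references the claim and a volumeMounts entry mounts it into the container file system. Names reuse the placeholder claim from the previous sketch :

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-example
    spec:
      containers:
        - name: app
          image: nginx                       # placeholder image
          volumeMounts:
            - name: data                     # must match the volume name declared below
              mountPath: /usr/share/nginx/html
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc-example           # the claim to bind and mount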

Additional operations on volumes

Careful provisioning of storage resources and crafting of manifest files should make the following operations unnecessary in most cases; however, they remain available :


Persistent volume types

Volume type

  • Volume type is not a field. Instead, the PersistentVolumeSpec object includes dedicated fields for all supported storage providers.

  • Cloud-provided storage requires plugins that implement the CSI standard, so the volumes using it will always be of type csi.

  • The following PersistentVolume types are part of the core K8s API :

    type      storage source / provider
    hostPath  Mount a local directory on a single node
    local     Mount a local storage device (disk, partition or directory) on a single node
    nfs       Mount a directory from an external NFS server
    csi       Mount a directory from an "out-of-tree" volume plugin
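
For example, a hostPath volume is declared by filling the hostPath field of the PersistentVolumeSpec. A minimal sketch, with placeholder name, size and directory :

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-hostpath-example
    spec:
      capacity:
        storage: 1Gi
      accessModes:
        - ReadWriteOnce
      hostPath:                    # dedicated field selecting the volume type
        path: /mnt/data            # placeholder directory on the node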

Local volumes

  • If a Deployment has a claim to a local volume, kube-scheduler will place the pods according to the volume's nodeAffinity.
  • The nodeAffinity thus has to be carefully configured to ensure that the directories needed for mounts are actually available on the selected nodes.
  • The nodeAffinity field is immutable once the volume has been written to the cluster state.
  • An external static provisioner can be used to automate local volumes provisioning and deletion.
  • local volumes also require a StorageClass with volumeBindingMode set to WaitForFirstConsumer, so that binding is delayed until a consuming pod is scheduled (see the sketch below).
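
A sketch of a local volume and its storage class, assuming a node labeled kubernetes.io/hostname=node-1 and a pre-created /mnt/disks/ssd1 directory (both placeholders) :

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-storage
    provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning for local volumes
    volumeBindingMode: WaitForFirstConsumer     # delay binding until a pod is scheduled
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-local-example
    spec:
      capacity:
        storage: 50Gi
      accessModes:
        - ReadWriteOnce
      storageClassName: local-storage
      local:
        path: /mnt/disks/ssd1                   # placeholder directory on the node
      nodeAffinity:                             # mandatory, immutable once written to the cluster state
        required:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - node-1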

CSI volumes

  • csi volumes offer additional storage options, whether self-hosted or cloud-provided, through CSI plugins.
  • For instance, the Rook plugin uses Ceph (distributed file system) as its underlying storage system.
  • Many cloud vendors also provide plugins that integrate their block storage offerings with K8s.
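
A csi volume references its plugin through the driver field. A minimal sketch, where the driver name and volumeHandle are placeholders for whatever the installed plugin documents :

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-csi-example
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      csi:
        driver: example.csi.vendor.io      # placeholder plugin name
        volumeHandle: volume-id-0001       # placeholder identifier on the storage backend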

Persistent volume attributes

  • The "match" column indicates values that will be considered by kube-controller-manager when matching volumes and claims :

    attribute                      match  usage
    capacity                       Y      Storage capacity for the current volume
    accessModes                    Y      Available access modes for the current volume
    storageClassName               Y      Storage class for the current volume
    nodeAffinity                   Y      Node affinity, mandatory for local volumes (see below)
    persistentVolumeReclaimPolicy  -      Retain (static volumes) or Delete (dynamic volumes)
    volumeMode                     -      Filesystem (default) or Block to mount as a raw block device
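
As an illustration of volumeMode, the sketch below claims a raw block volume and consumes it through volumeDevices instead of volumeMounts. Names and sizes are placeholders :

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-block-example
    spec:
      volumeMode: Block              # mount as a raw block device instead of a file system
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: app-block-example
    spec:
      containers:
        - name: app
          image: nginx               # placeholder image
          volumeDevices:             # block devices use volumeDevices, not volumeMounts
            - name: data
              devicePath: /dev/xvda  # device path exposed inside the container
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc-block-example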

Access modes

  • Indicates permissions on a persistent volume once it is mounted :

    access mode       description
    ReadWriteOnce     Read / write access for all pods on a specific node
    ReadOnlyMany      Read access for all pods in the cluster
    ReadWriteMany     Read / write access for all pods in the cluster
    ReadWriteOncePod  Read / write access for a single specific pod

Notes :

  • The storage provider has to support the mode in which the volume is mounted.
  • A volume can only be mounted in a single mode even if it supports multiple modes.
  • K8s does not enforce write restrictions on mounted volumes, regardless of the mode.

Node affinity

  • Worker nodes on which pods claiming a local volume may be scheduled have to be labeled accordingly.
  • The volume's nodeAffinity will then be declared using a NodeSelector object.
  • When querying the cluster for nodes that can host a specific local volume, K8s performs a logical OR (||) between the NodeSelectorTerm entries of the NodeSelector object (as opposed to label selector based queries).
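
A sketch of such a nodeAffinity stanza in a PersistentVolume spec : a node matching either term can host the volume, the label keys and values being placeholders :

    nodeAffinity:
      required:
        nodeSelectorTerms:                     # terms are ORed together
          - matchExpressions:                  # expressions inside a term are ANDed
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - zone-a
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - node-1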