Kubernetes resource management

Fine-tune resource allocation for workloads



Table of contents

  1. Initial assumptions
  2. Resource types and allocation
  3. Node overheads
  4. Configure node Allocatable
  5. Key takeaways

Initial assumptions

  • For simplicity's sake, we will consider resource allocation in the following context:
    • Workloads will be running on a cluster of N nodes.
    • Every node has an identical CPU and memory capacity.
    • Every node has been bootstrapped using an identical provisioning procedure.
    • Every node runs the node components as well as the control plane components (the control plane is highly available).
  • This scenario makes the following recommendations applicable to single-node clusters as well.
  • Also, this document focuses on the one-container-per-pod model, and thus on resource allocation for containers rather than pods.
  • Lastly, we assume that CRI-O has been chosen as the container runtime for the cluster nodes.

Resource types and allocation

  • The main resources that may be allocated to a workload are host CPU and memory.

  • A host, be it a physical computer or a VM, has a limited available quantity of each.

  • In K8s terminology, these are called compute resources: measurable quantities that can be requested and consumed, as opposed to API resources such as Pods and Services.

  • K8s supports the declaration of requests and limits for the compute resources a container may use:

    Resource   Base unit   Request declaration   Limit declaration
    CPU        millicore   cpu: "100m"           cpu: "250m"
    Memory     byte        memory: "200Mi"       memory: "400Mi"
  • With respect to the node it is scheduled to, this container (see the manifest sketch after this list) would be able to consume:

    • From a guaranteed 10% up to a hard cap of 25% of the CPU cycles of one CPU core (or virtual core).
    • From a guaranteed 200 up to a hard cap of 400 mebibytes of memory.
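  • As an illustration, here is a minimal pod manifest carrying the declarations from the table above; the pod name, container name and image are placeholders:

# minimal pod manifest (sketch); names and image are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      requests:
        cpu: "100m"      # guaranteed share, used for scheduling
        memory: "200Mi"  # reserved on the node at scheduling time
      limits:
        cpu: "250m"      # hard cap, enforced by CPU throttling
        memory: "400Mi"  # hard cap, enforced by OOM kills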
  • Compute resource requests are used by kube-scheduler to make placement decisions:

    • kubelet computes the current node's Capacity and Allocatable using cAdvisor and system calls.
    • Allocatable is computed by subtracting the reserved resources from the node's Capacity (see below).
    • This keeps the cluster's view of each node's available compute resources regularly updated.
    • kube-scheduler will only schedule a pod to a node if the sum of its containers' resource requests and the requests of the pods already placed there does not exceed the node's Allocatable (see the one-liner after this list).
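  • For instance, Capacity and Allocatable can be read straight from the node object (the node name is a placeholder):

# print the Capacity and Allocatable maps from the node's status
kubectl get node <node-name> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'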
  • Compute resource limits are used by kubelet, CRI-O and the kernel to keep running containers in line (their translation into kernel cgroup settings is sketched after this list):

    • The kernel uses CPU throttling to preemptively restrict CPU access for containers that exceed their CPU limits.
    • The kernel uses OOM kills to reactively terminate containers that exceed their memory limits.
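
  • How limits surface at the kernel level depends on the cgroup version and driver; on a cgroup v2 node, the translation looks roughly as follows (paths abbreviated, values matching the example container above):

# cgroup v2, paths abbreviated; expected output shown in trailing comments
cat /sys/fs/cgroup/kubepods.slice/.../cpu.max      # -> "25000 100000" (a 250m limit = 25ms quota per 100ms period)
cat /sys/fs/cgroup/kubepods.slice/.../memory.max   # -> "419430400" (a 400Mi limit expressed in bytes)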

Notes:

  • If a compute resource request is omitted while the corresponding limit is set, the request defaults to the limit value.
  • Linux nodes also support a specific type of compute resource called huge pages.
  • However, huge pages only impact memory allocation at execution time and are only relevant in very advanced use cases (a declaration sketch follows).
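
  • For completeness, a sketch of a huge pages declaration (a resources fragment of a container spec, assuming 2 MiB pages are pre-allocated on the node); note that K8s requires huge page requests to equal their limits:

# resources fragment of a container spec; huge page requests must equal limits
resources:
  requests:
    memory: "200Mi"
    hugepages-2Mi: "100Mi"
  limits:
    memory: "200Mi"
    hugepages-2Mi: "100Mi"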

Node overheads

  • Planning optimal resource allocation for a candidate workload requires quantifying the existing overheads.

  • Multiple overheads are already present once a host is able to function as a node in a K8s cluster:

    Type                       Description                                                             Overhead assessment
    Operating system           Host is up and idle, all system daemons have started                    htop
    Node component services    kubelet.service and cri-o.service have started                          htop
    Node component pods        kube-proxy pod has started                                              kubectl
    Control plane components   All pods for the control plane components have started                  kubectl
    Installed add-ons          All pods for cluster add-ons (CoreDNS, pod network, etc.) have started  kubectl
  • The following command outputs compute resource requests and limits, priority class and node placement for every running pod in the cluster:

# read all running pods status across all namespaces, filter the output to extract
# resource allocation settings and current node, format for readability ...
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.status.phase == "Running") | [
    .metadata.name,
    (.spec.containers[].resources.requests.cpu    // "<nil>"),
    (.spec.containers[].resources.limits.cpu      // "<nil>"),
    (.spec.containers[].resources.requests.memory // "<nil>"),
    (.spec.containers[].resources.limits.memory   // "<nil>"),
    (.spec.priorityClassName // "<nil>"),
    (.spec.nodeName // "<nil>")
] | @tsv' | column -t -N name,cpu_requests,cpu_limits,memory_requests,memory_limits,priorityClassName,node
  • The command kubectl describe nodes <node-name> can also be used to report a node's current Capacity and Allocatable (CPU, memory and pods) from the cluster's point of view, as in the excerpt below.
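
  • Illustrative excerpt (values are examples only, they depend on the node's hardware and configuration):

# describe the node, then look for the Capacity and Allocatable blocks
kubectl describe nodes <node-name>
# ...
# Capacity:
#   cpu:     4
#   memory:  8148684Ki
#   pods:    110
# Allocatable:
#   cpu:     3800m
#   memory:  7322316Ki
#   pods:    110
# ...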

Configure node Allocatable

  • The detailed formula that kubelet uses to compute the current node's Allocatable is as follows:

    Allocatable = Capacity - system-reserved - kube-reserved - eviction-threshold

  • As a result, system-reserved, kube-reserved and eviction-threshold for a node should be explicitly set in the kubelet configuration when bootstrapping or joining a cluster (for instance, using the kubeadm config), as sketched below.
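
  • A minimal KubeletConfiguration sketch; the reservation values are illustrative assumptions to be tuned per node capacity, not recommendations:

# kubelet configuration fragment; values are illustrative assumptions
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:              # reserved for OS daemons (systemd, sshd, etc.)
  cpu: "500m"
  memory: "512Mi"
kubeReserved:                # reserved for node components (kubelet, cri-o, etc.)
  cpu: "500m"
  memory: "512Mi"
evictionHard:                # eviction threshold for the memory.available signal
  memory.available: "200Mi"

  • On a 4-core, 8 GiB node, these settings would yield Allocatable = 4000m - 500m - 500m = 3000m of CPU and 8192Mi - 512Mi - 512Mi - 200Mi = 6968Mi of memory.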

  • By doing so, kubelet will begin to proactively evict pods from the current node if their total compute resource consumption exceeds Allocatable, taking the pods' priority classes into account.

  • Conversely, K8s node components and OS daemons can exceed their reserved amounts without any interference from kubelet, since it only enforces Allocatable for pods, not the K8s or system resource reservations.

  • This ensures that K8s components and critical system processes take priority over workloads for compute resource consumption, which is necessary for the node to keep operating without failing.

  • It is possible to configure resource reservations by matching node capacity against the recommendations published by cloud providers, as in the example below.
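
  • As a back-of-the-envelope example, the tiered memory reservation that GKE documents (255 MiB, plus 25% of the first 4 GiB, plus 20% of the next 4 GiB, with decreasing percentages beyond) gives the following for an 8 GiB node; validate the tiers against your provider's current documentation:

# GKE-style tiered memory reservation for an 8 GiB node, in MiB
echo $(( 255 + (4096 * 25 / 100) + (4096 * 20 / 100) ))   # -> 2098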

  • See also this alternative guide.


Key takeaways

  • Misconfiguration of resource consumption for system components, K8s components and workloads may lead to:

    • kube-scheduler failing to schedule pods because of discrepancies between workload requests and node Allocatable.
    • kubelet needlessly evicting pods because of errors in the configuration of node Allocatable.
  • Accurate configuration of compute resource requests and limits for workloads is important:

    • It has to account for the compute resources that are available cluster-wide.
    • kube-scheduler can then make "sustainable" placement decisions based on workload requests and node Allocatable.
    • Specifying a limit greater than the request allows a workload short bursts of activity without impacting the node (provided resources are available).