Kubernetes resource management
Fine-tune resource allocation for workloads
- Initial assessments
- Resource types and allocation
- Node overheads
- Configure node `Allocatable`
- Key takeaways
- For simplicity's sake, we will consider resource allocation in the following context:
- Workloads will be running on a cluster of N nodes.
- Every node has an identical CPU and memory capacity.
- Every node has been bootstrapped using an identical provisioning procedure.
- Every node runs the node components as well as the control plane components (the control plane is highly available).
- This scenario makes the following recommendations applicable to single-node clusters as well.
- Also, this document will focus on the one-container-per-pod model, and thus on resource allocation for containers rather than pods.
- Lastly, it is assumed that `cri-o` has been chosen as the container runtime for the cluster nodes.
- The main resources that may be allocated to a workload are host CPU and memory.
- A host, be it a physical computer or a VM, has a limited available quantity of each.
- In K8s terminology, these are called compute resources: measurable quantities that can be requested and consumed, as opposed to API resources (objects such as pods and services).
- K8s supports the declaration of requests and limits for containers that use compute resources:

| resource | base unit | request declaration | limit declaration |
| -------- | --------- | ------------------- | ----------------- |
| CPU | millicore | `cpu: "100m"` | `cpu: "250m"` |
| Memory | byte | `memory: "200Mi"` | `memory: "400Mi"` |

- With respect to the node it is scheduled on, a container declared as above would be able to consume:
- Anywhere between 10% and 25% of the CPU cycles of one CPU core (or virtual core).
- Anywhere between 200 and 400 mebibytes of memory.
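As a sketch, the declarations above map onto a single-container pod spec as follows (the pod name, container name and image are illustrative):

```yaml
# Illustrative single-container pod declaring the requests and limits above.
apiVersion: v1
kind: Pod
metadata:
  name: demo-workload   # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: "100m"       # 10% of one core, used for scheduling
          memory: "200Mi"   # 200 mebibytes, used for scheduling
        limits:
          cpu: "250m"       # enforced through CPU throttling
          memory: "400Mi"   # exceeding this triggers an OOM kill
```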
- Compute resource requests are used by `kube-scheduler` to make placement decisions:
- `kubelet` computes the current node's `Capacity` and `Allocatable` using `cAdvisor` and system calls.
- `Allocatable` is computed by subtracting the reserved resources from the node's `Capacity` (see below).
- This allows for regular updates of the cluster's current state with the available compute resources for each node.
- `kube-scheduler` will schedule a pod to a node only if its containers' resource requests, together with the requests of the pods already placed there, do not exceed the node's `Allocatable`.
- Compute resource limits are used by `kubelet`, `cri-o` and the kernel to keep running containers in line:
- The kernel will use CPU throttling to preemptively restrict CPU access to containers that exceed their CPU limits.
- The kernel will use OOM kills to reactively terminate containers that exceed their memory limits.
Notes:
- If a compute resource request is omitted, it is initialized with the corresponding compute resource limit value.
- Linux nodes also support a specific type of compute resource called huge pages.
- However, huge pages only impact memory allocation at execution time and are only relevant in very advanced use cases.
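For completeness, a sketch of a pod consuming huge pages; this assumes the node has pre-allocated 2 MiB huge pages (names and sizes are illustrative, and for huge pages the request must equal the limit):

```yaml
# Illustrative pod backed by pre-allocated 2 MiB huge pages.
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo   # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25
      volumeMounts:
        - name: hugepage
          mountPath: /hugepages
      resources:
        requests:
          hugepages-2Mi: "100Mi"
          memory: "200Mi"
        limits:
          hugepages-2Mi: "100Mi"   # must match the request
          memory: "200Mi"
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages
```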
- Planning optimal resource allocation for a candidate workload requires quantifying the existing overheads.
- Multiple overheads are already present once a host is able to function as a node in a K8s cluster:

| type | description | overhead assessment |
| ---- | ----------- | ------------------- |
| Operating system | Host is up and idle, all system daemons have started | `htop` |
| Node components services | `kubelet.service` and `cri-o.service` have started | `htop` |
| Node components pods | `kube-proxy` pod has started | `kubectl` |
| Control plane components | All pods for the control plane components have started | `kubectl` |
| Installed add-ons | All pods for cluster add-ons (CoreDNS, pod network, etc.) have started | `kubectl` |

- The following command will output compute resource requests, limits and node placement for every pod running in the cluster:
```shell
# read all running pods' status across all namespaces, filter the output to extract
# resource allocation settings and current node, format for readability ...
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.status.phase == "Running") | [
  .metadata.name,
  (.spec.containers[].resources.requests.cpu // "<nil>"),
  (.spec.containers[].resources.limits.cpu // "<nil>"),
  (.spec.containers[].resources.requests.memory // "<nil>"),
  (.spec.containers[].resources.limits.memory // "<nil>"),
  (.spec.priorityClassName // "<nil>"),
  (.spec.nodeName // "<nil>")
] | @tsv' | column -t -N name,cpu_requests,cpu_limits,memory_requests,memory_limits,priorityClassName,node
```

- The command `kubectl describe nodes <node-name>` can also be used to report a node's current `Capacity` and `Allocatable` (CPU, memory and pods) from the cluster's point of view.
- The detailed formula that `kubelet` uses to compute the current node's `Allocatable` is as follows:

  `Allocatable = Capacity - system-reserved - kube-reserved - eviction-threshold`
- As a result, `system-reserved`, `kube-reserved` and `eviction-threshold` for a node should be explicitly set in the `kubelet` configuration when bootstrapping or joining a cluster (for instance, using the `kubeadm` config).
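A minimal sketch of the corresponding `kubelet` configuration (the reservation values below are illustrative and must be sized against the actual hosts):

```yaml
# Illustrative KubeletConfiguration fragment (e.g. embedded in a kubeadm config).
# On a node with 2 CPU / 8Gi capacity, this would yield roughly:
#   Allocatable(cpu)    = 2000m - 100m - 100m          = 1800m
#   Allocatable(memory) = 8192Mi - 256Mi - 256Mi - 100Mi = 7580Mi (~7.4Gi)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:        # reserved for OS daemons
  cpu: "100m"
  memory: "256Mi"
kubeReserved:          # reserved for K8s node components
  cpu: "100m"
  memory: "256Mi"
evictionHard:          # eviction threshold
  memory.available: "100Mi"
```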
- By doing so, `kubelet` will begin to proactively evict pods from the current node if their total compute resource consumption exceeds `Allocatable`, with respect to the pods' priority classes.
- Conversely, K8s node components and OS daemons can exceed their declared resource consumption without any interference from `kubelet`, since it only enforces `Allocatable` for pods, not the K8s or system resource reservations.
- This is a way to ensure that K8s components and critical system processes are prioritized over workloads for compute resource consumption, which is necessary for the node to continue operating without failing.
- It is possible to configure resource reservations by matching node capacity against the recommendations used by cloud providers.
- See also this alternative guide.
- Misconfiguration of resource consumption for system components, K8s components and workloads may lead to:
- `kube-scheduler` failing to schedule pods because of discrepancies between workload `requests` and node `Allocatable`.
- `kubelet` needlessly evicting pods because of errors in the configuration of node `Allocatable`.
- Accurate configuration of compute resource `requests` and `limits` for workloads is important:
- It has to account for the compute resources that are available cluster-wide.
- `kube-scheduler` will then make "sustainable" placement decisions depending on workload `requests` and node `Allocatable`.
- Specifying a greater `limit` than `request` for a workload will allow for short bursts of activity without impacting the node (provided resources are available).
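To illustrate the last point, a sketch of a burstable container spec, where the limit deliberately exceeds the request (names and values are illustrative):

```yaml
# Illustrative burstable pod: the scheduler reserves only 100m/200Mi,
# while short bursts up to 500m/400Mi are tolerated when the node has
# spare capacity. Requests < limits place the pod in the Burstable QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo   # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: "100m"
          memory: "200Mi"
        limits:
          cpu: "500m"
          memory: "400Mi"
```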