# Kubernetes resource management

Fine-tune resource allocation for workloads.

- Initial assessments
- Resource types and allocation
- Node overheads
- Configure node `Allocatable`
- Key takeaways
## Initial assessments

- For simplicity's sake, we will consider resource allocation in the following context:
  - Workloads will be running on a cluster of `N` nodes.
    - Every node has an identical CPU and memory capacity.
    - Every node has been bootstrapped using an identical provisioning procedure.
    - Every node runs the node components as well as the control plane components (the control plane is highly available).
- This scenario makes the following recommendations applicable to single-node clusters as well.
- Also, this document focuses on the one-container-per-pod model, and thus on resource allocation for containers rather than pods.
- Lastly, it is assumed that `cri-o` has been chosen as the container runtime for the cluster nodes.
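As a quick sanity check, the runtime actually in use on each node can be read from the cluster itself:

```bash
# list nodes with extended columns; on a cri-o cluster the
# CONTAINER-RUNTIME column reads e.g. cri-o://1.x.y
kubectl get nodes -o wide
```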
## Resource types and allocation

- The main resources that may be allocated to a workload are host CPU and memory.
- A host, be it a physical computer or a VM, has a limited available quantity of each.
- In K8s terminology, these are called compute resources: measurable quantities that can be requested and consumed, as opposed to API resources such as Pods or Services.
- K8s supports the declaration of requests and limits for containers that may use compute resources:

  | resource | base unit | request declaration | limit declaration |
  | --- | --- | --- | --- |
  | CPU | millicore | `cpu: "100m"` | `cpu: "250m"` |
  | Memory | byte | `memory: "200Mi"` | `memory: "400Mi"` |
- With respect to the node it is scheduled to, this container (see the example manifest below) would be able to consume:
  - Anywhere between 10% and 25% of the CPU cycles of one CPU core (or virtual core).
  - Anywhere between 200 and 400 mebibytes of memory.
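A minimal pod manifest carrying exactly these declarations could look as follows (pod name, container name and image are illustrative):

```bash
# create a single-container pod with the requests and limits from the table above
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "100m"
        memory: "200Mi"
      limits:
        cpu: "250m"
        memory: "400Mi"
EOF
```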
- Compute resource requests are used by `kube-scheduler` to make placement decisions:
  - `kubelet` computes the current node's `Capacity` and `Allocatable` using `cAdvisor` and system calls.
  - `Allocatable` is computed by subtracting the reserved resources from the node's `Capacity` (see below).
  - This allows for regular updates of the cluster's current state with the available compute resources of each node.
  - `kube-scheduler` will schedule a pod to a node only if the pod's containers' resource requests do not exceed the node's `Allocatable`, minus what is already requested by the pods scheduled there.
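When no node has enough remaining `Allocatable`, the pod stays in the `Pending` phase; the scheduler's verdict can be inspected through events:

```bash
# list scheduling failures cluster-wide; the messages typically
# include "Insufficient cpu" or "Insufficient memory"
kubectl get events --all-namespaces --field-selector reason=FailedScheduling
```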
- Compute resource limits are used by `kubelet`, `cri-o` and the kernel to keep running containers in line:
  - The kernel will use CPU throttling to preemptively restrict CPU access for containers that exceed their CPU limits.
  - The kernel will use OOM kills to reactively terminate containers that exceed their memory limits.
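A container terminated this way is restarted according to the pod's `restartPolicy`, and the kill is recorded in the container status:

```bash
# report why the (single) container of a pod was last terminated;
# prints "OOMKilled" when the kernel killed it for exceeding its memory limit
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```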
Notes:

- If a compute resource request is omitted, it is initialized with the corresponding compute resource limit value (see the demonstration below).
- Linux nodes also support a specific type of compute resource called huge pages. However, it only impacts memory allocation at execution time and is only relevant in very advanced use cases.
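The defaulting of omitted requests is easy to observe by creating a pod that only declares limits, then reading its spec back (names are illustrative):

```bash
# create a pod declaring limits only ...
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: defaulting-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      limits:
        cpu: "250m"
        memory: "400Mi"
EOF

# ... then read the resources back: requests have been set equal to the limits
kubectl get pod defaulting-demo -o jsonpath='{.spec.containers[0].resources}'
```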
## Node overheads

- Planning optimal resource allocation for a candidate workload requires quantifying the existing overheads.
- Multiple overheads are already present once a host is able to function as a node in a K8s cluster:

  | type | description | overhead assessment |
  | --- | --- | --- |
  | Operating system | Host is up and idle, all system daemons have started | `htop` |
  | Node components services | `kubelet.service` and `cri-o.service` have started | `htop` |
  | Node components pods | `kube-proxy` pod has started | `kubectl` |
  | Control plane components | All pods for the control plane components have started | `kubectl` |
  | Installed add-ons | All pods for cluster add-ons (`CoreDNS`, pod network, etc.) have started | `kubectl` |
- The following command will output compute resource requests, limits and node placement for every pod running in the cluster:

```bash
# read all running pods' status across all namespaces, filter the output to extract
# resource allocation settings and current node, format for readability
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.status.phase == "Running") | [
  .metadata.name,
  (.spec.containers[].resources.requests.cpu // "<nil>"),
  (.spec.containers[].resources.limits.cpu // "<nil>"),
  (.spec.containers[].resources.requests.memory // "<nil>"),
  (.spec.containers[].resources.limits.memory // "<nil>"),
  (.spec.priorityClassName // "<nil>"),
  (.spec.nodeName // "<nil>")
] | @tsv' | column -t -N name,cpu_requests,cpu_limits,memory_requests,memory_limits,priorityClassName,node
```
- The command `kubectl describe nodes <node-name>` can also be used to report a node's current `Capacity` and `Allocatable` (CPU, memory and pods) from the cluster's point of view.
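For a machine-readable variant, the same figures can be pulled straight from the node objects:

```bash
# print every node's Capacity and Allocatable as JSON maps
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity}{"\t"}{.status.allocatable}{"\n"}{end}'
```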
## Configure node `Allocatable`

- The detailed formula that `kubelet` uses to compute the current node's `Allocatable` is as follows:

  `Allocatable = Capacity - system-reserved - kube-reserved - eviction-threshold`
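As a worked example with illustrative figures: on a node with a `Capacity` of 2 CPU and 8 GiB of memory, reserving 100m CPU / 500Mi for the system, 100m CPU / 500Mi for K8s components, and setting a 100Mi hard eviction threshold leaves `Allocatable` at `2000m - 100m - 100m = 1800m` of CPU and `8192Mi - 500Mi - 500Mi - 100Mi = 7092Mi` of memory for pods (the eviction threshold only applies to memory).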
- As a result, `system-reserved`, `kube-reserved` and `eviction-threshold` for a node should be explicitly set in the `kubelet` configuration when bootstrapping or joining a cluster (for instance, using the `kubeadm` config), as in the sketch below.
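A minimal sketch of the corresponding `kubelet` configuration fragment, with the same illustrative reservation values as in the worked example above:

```bash
# write a KubeletConfiguration fragment (consumed e.g. through the kubeadm config);
# the reservation figures are illustrative and must be sized for the actual node
cat <<'EOF' > kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "100m"
  memory: "500Mi"
kubeReserved:
  cpu: "100m"
  memory: "500Mi"
evictionHard:
  memory.available: "100Mi"
EOF
```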
- By doing so, `kubelet` will begin to proactively evict pods from the current node if their total compute resource consumption exceeds `Allocatable`, with respect to the pods' priority classes.
- Conversely, K8s node components and OS daemons can exceed their declared resource consumption without any interference from `kubelet`, since it will only enforce `Allocatable` for pods, not for the K8s or system resource reservations.
- This is a way to ensure that K8s components and critical system processes are prioritized over workloads for compute resource consumption, which is necessary for the node to continue to operate without failing.
- It is possible to configure resource reservations by matching node capacity against the recommendations used by cloud providers.
- See also this alternative guide.
## Key takeaways

- Misconfiguration of resource consumption for system components, K8s components and workloads may lead to:
  - `kube-scheduler` failing to schedule pods because of discrepancies between workload `requests` and node `Allocatable`.
  - `kubelet` needlessly evicting pods because of errors in the configuration of node `Allocatable`.
- Accurate configuration of compute resource `requests` and `limits` for workloads is important:
  - It has to account for the compute resources that are available cluster-wide.
  - `kube-scheduler` will then make "sustainable" placement decisions depending on workload `requests` and node `Allocatable`.
  - Specifying a `limit` greater than the `request` for a workload will allow for short bursts of activity without impacting the node (provided resources are available), as sketched below.