# Kubernetes networking

## Nodes, pods and services networking concepts
- For simplicity's sake, only the "one-container-per-pod" model will be considered to introduce networking concepts.
- K8s cluster networking can be constrained to IPv4 addressing only; an IPv4-only cluster is assumed throughout this document.
- Cluster nodes are part of the cluster state and are written to `kube-apiserver` as `Node` objects.
- Nodes can be written to the cluster state automatically by `kubelet` or manually by running commands.
- After registration, `kubelet` will regularly update the node state, including its IP address, which will be used for networking.
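As an illustration, the registered `Node` objects and the addresses reported by `kubelet` can be inspected with `kubectl` (output columns depend on the cluster) :

```bash
# list registered Node objects along with their reported IP addresses
kubectl get nodes -o wide

# show only node names and their InternalIP addresses
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
```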
- All pods in a cluster share a single network (for instance `172.16.0.0/16`).
- As a result, the pod network CIDR for a cluster limits the total number of pods it can manage.
- Pods can communicate with all other pods in the cluster, regardless of placement.
- `kubelet` can communicate with all the pods that run on the current node.
- Pod networking capabilities are provided by a pod network add-on that has to be installed as part of the cluster.
- Multiple options are available to choose from. Some of them may include additional features (observability, security) or replace some core K8s components (ex. `kube-proxy`).
Notes :
- `container-runtime` providers as well as pod network add-on providers comply with the K8s CNI specification.
- A pod network add-on leverages backends like VXLAN to provide overlay networking capabilities to the cluster.
- It provides a node-level address range for pods by allocating a specific subnet of the pod network to each node.
- As a result, the balance between network prefix and host identifier has to be considered when defining the pod network CIDR (`/16` fits most use cases) - see the sketch below.
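As a sketch of how this plays out with a `kubeadm`-bootstrapped cluster (the CIDR value below is an example, not a recommendation) :

```bash
# bootstrap the control plane with an explicit pod network CIDR
kubeadm init --pod-network-cidr=172.16.0.0/16

# after a pod network add-on has been installed, each node is allocated
# a subnet of that CIDR - inspect the per-node allocations :
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
```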
- The `Service` object allows cluster-wide mapping of a set of pods to a single IP address and port.
- This pattern enables interdependent cluster workloads A and B to be loosely coupled.
  - Example : from workload A's point of view, individual pods from workload B are fungible, and vice versa.
- Internally, the `Service` object also handles the distribution of incoming traffic to the individual pods it exposes.
- The `Ingress` object handles the routing of incoming HTTP / HTTPS traffic to specific `Service` backends.
- The destination for incoming messages is resolved against a set of rules that are defined as part of its `spec`.
- Deploying `Ingress` to a K8s cluster requires the installation of an ingress controller (multiple options are available).

Note : for simplicity's sake, this document will focus on the vendor-specific features of the nginx ingress controller.
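For reference, one common way to install the nginx ingress controller is through its Helm chart (the release name and namespace below are arbitrary choices) :

```bash
# install the ingress-nginx controller from its official Helm repository
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace
```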
- A service can be exposed inside or outside the cluster in different ways depending on its `type` :

| service type | usage |
| --- | --- |
| `ClusterIP` | Default : the service virtual IP is only reachable from within the cluster |
| `NodePort` | Service is exposed on a node port, node IP used in lieu of virtual IP |
| `LoadBalancer` | `NodePort` + configuration of a cloud load balancer by `cloud-controller-manager` |
| `ExternalName` | Expose an external API as a cluster service by mapping its hostname to a DNS name |
- The attributes of the service are stored in a `ServiceSpec` object. The most important are :

| attribute | usage |
| --- | --- |
| `type` | Service type |
| `selector` | Service traffic will be distributed to pods whose `labels` match this selector |
| `ports` | Array of service / pod port mappings - the service exposes the specified ports |
| `sessionAffinity` | Configures session stickiness at the service level |
| `externalName` | The hostname that `CoreDNS` will return for `ExternalName` services |
| `externalTrafficPolicy` | How to distribute external traffic sent to public facing IPs (not applicable to `ClusterIP`) |
| `internalTrafficPolicy` | How to distribute cluster internal traffic sent to the `ClusterIP` |

- More details on session stickiness configuration.
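Tying these attributes together, a minimal sketch of a `ClusterIP` service (the service name, label and ports are hypothetical) :

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: backend                # hypothetical service name
spec:
  type: ClusterIP              # default type, shown for clarity
  selector:
    app: backend               # traffic is distributed to pods carrying this label
  ports:
  - name: http
    port: 80                   # port exposed on the service clusterIP
    targetPort: 8080           # port the selected pods actually listen on
  sessionAffinity: None        # set to ClientIP to enable session stickiness
EOF
```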
- When a new service is added to the cluster desired state, `kube-controller-manager` does the following :
  - Scan the cluster for running pods whose `labels` match the service's `selector`.
  - Distribute the matching pods to a set of `EndpointSlice` objects associated with the service.
  - Assign a single cluster-wide virtual IP address (`clusterIP`) to the service.
  - Update the current cluster state with the new information.

  Note : `kube-controller-manager` manages the `Service` / `EndpointSlice` mappings using labels and selectors.
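The resulting objects can be inspected directly; the service name `backend` reuses the hypothetical example above :

```bash
# list the EndpointSlice objects generated for the service - kube-controller-manager
# links them to the service through this well-known label
kubectl get endpointslices -l kubernetes.io/service-name=backend

# show the pod addresses currently referenced by those EndpointSlice objects
kubectl get endpointslices -l kubernetes.io/service-name=backend \
  -o jsonpath='{range .items[*]}{.endpoints[*].addresses}{"\n"}{end}'

# show the clusterIP assigned to the service
kubectl get service backend -o jsonpath='{.spec.clusterIP}'
```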
- Once the service is configured, `kube-proxy` does the following on each node :
  - Install local netfilter rules to intercept the incoming traffic for the service `clusterIP` and `ServicePort`.
  - The rules then route packets between the service `clusterIP` and the pod addresses in the selected `EndpointSlice`.
  - The default policy is to route traffic from services to pods regardless of the pods' placement.
  - Simplified example of auto-generated rules :
```bash
# create a chain for handling service traffic on the local node
iptables -t nat -N K8S_SERVICE
# map the service clusterIP (a single address) to the chain
iptables -t nat -A PREROUTING -d 172.17.0.1/32 -p tcp --dport 80 -j K8S_SERVICE
# add destination NAT rules to distribute incoming traffic evenly to the pods
# (pods use a different address range than services)
iptables -t nat -A K8S_SERVICE -m statistic --mode random --probability 0.33333 -j DNAT --to-destination 172.16.0.1:8080
iptables -t nat -A K8S_SERVICE -m statistic --mode random --probability 0.5 -j DNAT --to-destination 172.16.0.2:8080
iptables -t nat -A K8S_SERVICE -j DNAT --to-destination 172.16.0.3:8080
# add source NAT rules to masquerade return traffic from the pods
iptables -t nat -A POSTROUTING -s 172.16.0.1/32 -p tcp --sport 8080 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 172.16.0.2/32 -p tcp --sport 8080 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 172.16.0.3/32 -p tcp --sport 8080 -j MASQUERADE
```

- Lastly, `CoreDNS` enables service discovery for the new service by updating its DNS records.
- `CoreDNS` maintains records for services and pods by default.
- It creates A / AAAA records for FQDN names with the pattern : `<service>.<namespace>.svc.<cluster>.<domain>`.
- When a pod requests name resolution using the `service` name only, `CoreDNS` will match it against the following records :
  - `Service` records defined in the same namespace as the pod.
  - `Service` records defined in the `default` namespace.
- Conversely, using the service FQDN for name resolution allows the pod to resolve any service in any namespace.
- As a result, it is best to either always use FQDN for name resolution (a service with the same name could exist in the `default` namespace), or not to create services (or best, nothing at all) in the `default` namespace.
- When the name resolution succeeds, `CoreDNS` returns the `clusterIP` of the service.
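A quick way to verify this behaviour from inside the cluster (service `backend`, namespace `team-a` and the default `cluster.local` domain are assumptions) :

```bash
# from a throwaway pod in the (hypothetical) namespace team-a, resolve the
# (hypothetical) service backend by short name, then by FQDN
kubectl run dns-test -n team-a --rm -it --restart=Never --image=busybox:1.36 -- \
  sh -c 'nslookup backend && nslookup backend.team-a.svc.cluster.local'
```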
Notes :
- `kubelet` configures `/etc/resolv.conf` in each pod to point to the `kube-dns` service (`CoreDNS`).
- The cluster domain is configurable and should not collide with public domains from the internet.
- K8s enables service discovery through environment variables as well. However, DNS resolution should always be preferred.
- Since pods are fungible, name resolution for individual pods is pointless and shouldn't be relied on in the majority of use cases.
- `EndpointSlice` objects are only sets of pod endpoints and have no IP address of their own.
- See here for a detailed description of the `kube-proxy` auto-generated rules and a walkthrough on how to inspect them (a short example follows).
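A minimal sketch of that inspection on a node running `kube-proxy` in iptables mode (chain names beyond `KUBE-SERVICES` are hashed per service, so the one below is hypothetical) :

```bash
# list the top-level dispatch chain installed by kube-proxy
sudo iptables -t nat -L KUBE-SERVICES -n | head

# follow a per-service chain (KUBE-SVC-*) down to its per-endpoint chains
# (KUBE-SEP-*) to see the DNAT targets, i.e. the pod addresses
sudo iptables -t nat -L KUBE-SVC-EXAMPLEHASH -n   # hypothetical chain name
```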
- Consistent with its model, K8s decouples the implementation of reverse proxying features from their consumption :
  - The implementation is provided by an ingress controller that runs on top of the cluster.
  - Features are made available as cluster resources for consumption by `Ingress` objects, which are native K8s objects.
- `Ingress` main features are :
  - Make backends (usually `ClusterIP` services) accessible from outside the cluster.
  - Provide fanout capabilities : route incoming traffic to multiple backends according to specific rules.
  - Provide SSL / TLS termination or name-based virtual hosting capabilities if needed.
- Ingress controllers provide vendor-specific options that can be specified as annotations when creating `Ingress` objects (see the example below).
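A minimal fanout sketch, assuming hypothetical `ClusterIP` services `web` and `api`, a hypothetical host name, and the default `nginx` ingress class; the annotation is one example of a vendor-specific option :

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-fanout                           # hypothetical name
  annotations:
    # vendor-specific option understood by the nginx ingress controller
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  ingressClassName: nginx
  rules:
  - host: example.internal                       # hypothetical host
    http:
      paths:
      - path: /web
        pathType: Prefix
        backend:
          service:
            name: web                            # hypothetical ClusterIP service
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api                            # hypothetical ClusterIP service
            port:
              number: 80
EOF
```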
- For advanced use cases, ingress controllers can be further customized using the following pattern (sketched below) :
  - Run a customized instance of the ingress controller using a `Deployment`.
  - Create an `IngressClass` backed by the custom ingress controller through its `spec`.
  - Then, create `Ingress` objects whose `spec` will reference the custom `IngressClass` through `ingressClassName`.
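For completeness, a minimal sketch of that pattern; the `controller` value and all names are assumptions that would have to match the customized controller deployment :

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: custom-nginx                 # hypothetical class name
spec:
  # must match the controller value the customized instance was deployed with
  controller: example.com/custom-ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: advanced-example             # hypothetical name
spec:
  ingressClassName: custom-nginx     # reference the custom IngressClass
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web                # hypothetical ClusterIP service
            port:
              number: 80
EOF
```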
Notes :
- Ingresses are only responsible for routing incoming traffic to `ClusterIP` services and have to be exposed using `NodePort` services.
- The default `ingress-nginx` installation will automatically create an `IngressClass` (a cluster-scoped resource).
- That `IngressClass` is accessible cluster-wide and can be used to create `Ingress` objects that support the majority of use cases.
- K8s recommends against exposing non HTTP / HTTPS workloads through `Ingress` (use `NodePort` services instead).
- As a result, using the "advanced" pattern should be avoided as much as possible.
- If `Ingress` proves too limited, run the customized reverse proxy as a `DaemonSet` and leverage cluster service discovery features.
- If needed, configure the nginx `resolver` directive to point to the `kube-dns` service :
```nginx
# nginx.conf
http {
    resolver kube-dns.kube-system.svc.cluster.local valid=30s;
    # ...
}
```