Kubernetes networking

Nodes, pods and services networking concepts


Table of contents

  1. Initial assessments
  2. Main concepts
  3. Services
  4. Ingress

Initial assessments

  • For simplicity's sake, only the "one-container-per-pod" model will be considered to introduce networking concepts.
  • Cluster networking is assumed to be constrained to IPv4 addressing only (no dual-stack).

Main concepts

Node networking

  • Cluster nodes are part of the cluster state and are written to kube-apiserver as Node objects.
  • Nodes can be written to the cluster state automatically by kubelet or manually through the API (e.g. with kubectl).
  • After registration, kubelet regularly updates the node status, including its IP address, which is used for networking (see the commands below).
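
As a quick check, the commands below (a sketch ; the node name worker-1 is a placeholder) list the registered Node objects and the addresses reported by kubelet :

# list the Node objects written to the cluster state, with their IP addresses
kubectl get nodes -o wide

# inspect the addresses reported by kubelet for a given node
kubectl get node worker-1 -o jsonpath='{.status.addresses}'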

Pod networking

  • All pods in a cluster share a single network (for instance 172.16.0.0/16).
  • As a result, the pod network CIDR of a cluster limits the total number of pods it can run.
  • Pods can communicate with all other pods in the cluster, regardless of placement (see the commands below).
  • kubelet can communicate with all the pods that run on its node.
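
The commands below (a sketch, assuming kubectl access to the cluster) help visualize the single pod network : every pod gets an address from the pod network CIDR, whatever node it runs on :

# show the subnet of the pod network allocated to each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR

# show the IP address and placement of every pod in the cluster
kubectl get pods -A -o wide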

Pod networking implementation

  • Pod networking capabilities are provided by a pod network add-on that has to be installed as part of the cluster.
  • Multiple options are available to choose from. Some of them may include additional features (observability, security) or replace some core K8s components (ex. kube-proxy).

Notes :

  • Container runtime providers as well as pod network add-on providers comply with the CNI (Container Network Interface) specification.
  • A pod network add-on leverages backends like VXLAN to provide overlay networking capabilities to the cluster.
  • It provides a node-level address range for pods by allocating a specific subnet of the pod network to each node.
  • As a result, the balance between the network prefix and the host identifier has to be considered when defining the pod network CIDR (a /16 fits most use cases ; see the sizing sketch below).
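
As an illustration of that balance, the sketch below assumes a kubeadm-based cluster ; the CIDR values are examples taken from this document, not recommendations :

# a /16 pod network leaves room for 256 node subnets of /24 (~254 usable pod addresses per node)
kubeadm init --pod-network-cidr=172.16.0.0/16

# the per-node subnet size is set on kube-controller-manager (the IPv4 default is /24)
kube-controller-manager --cluster-cidr=172.16.0.0/16 --node-cidr-mask-size=24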

Services networking

  • The Service object allows cluster-wide mapping of a set of pods to a single IP address and port (see the minimal example below).
  • This pattern enables interdependent cluster workloads A and B to be loosely coupled.
  • Example : from workload A's point of view, individual pods from workload B are fungible and vice versa.
  • Internally, the Service object also handles the distribution of incoming traffic to the individual pods it exposes.
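
As a minimal illustration (a sketch ; the workload-b name, label and ports are hypothetical), the manifest below maps every pod labelled app: workload-b to a single virtual IP and port :

# declare a ClusterIP service (the default type) in front of workload B
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: workload-b
spec:
  selector:
    app: workload-b        # pods carrying this label become endpoints of the service
  ports:
    - port: 80             # port exposed on the service virtual IP
      targetPort: 8080     # port the pods actually listen on
EOF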

Ingress

  • The Ingress object handles the routing of incoming HTTP / HTTPS traffic to specific Service backends.
  • The destination for incoming messages is resolved against a set of rules that are defined as part of its spec.
  • Deploying Ingress to a K8s cluster requires the installation of an ingress controller (multiple options are available).

Note : for simplicity's sake, this document will focus on the vendor-specific features of the nginx ingress controller.


Services

Service declaration

  • A service can be exposed inside or outside the cluster in different ways depending on its type :

    service type   usage
    ClusterIP      Default : the service virtual IP is only reachable from within the cluster
    NodePort       The service is exposed on a port of every node ; the node IP is used in lieu of the virtual IP
    LoadBalancer   NodePort + configuration of a cloud load balancer by cloud-controller-manager
    ExternalName   Exposes an external API as a cluster service by mapping its hostname to a DNS name
  • The attributes of the service are stored in a ServiceSpec object. The most important are :

    attribute               usage
    type                    Service type
    selector                Service traffic is distributed to the pods whose labels match this selector
    ports                   Array of service / pod port mappings ; the service exposes the specified ports
    sessionAffinity         Configures session stickiness at the service level
    externalName            The hostname that CoreDNS returns for ExternalName services
    externalTrafficPolicy   How to distribute external traffic sent to public-facing IPs (not applicable to ClusterIP)
    internalTrafficPolicy   How to distribute cluster-internal traffic sent to the clusterIP
  • More details on session stickiness configuration.

  • When a new service is added to the cluster desired state, kube-controller-manager does the following :

    1. Scan the cluster for running pods whose labels match the service's selector.
    2. Distribute the matching pods to a set of EndpointSlice objects associated with the service.
    3. Assign a single cluster-wide virtual IP address (clusterIP) to the service.
    4. Update the current cluster state with the new information.

    Note : kube-controller-manager manages the Service / EndpointSlice mappings using labels and selectors.

  • Once the service is configured, kube-proxy does the following on each node :

    1. Install local netfilter rules to intercept incoming traffic for the service clusterIP and service port.
    2. The rules then route packets between the service clusterIP and the pod addresses in the selected EndpointSlice.
    3. The default policy is to route traffic from services to pods regardless of the pods' placement.
    4. Simplified example of auto-generated rules (the real kube-proxy chains can be inspected with the commands shown below) :
# create a chain for handling service traffic on the local node
iptables -t nat -N K8S_SERVICE

# map the service clusterIP to the chain
iptables -t nat -A PREROUTING -d 172.17.0.1/32 -p tcp --dport 80 -j K8S_SERVICE

# add destination NAT rules to distribute incoming traffic evenly across the pods (pods use a different address range)
iptables -t nat -A K8S_SERVICE -m statistic --mode random --probability 0.33333 -j DNAT --to-destination 172.16.0.1:8080
iptables -t nat -A K8S_SERVICE -m statistic --mode random --probability 0.5 -j DNAT --to-destination 172.16.0.2:8080
iptables -t nat -A K8S_SERVICE -j DNAT --to-destination 172.16.0.3:8080

# add source NAT rules so that service traffic forwarded to the pods is masqueraded
# (replies are then translated back automatically by connection tracking)
iptables -t nat -A POSTROUTING -d 172.16.0.1/32 -p tcp --dport 8080 -j MASQUERADE
iptables -t nat -A POSTROUTING -d 172.16.0.2/32 -p tcp --dport 8080 -j MASQUERADE
iptables -t nat -A POSTROUTING -d 172.16.0.3/32 -p tcp --dport 8080 -j MASQUERADE
  • Lastly, CoreDNS enables service discovery for the new service by updating its DNS records.
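
To check the result of this process on a live cluster, the following commands are a sketch (the workload-b service is the hypothetical example from above, and node access is assumed) ; note that the real kube-proxy chains are named KUBE-SERVICES / KUBE-SVC-* / KUBE-SEP-*, unlike the simplified example :

# show the clusterIP and ports assigned to the service
kubectl get service workload-b -o wide

# list the EndpointSlice objects that kube-controller-manager associated with the service
kubectl get endpointslices -l kubernetes.io/service-name=workload-b

# on a node, inspect the netfilter rules installed by kube-proxy (iptables mode)
sudo iptables -t nat -L KUBE-SERVICES -n | grep workload-b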

Service discovery

  • CoreDNS maintains records for services and pods by default.
  • It creates A/AAAA records for FQDNs with the pattern <service>.<namespace>.svc.<cluster-domain> (cluster.local by default).
  • When a pod requests name resolution using the service name only, the resolver appends the search domains that kubelet writes to the pod's /etc/resolv.conf, so the name is tried against :
    1. Service records defined in the same namespace as the pod (<service>.<namespace>.svc.<cluster-domain>).
    2. The remaining search domains (svc.<cluster-domain>, then <cluster-domain>), which normally match nothing.
  • Conversely, using <service>.<namespace> or the full FQDN allows the pod to resolve a service defined in any namespace.
  • As a result, it is best to use the namespaced name (or the FQDN) whenever a workload depends on a service in another namespace, and to avoid creating services (or, better, anything at all) in the default namespace.
  • When the name resolution succeeds, CoreDNS returns the ClusterIP of the service.
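
The resolution behaviour can be verified from a throwaway pod ; this is a sketch, and the busybox image, service and namespace names are assumptions :

# resolve a service by its short name (only matches services in the pod's own namespace)
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup workload-b

# resolve a service living in another namespace by using its namespaced name / FQDN
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup workload-b.other-namespace.svc.cluster.local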

Notes :

  • kubelet configures /etc/resolv.conf in each pod to point to the kube-dns service (CoreDNS).
  • The cluster domain is configurable and should not collide with public domains from the internet.
  • K8s also enables service discovery through environment variables. However, DNS resolution should always be preferred.
  • Since pods are fungible, name resolution for individual pods is rarely useful and shouldn't be relied on.
  • EndpointSlice objects only list pod endpoints ; they have no IP address of their own.
  • See here for a detailed description of kube-proxy auto-generated rules and a walkthrough on how to inspect them.
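
The kubelet-generated resolver configuration mentioned above can be inspected directly ; <pod-name> is a placeholder and the output shown in comments is only indicative :

# show the nameserver (the kube-dns clusterIP) and the search domains configured by kubelet
kubectl exec <pod-name> -- cat /etc/resolv.conf
# nameserver 10.96.0.10
# search <namespace>.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5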

Ingress

  • In line with its model, K8s decouples the implementation of reverse proxying features from their consumption :

    • The implementation of these features is provided by an ingress controller that runs on top of the cluster.
    • The features are made available as cluster resources for consumption by Ingress objects, which are native K8s objects.
  • The main features of Ingress are (see the minimal manifest after this list) :

    • Make backends (usually ClusterIP services) accessible from outside the cluster.
    • Provide fanout capabilities : route incoming traffic to multiple backends according to specific rules.
    • Provide SSL / TLS termination or name-based virtual hosting capabilities if needed.
  • Ingress controllers provide vendor-specific options that can be specified as annotations when creating Ingress objects.

  • For advanced use cases, ingress controllers can be further customized using the following pattern :

    1. Run a customized instance of the ingress controller using a Deployment.
    2. Create an IngressClass backed by the custom ingress controller through its spec.
    3. Then, create Ingress objects whose spec will reference the custom IngressClass through ingressClassName.
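
As a minimal illustration of the common case (a sketch ; the host, service name and port are hypothetical, and the nginx IngressClass from the default ingress-nginx installation is assumed), the manifest below routes HTTP traffic for one hostname to a ClusterIP backend :

# route HTTP traffic for app.example.com to the workload-b ClusterIP service
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: workload-b
spec:
  ingressClassName: nginx          # IngressClass created by the ingress-nginx installation
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: workload-b   # ClusterIP service used as backend
                port:
                  number: 80
EOF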

Notes :

  • Ingresses only route incoming traffic to ClusterIP services ; the ingress controller that implements them has to be exposed itself, typically with a NodePort service.
  • The default ingress-nginx installation automatically creates an IngressClass named nginx (IngressClass objects are cluster-scoped, not namespaced).
  • That IngressClass can therefore be used from any namespace to create Ingress objects that cover the majority of use cases.
  • K8s recommends against exposing non-HTTP / HTTPS workloads through Ingress (use NodePort services instead).
  • As a result, the "advanced" pattern above should be avoided as much as possible.
  • If Ingress proves too limited, run the customized reverse proxy as a DaemonSet and leverage the cluster's service discovery features.
  • If needed, configure the nginx resolver directive to point to the kube-dns service :
# nginx.conf
http {
  resolver kube-dns.kube-system.svc.cluster.local valid=30s;
  # ...
}