# Kubernetes networking

## Nodes, pods and services networking concepts

- For simplicity's sake, only the "one-container-per-pod" model will be considered to introduce networking concepts.
- Likewise, cluster networking is assumed to rely on IPv4 addressing only (a K8s-managed cluster can be constrained to IPv4-only operation).
- Cluster nodes are part of the cluster state and are written to `kube-apiserver` as `Node` objects.
- Nodes can be written to the cluster state automatically by `kubelet` or manually by running commands.
- After registration, `kubelet` will regularly update the node state, including its IP address, which will be used for networking.
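A quick way to check the registered nodes and the addresses reported by `kubelet` (a minimal sketch, the exact output columns vary with the K8s version) :

```sh
# list registered Node objects along with their internal / external IP addresses
kubectl get nodes -o wide
```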
- All pods in a cluster share a single network (for instance `172.16.0.0/16`).
- As a result, the pod network CIDR for a cluster limits the total number of pods it can manage.
- Pods can communicate with all other pods in the cluster, regardless of placement.
- `kubelet` can communicate with all the pods that run on the current node.
- Pod networking capabilities are provided by a pod network add-on that has to be installed as part of the cluster.
- Multiple options are available to choose from. Some of them may include additional features (observability, security) or replace some core K8s components (ex. `kube-proxy`).

Notes :
- `container-runtime` providers as well as pod network add-on providers comply with the K8s CNI specification.
- A pod network add-on leverages backends like VXLAN to provide overlay networking capabilities to the cluster.
- It provides a node-level address range for pods by allocating a specific subnet of the pod network to each node.
- As a result, the balance between network prefix and host identifier has to be considered when defining the pod network CIDR (`/16` fits most use cases).
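As an illustration, with a kubeadm-bootstrapped cluster (an assumption; other installers expose an equivalent setting), the pod network CIDR is chosen at init time and must match the configuration of the pod network add-on :

```sh
# set the cluster-wide pod network CIDR at bootstrap time
sudo kubeadm init --pod-network-cidr=172.16.0.0/16
```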
- The `Service` object allows cluster-wide mapping of a set of pods to a single IP address and port.
- This pattern enables interdependent cluster workloads A and B to be loosely coupled.
  - Example : from workload A's point of view, individual pods from workload B are fungible and vice versa.
- Internally, the `Service` object also handles the distribution of incoming traffic to the individual pods it exposes.
- The `Ingress` object handles the routing of incoming HTTP / HTTPS traffic to specific `Service` backends.
- The destination for incoming messages is resolved against a set of rules that are defined as part of its `spec`.
- Deploying `Ingress` to a K8s cluster requires the installation of an ingress controller (multiple options are available).

Note : for simplicity's sake, this document will focus on the vendor-specific features of the nginx ingress controller.
- A service can be exposed inside or outside the cluster in different ways depending on its `type` :

| service type | usage |
| --- | --- |
| `ClusterIP` | Default : the service virtual IP is only reachable from within the cluster |
| `NodePort` | Service is exposed on a node port, node IP used in lieu of virtual IP |
| `LoadBalancer` | `NodePort` + configuration of a cloud load balancer by `cloud-controller-manager` |
| `ExternalName` | Expose an external API as a cluster service by mapping its hostname to a DNS name |
- The attributes of the service are stored in a `ServiceSpec` object. The most important are :

| attribute | usage |
| --- | --- |
| `type` | Service type |
| `selector` | Service traffic will be distributed to pods whose `labels` match this selector |
| `ports` | Array of service / pod port mappings - the service exposes the specified ports |
| `sessionAffinity` | Configures session stickiness at the service level |
| `externalName` | The hostname that `CoreDNS` will return for `ExternalName` services |
| `externalTrafficPolicy` | How to distribute external traffic sent to public facing IPs (not applicable to `ClusterIP`) |
| `internalTrafficPolicy` | How to distribute cluster internal traffic sent to the `ClusterIP` |
- More details on session stickiness configuration.
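The full documentation of these attributes can be browsed from the CLI, for instance :

```sh
# print the documented fields of the Service spec
kubectl explain service.spec
# drill down into a single attribute, e.g. session stickiness
kubectl explain service.spec.sessionAffinity
```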
- When a new service is added to the cluster desired state, `kube-controller-manager` does the following (see the sketch below) :
  - Scan the cluster for running pods whose `labels` match the service's `selector`.
  - Distribute the matching pods to a set of `EndpointSlice` objects associated with the service.
  - Assign a single cluster-wide virtual IP address (`clusterIP`) to the service.
  - Update the current cluster state with the new information.

  Note : `kube-controller-manager` manages the `Service` / `EndpointSlice` mappings using labels and selectors.
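A minimal sketch of this flow from the CLI, assuming a pre-existing `Deployment` named `web` (a hypothetical name) :

```sh
# create a ClusterIP service selecting the pods of the "web" deployment
kubectl expose deployment web --port=80 --target-port=8080
# show the virtual IP (clusterIP) assigned to the service
kubectl get service web
# list the EndpointSlice objects generated for the service
kubectl get endpointslices -l kubernetes.io/service-name=web
```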
- Once the service is configured, `kube-proxy` does the following on each node :
  - Install local netfilter rules to intercept the incoming traffic for the service `clusterIP` and `ServicePort`.
  - The rules then route packets between the service `clusterIP` and the pod addresses in the selected `EndpointSlice`.
  - The default policy is to route traffic from services to pods regardless of the pods' placement.
  - Simplified example of auto-generated rules :
# create a chain for handling service traffic on the local node
iptables -t nat -N K8S_SERVICE
# map the service clusterIP (172.17.0.1) to the chain
iptables -t nat -A PREROUTING -d 172.17.0.1/32 -p tcp --dport 80 -j K8S_SERVICE
# add destination NAT rules to distribute incoming traffic evenly across the pods
# (pods use a different address range than services)
iptables -t nat -A K8S_SERVICE -m statistic --mode random --probability 0.33333 -j DNAT --to-destination 172.16.0.1:8080
iptables -t nat -A K8S_SERVICE -m statistic --mode random --probability 0.5 -j DNAT --to-destination 172.16.0.2:8080
iptables -t nat -A K8S_SERVICE -j DNAT --to-destination 172.16.0.3:8080
# add source NAT rules to masquerade return traffic from the pods
iptables -t nat -A POSTROUTING -s 172.16.0.1/32 -p tcp --sport 8080 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 172.16.0.2/32 -p tcp --sport 8080 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 172.16.0.3/32 -p tcp --sport 8080 -j MASQUERADE
- Lastly, `CoreDNS` enables service discovery for the new service by updating its DNS records.
- `CoreDNS` maintains records for services and pods by default.
  - It creates A/AAAA records for FQDN names with the pattern : `<service>.<namespace>.svc.<cluster>.<domain>`.
  - When a pod requests name resolution using the `service` name only, `CoreDNS` will match it against the following records :
    - `Service` records defined in the same namespace as the pod.
    - `Service` records defined in the `default` namespace.
  - Conversely, using the service FQDN for name resolution allows the pod to resolve any service in any namespace.
  - As a result, it is best to either always use FQDNs for name resolution (a service with the same name could exist in the `default` namespace), or not to create services (or better yet, nothing at all) in the `default` namespace.
  - When the name resolution succeeds, `CoreDNS` returns the `clusterIP` of the service (see the example below).
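Both behaviours can be checked from inside a pod (a minimal sketch : the pod, service and namespace names are assumptions, and the pod image must ship `nslookup`) :

```sh
# short name : resolved against the pod's own namespace first
kubectl exec -it my-pod -- nslookup my-service
# FQDN : resolves the service unambiguously, whatever the namespace
kubectl exec -it my-pod -- nslookup my-service.my-namespace.svc.cluster.local
```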
Notes :
- `kubelet` configures `/etc/resolv.conf` in each pod to point to the `kube-dns` service (`CoreDNS`).
- The cluster domain is configurable and should not collide with public domains from the internet.
- K8s enables service discovery through environment variables as well. However, DNS resolution should always be preferred.
- Since pods are fungible, name resolution for individual pods is of little use and shouldn't be relied on in the majority of use cases.
- `EndpointSlice` objects are only sets of pod endpoints and have no IP address of their own.
- See here for a detailed description of `kube-proxy` auto-generated rules and a walkthrough on how to inspect them (a minimal example follows).
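As a hedged sketch, on a node whose `kube-proxy` runs in iptables mode, the generated rules can be dumped and filtered on a service name (`my-service` is a hypothetical name) :

```sh
# dump the NAT table and keep only the rules generated for the service
sudo iptables-save -t nat | grep my-service
```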
- Consistent with its model, K8s decouples the implementation of reverse proxying features from their consumption :
  - The implementation is provided by an ingress controller that runs on top of the cluster.
  - The features are made available as cluster resources for consumption by `Ingress` objects, which are native K8s objects.
- `Ingress` main features are (see the example below) :
  - Make backends (usually `ClusterIP` services) accessible from outside the cluster.
  - Provide fanout capabilities : route incoming traffic to multiple backends according to specific rules.
  - Provide SSL / TLS termination or name-based virtual hosting capabilities if needed.
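A hedged sketch of a simple fanout `Ingress` (the hostname, service names and ports are assumptions, and an ingress controller must already be installed) :

```sh
# route two URL paths of the same host to two different ClusterIP services
kubectl create ingress demo --class=nginx \
  --rule="demo.example.com/app*=app-service:80" \
  --rule="demo.example.com/api*=api-service:8080"
```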
- Ingress controllers provide vendor-specific options that can be specified as annotations when creating `Ingress` objects.
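For instance, with the nginx ingress controller, a vendor-specific behaviour can be toggled through an annotation (shown here on the hypothetical `demo` Ingress created above) :

```sh
# disable the automatic HTTP -> HTTPS redirect handled by ingress-nginx
kubectl annotate ingress demo nginx.ingress.kubernetes.io/ssl-redirect="false"
```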
- For advanced use cases, ingress controllers can be further customized using the following pattern (a sketch follows this list) :
  - Run a customized instance of the ingress controller using a `Deployment`.
  - Create an `IngressClass` backed by the custom ingress controller through its `spec`.
  - Then, create `Ingress` objects whose `spec` will reference the custom `IngressClass` through `ingressClassName`.
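A hedged sketch of the last step, assuming an `IngressClass` named `custom-nginx` has already been created for the customized controller :

```sh
# bind the Ingress to the custom IngressClass through ingressClassName
kubectl create ingress internal-demo --class=custom-nginx \
  --rule="internal.example.com/app*=internal-service:8080"
```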
Notes :
- Ingresses are only responsible for routing incoming traffic to `ClusterIP` services; the ingress controller itself has to be exposed using `NodePort` services.
- The default `ingress-nginx` installation will automatically create an `IngressClass` (a cluster-scoped resource).
- That `IngressClass` is accessible cluster-wide and can be used to create `Ingress` objects that support the majority of use cases.
- K8s recommends against exposing non HTTP / HTTPS workloads through `Ingress` (use `NodePort` services instead).
- As a result, using the "advanced" pattern should be avoided as much as possible.
- If `Ingress` proves too limited, run the customized reverse proxy as a `DaemonSet` and leverage cluster service discovery features.
- If needed, configure the nginx `resolver` directive to point to the `kube-dns` service :
# nginx.conf
http {
    resolver kube-dns.kube-system.svc.cluster.local valid=30s;
    # ...
}