How can I improve the performance of cluster DNS?

Every region where DOKS clusters are located has multiple DigitalOcean DNS server nodes. These servers enforce rate limits on DNS queries per server, per source IP address.

DOKS uses CoreDNS for in-cluster DNS management. By default, every cluster runs two CoreDNS replicas to serve DNS requests. Pods send DNS queries to the CoreDNS service, and each query is translated to one of the CoreDNS endpoints. For public or out-of-cluster hostnames, CoreDNS relays requests to the upstream DigitalOcean DNS servers, and that upstream traffic is divided among those server nodes. Additionally, pods running in the host network use the upstream DigitalOcean DNS servers by default. If your DOKS cluster generates a large number of DNS queries, you may run into the per-server quotas and notice UDP packet drops.
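
To get a baseline, you can inspect the CoreDNS deployment and its pods. In DOKS, as in upstream Kubernetes, they run in the kube-system namespace; the k8s-app=kube-dns label used below is the upstream convention:

kubectl get deployment coredns -n kube-system
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide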

Prometheus metrics can provide insight into your cluster’s DNS traffic. Depending on the DNS profile, you can use the following strategies to tune the DNS performance of your cluster.
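
For example, if Prometheus scrapes CoreDNS’s metrics endpoint, queries along these lines (using CoreDNS’s standard metric names) show the overall request rate and tail latency:

# Total DNS query rate handled by CoreDNS over the last 5 minutes
sum(rate(coredns_dns_requests_total[5m]))

# 99th percentile DNS response time
histogram_quantile(0.99, sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m])))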

Enable DNS Caching

NodeLocal DNSCache enables you to run a DNS caching agent on every cluster node to cache DNS results. When a pod makes a DNS request, it first reaches out to the DNS caching agent on the same node. Doing so avoids DNAT rules and connection tracking, which reduces the average DNS lookup time and improves the cluster DNS resolution performance. If the record is not present, then the caching agent queries the CoreDNS service. For more information, see Using NodeLocal DNSCache in Kubernetes Clusters in the Kubernetes documentation.

To enable NodeLocal DNSCache, create a nodelocaldns.yaml manifest and specify your values, as described in the Configuration section of the Kubernetes documentation.
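
The upstream manifest ships with placeholder variables that must be filled in before you apply it. The following is a sketch of the substitution steps from the Kubernetes documentation, assuming kube-proxy runs in iptables mode and using 169.254.0.5 as an example <node-local-address>:

# Cluster DNS service IP and domain to substitute into the manifest
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP})
domain=cluster.local
localdns=169.254.0.5   # example <node-local-address>

# Fill in the upstream placeholders, then apply the manifest
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl create -f nodelocaldns.yaml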

Additionally, you need to customize the DNS settings of your workloads to use the <node-local-address> of the NodeLocal DNSCache. This is required because DOKS-specific iptables rules prevent the DNS cache instances from serving requests in the default NodeLocal DNSCache configuration.

Assuming a <node-local-address> of 169.254.0.5, a pod’s manifest looks similar to the following:

apiVersion: v1
kind: Pod
metadata:
  name: client
spec:
  containers:
    - name: client
      image: my-org/my-image:v1.2.3
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
       - 169.254.0.5
    searches: ["kube-system.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    options:
      - name: ndots
        value: "5"

The nameservers value must be set to the <node-local-address> configured in NodeLocal DNSCache. Additionally, the dnsPolicy value must be set to "None" to prevent the default CoreDNS name server from being merged in from the Kubernetes environment. Consequently, other default resolv.conf settings, such as searches and options, must also be explicitly defined, as shown above.

For more information on how to specify the dnsConfig field of the pod, see Pod’s DNS Config in the Kubernetes documentation.

Use Non-Shared Machine Type for Cluster

Machines with shared CPUs are prone to increased UDP latency. If your cluster has high DNS traffic, you can add a new node pool with a dedicated CPU machine type to the cluster, then delete the node pool that has the Basic nodes. All machine types except Basic nodes are dedicated CPU Droplets.
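
For example, with doctl (the pool name and the CPU-Optimized c-4 size below are placeholders; choose the dedicated CPU plan and node count that fit your workload):

# Add a node pool backed by dedicated CPU Droplets
doctl kubernetes cluster node-pool create <cluster-id> --name dedicated-pool --size c-4 --count 3

# Once workloads have rescheduled onto the new pool, remove the Basic pool
doctl kubernetes cluster node-pool delete <cluster-id> <basic-pool-id>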

Alternatively, you can select a dedicated CPU machine type when creating a new cluster if you expect the cluster to have high DNS traffic.

Scale Out DNS Traffic

Scaling out DNS traffic can improve DNS performance. You can manually increase the replica count of the CoreDNS service from the default value. For example, kubectl scale deployments.apps -n kube-system coredns --replicas=4 increases the number of replicas from the default of 2 to 4. Use kubectl describe deployment coredns -n kube-system to verify the new replica count.

Alternatively, you can autoscale the CoreDNS service as described in the Enable DNS horizontal autoscaling section of the Kubernetes documentation.
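
Once the autoscaler from that guide is deployed, it sizes the CoreDNS deployment from a linear formula stored in a ConfigMap. A sketch of tuning it (the parameter values are illustrative):

# Edit the autoscaler's scaling parameters
kubectl edit configmap dns-autoscaler -n kube-system

# Example linear policy in the ConfigMap's data section:
#   linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":2}'
# The autoscaler sets:
#   replicas = max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica), min)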

Reduce DNS Traffic

If possible, reduce the DNS load your client applications send to the cluster.
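
Two common techniques are using fully qualified domain names with a trailing dot (for example, api.example.com.), which bypasses search-domain expansion entirely, and lowering the ndots option so that external names are not first tried against every search domain. A minimal sketch of the latter (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: client
spec:
  containers:
    - name: client
      image: my-org/my-image:v1.2.3
  dnsConfig:
    options:
      # Resolve names containing at least one dot directly,
      # instead of expanding them through each search domain first
      - name: ndots
        value: "1"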
