In this blog post, I would like to share some research we have done recently. I was lucky enough to present it this May at the DNS-OARC Workshop in Madrid. DNS-OARC stands for DNS Operations, Analysis, and Research Center. It is a leading research organization in the DNS field, formed by individuals and companies from the DNS community. Last year, NS1 became a sponsor of DNS-OARC and a regular participant in it.
You will find a link to the recording of my talk at the bottom of this post.
The Role of TCP in DNS
TCP is still a second-class citizen in the DNS protocol. The standard defines DNS over both the UDP and TCP transport protocols, but the overwhelming majority of DNS traffic happens over UDP. From what we can observe, TCP accounts for only about 3% of normal DNS queries. One reason for this is that TCP support in DNS software was initially optional. Most likely, though, the primary reason to use UDP is that it is stateless and, therefore, cheaper for both clients and servers. Unlike TCP, UDP does not need to establish a connection. This means lower latency and a better user experience for the clients.
However, this situation is changing. First of all, TCP is required for zone transfers. Additionally, DNSSEC responses don't always fit in the UDP size limit. Last, but not least, the DNS community is working on standardizing DNS over TLS to address privacy issues. Although a number of resolvers still don't support TCP, TCP support in modern DNS software is essential.
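When a response does not fit in the UDP size limit, the server sets the TC (truncated) bit in the DNS header, and the client is expected to repeat the query over TCP. Here is a minimal sketch of that check in Python. The function name `needs_tcp_retry` and the hand-built sample headers are assumptions made for illustration, not code from our platform.

```python
import struct

TC_FLAG = 0x0200  # TC (truncated) bit within the 16-bit flags word of the DNS header
QR_FLAG = 0x8000  # QR bit: set on responses


def needs_tcp_retry(response: bytes) -> bool:
    """Return True if a DNS response received over UDP has the TC bit set."""
    if len(response) < 12:
        raise ValueError("a DNS message must contain a 12-byte header")
    # The header starts with the message ID followed by the flags word.
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


# Hypothetical sample headers (ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT).
truncated = struct.pack("!HHHHHH", 0x1234, QR_FLAG | TC_FLAG, 1, 0, 0, 0)
complete = struct.pack("!HHHHHH", 0x1234, QR_FLAG, 1, 1, 0, 0)
```

A resolver that sees `needs_tcp_retry(...)` return True would open a TCP connection to the same server and send the query again.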
Why Investigate DNS over TCP?
We are currently refactoring the part of our platform responsible for serving DNS to the clients. One of the goals of this project is to ensure excellent TCP support.
Little research has been devoted to DNS over TCP, so we have a limited grasp of when and how resolvers use it. We do know that some of the limitations of TCP can be mitigated. For instance, latency can be reduced if resolvers use TCP Fast Open and reuse established TCP connections. TCP also has benefits of its own, such as congestion control and a connection handshake that makes it hard to exploit in reflection attacks.
With this research, we wanted to learn whether these mechanisms are already in use and, more generally, to get a better understanding of resolver behavior, so that we can make our platform more resolver-compatible and more resilient to attacks.
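To get a feeling for how much connection reuse and TCP Fast Open can help, here is a rough back-of-the-envelope model in Python. It is a sketch under simplifying assumptions: queries are sent one after another, no packets are lost, server processing time is ignored, and the TCP Fast Open cookie has already been obtained. The transport labels are made up for this example.

```python
def round_trips(queries: int, transport: str) -> int:
    """Idealized number of network round trips needed for sequential queries."""
    if queries < 0:
        raise ValueError("queries must not be negative")
    if queries == 0:
        return 0
    if transport == "udp":
        return queries            # one query/response exchange each
    if transport == "tcp_new":
        return 2 * queries        # handshake plus exchange, for every query
    if transport == "tcp_reuse":
        return 1 + queries        # one handshake, then one exchange per query
    if transport == "tcp_fast_open":
        return queries            # the query travels in the SYN packet
    raise ValueError(f"unknown transport: {transport}")


for transport in ("udp", "tcp_new", "tcp_reuse", "tcp_fast_open"):
    print(transport, round_trips(10, transport))
```

For ten queries, this model gives 10 round trips over UDP, 20 over TCP with a new connection per query, 11 over a reused TCP connection, and 10 with TCP Fast Open.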
Data Source for the Research
Initially, we collected samples of traffic from our Managed DNS platform. However, the amount of TCP traffic was not sufficient to draw any meaningful conclusions. Fortunately, we have access to a much larger data set through our association with DNS-OARC.
One of the efforts organized by DNS-OARC is DITL (Day in The Life of the Internet). Organizations participating in DITL voluntarily share full packet captures of DNS traffic from their servers for a period of 48 hours. This data is then available for DNS-OARC members to analyze.
I ran all my experiments on a sample of the DITL data collected on April 5-7, 2016. This was the most recent collection available at the time of the analysis. I tried to make the sample diverse by picking DNS servers with different purposes (root servers, TLD servers, a DNS server run by an RIR, and also one AS112 server), different geographic locations, and a mix of servers with traffic volume on both the high end and the low end. My sample contained about 67 million DNS queries in about 85 million TCP sessions.
Conclusions from the Research
I'm grateful to the DNS-OARC community for sharing the data and allowing me to do the research. This effort answered some of my questions and also created a lot of new ones. Here are some of the questions I was able to answer:
Is TCP used only as a fallback protocol? Mostly, yes, with the exception of Google Public DNS. 80.70% of all DNS queries in the sample were sent within a TCP session that was only used for a single query. Of the remaining TCP sessions, those used to answer at least two queries, 99.96% were initiated by resolvers of Google Public DNS.
Do the clients reuse existing TCP sessions efficiently? Yes, sometimes. However, we are only talking about Google Public DNS. In the sample, I've seen TCP sessions kept open for minutes where the mean time between queries was about 10-20 seconds. It's difficult to say how long the longest TCP sessions were, because the data sample is split into files covering 5-minute intervals and my analysis couldn't pair connections across these files.
What is the resolvers' policy on keeping the connection open? I don't know, but I believe the answer will be very interesting. For instance, I found two E-root servers receiving a similar amount of TCP traffic. One server was located in Atlanta, Georgia and the other one in New York, NY. In Atlanta, the TCP sessions were often used to send up to 12 queries per connection, while in New York, every TCP session was used to send just a single query and was then closed.
Can TCP perform better than UDP for DNS traffic? Likely, when there are a lot of DNS queries to deliver to a single DNS server. I haven't investigated the use of TCP Congestion Control in the sample. However, I have seen packet retransmits. With TCP, a lost packet is retransmitted by the network stack, whereas with UDP the client has to repeat the query and the server has to compute the answer again. This could save the servers some CPU time if they do heavy computation for each answer (e.g. DNSSEC online signing). I have also seen clients send multiple DNS queries within a single packet. This is usually an effect of Nagle's Algorithm, which can buffer multiple small messages and send them at once to reduce the overhead of packet headers and effectively increase network throughput.
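Several queries can share one packet because, over TCP, every DNS message is preceded by a two-byte length field. The following Python sketch shows how a reassembled TCP payload could be split back into individual DNS messages. The function name and the placeholder message bodies are assumptions for illustration; this is not the tooling I used for the analysis.

```python
import struct
from typing import List, Tuple


def split_dns_messages(stream: bytes) -> Tuple[List[bytes], bytes]:
    """Split a TCP byte stream into DNS messages.

    Returns the complete messages and any leftover bytes that belong
    to a message whose remainder has not arrived yet.
    """
    messages = []
    offset = 0
    while offset + 2 <= len(stream):
        # Each message is prefixed with its length as a 16-bit big-endian integer.
        (length,) = struct.unpack_from("!H", stream, offset)
        end = offset + 2 + length
        if end > len(stream):
            break  # incomplete message, wait for more data
        messages.append(stream[offset + 2:end])
        offset = end
    return messages, stream[offset:]


def frame(message: bytes) -> bytes:
    """Prefix a DNS message with its two-byte length."""
    return struct.pack("!H", len(message)) + message


# Placeholder bodies stand in for real DNS queries.
payload = frame(b"query-one") + frame(b"query-two") + frame(b"query-three")[:5]
messages, leftover = split_dns_messages(payload)
```

Here `messages` holds the two complete placeholder queries, and `leftover` holds the first five bytes of the third one, which would be completed by the next packet. Counting the messages per session this way is also one way to tell single-query sessions from reused ones.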
Some of the new questions that arose from this analysis:
Why are about 28% of all TCP sessions closed without sending a single query?
Are there network failure conditions where TCP would perform better?
What is the retransmit ratio of TCP compared to UDP?
Are the queries sent in a single TCP session grouped if they are related?
I already have the answer for the first question: Why are about 28% of all TCP sessions closed without sending a single query? Initially, I was quite surprised to see such a high number of sessions without a single query. During the discussion after the talk, I realized that I hadn't checked if the TCP sessions were fully established in my analysis. If the TCP handshakes were incomplete, that could add to the number of TCP sessions closed without sending a query. Incomplete connections can be the result of network transmission errors, an invalid network configuration, or can be caused deliberately by various monitoring tools.
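A check for fully established sessions could look roughly like the Python sketch below. It assumes the packets of one session have already been extracted from the capture as (from_client, flags) pairs in the order they were seen, which is a hypothetical representation chosen for this example. It also ignores sequence numbers, so it is a simplification and not a full TCP state machine.

```python
from typing import Iterable, Tuple

# TCP flag bits as defined in the TCP header.
SYN, RST, ACK = 0x02, 0x04, 0x10


def handshake_completed(packets: Iterable[Tuple[bool, int]]) -> bool:
    """Return True if SYN, SYN-ACK, and the final ACK were all observed in order."""
    state = 0  # 0: nothing seen, 1: client SYN seen, 2: server SYN-ACK seen
    for from_client, flags in packets:
        if state == 0 and from_client and flags & SYN and not flags & ACK:
            state = 1
        elif state == 1 and not from_client and flags & SYN and flags & ACK:
            state = 2
        elif (state == 2 and from_client and flags & ACK
              and not flags & SYN and not flags & RST):
            return True
    return False


established = [(True, SYN), (False, SYN | ACK), (True, ACK)]
syn_only = [(True, SYN), (True, SYN)]          # SYN retransmitted, never answered
reset = [(True, SYN), (False, SYN | ACK), (True, RST)]
```

Sessions for which `handshake_completed` returns False could then be reported separately from sessions that were established but carried no query.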
I encourage everyone interested to validate my results and follow up with your own research. I am happy to hear your feedback and to hear what you discover in your research. Drop me an email at firstname.lastname@example.org.