Abstract
This report is a follow-up to our previous research on passive analysis of round-trip time (RTT) for DNS transactions. The new method described herein uses randomly sent truncated DNS replies to increase the volume of DNS-over-TCP traffic available for estimating RTT. The method was experimentally deployed on a production DNS server for the .CZ ccTLD. Quantitative results are compared to normal traffic in terms of TCP traffic volume, RTT coverage, as well as empirical distributions and mean values of RTT.
Introduction
Network latency is an important parameter that influences the end-user experience for most interactive Internet applications, notably web browsing. In virtually all cases, a share of the overall latency budget can be attributed to DNS. Browser vendors have invested a lot of effort into minimizing latency, and DNS latency in particular, on the client side [2].
Operators of top-level DNS domains (TLD), such as CZ.NIC, may improve the latency for their clients – DNS resolvers – either by tuning BGP parameters that control anycast routing, or by deploying new DNS servers in regions with suboptimal latency. The latter is often not a trivial undertaking, in terms of both expenses and logistics. Therefore, it has to be carefully planned based on the assessment of actual latency.
In 2019 we presented a method for passive analysis of DNS round-trip time (RTT) [1] based on estimating RTT from the TCP handshake between a DNS client and server. The advantage of this approach is that RTT estimates can be obtained directly from the existing DNS traffic. On the other hand, a potential drawback is that the prevalent transport protocol for DNS is UDP, and TCP is used only for a small fraction of DNS traffic. Moreover, this “natural” TCP traffic may be biased in the sense that the originating resolvers are not a representative sample of all resolvers in terms of RTT. One can expect, for example, to observe TCP connections mostly from DNSSEC-validating resolvers, and worldwide penetration of DNSSEC is still far from homogeneous [4].
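The handshake-based estimator from [1] boils down to timing two packets on the server side: the interval between the server sending SYN+ACK and receiving the client's final ACK approximates one round trip. A minimal illustration (our own sketch, not the authors' actual tooling):

```python
def handshake_rtt(t_synack_sent: float, t_ack_received: float) -> float:
    """Server-side RTT estimate from the TCP three-way handshake:
    the time between sending SYN+ACK and receiving the client's ACK
    approximates one round trip to the resolver (timestamps in seconds)."""
    return t_ack_received - t_synack_sent

# Example: SYN+ACK sent at t=0.000 s, final ACK observed at t=0.032 s
rtt = handshake_rtt(0.000, 0.032)  # 0.032 s, i.e. a 32 ms round trip
```

In practice the timestamps would come from packet captures on the authoritative server; the point is that no active probing of the resolver is needed.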
One possibility for convincing more resolvers to send DNS queries over TCP is provided by the DNS protocol itself: it is the TC (TrunCation) bit in the DNS header, which is normally used by a DNS server for indicating to the resolver that the response exceeds the maximum size permitted for UDP, and is therefore truncated. An idea of artificially increasing the volume of DNS-over-TCP traffic by replying with random TC responses to a small fraction of DNS queries was proposed as a logical continuation of our study [1], and we first shared it with the CENTR community members interested in doing TCP-based RTT analysis. The same approach was also proposed later by a group of DNS researchers who conducted similar passive RTT analysis for the ccTLD of The Netherlands [3].
Knot DNS has the optional noudp module that was originally intended as a simpler alternative to the RRL mechanism [5], which is supported by all major DNS server implementations. We used the noudp module for replying with a TC response to a preset (small) fraction of received UDP queries. To make this possible, we added, in cooperation with the Knot DNS developers, the udp-truncate-rate configuration parameter. It allows us to set the desired rate of artificial TC responses: for example, udp-truncate-rate: N means that every Nth UDP response is truncated.
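A configuration along the following lines enables this behaviour in Knot DNS; this is a sketch based on Knot's usual module-configuration pattern, and exact option placement may differ between Knot versions:

```yaml
# Truncate every 1000th UDP response, forcing a retry over TCP
mod-noudp:
  - id: tc-sample
    udp-truncate-rate: 1000

# Apply the module to all zones served by this instance
template:
  - id: default
    global-module: mod-noudp/tc-sample
```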
The work described in this report had the following objectives:
1. Test and evaluate the activation of the noudp module on a production TLD DNS server with different settings of the udp-truncate-rate parameter.
2. Analyse the expected increase in TCP traffic, but also RTT coverage – a measure indicating how big a part of DNS traffic originated from DNS resolvers with known RTT estimates.
3. Determine whether RTT estimates obtained with the aid of the noudp module differ significantly from those obtained from natural DNS-over-TCP traffic.
We performed experiments with the noudp module on a minor node running Knot DNS and serving the .CZ domain (located in Stockholm, Sweden). Due to the configuration of anycast routing, this DNS server receives queries only from a subset of the global Internet as shown in Figure 1.
The tests were spread over three consecutive 24-hour time slots in March 2021, each slot with a different setting of the truncate rate (TR):
1. Time slot #1: udp-truncate-rate: 1000
2. Time slot #2: udp-truncate-rate: 100
3. Time slot #3: noudp module off
Hence, in the time slot #1 the noudp module was configured to send a TC response for one out of every 1000 UDP queries, then with a higher rate of 1/100 in the time slot #2 and, finally, the noudp module was turned off in the time slot #3.
As we expected, the activation of the noudp module caused an increase in the number of incoming TCP queries. Figure 2 shows queries per second (QPS) over TCP and UDP during the entire testing period. The overall traffic intensity in the time slot #3 was lower because this slot covers Friday and Saturday.
From the comparison of TCP traffic rates in the time slots #1 and #3 we can estimate that the natural TCP traffic still accounts for more than one third of the TCP traffic observed for udp-truncate-rate: 1000. In contrast, the TCP traffic rate in the time slot #2 (udp-truncate-rate: 100) is almost exactly 1% of UDP, so in this case the vast majority of DNS-over-TCP queries are a consequence of artificial TC responses.
In the time slots #1 and #2 we also observed a significant increase in the number of distinct resolvers that connected to our authoritative server using TCP. Figure 3 shows the number of distinct TCP and UDP sources per hour during the whole testing period.
Similarly, the noudp module helped us increase the number of autonomous systems and countries for which we registered TCP connections. The quantitative results are summarized in Table 1:
Truncate rate | UDP QPS | UDP resolvers (24h) | UDP networks (24h) | UDP countries (24h) | TCP QPS | TCP resolvers (24h) | TCP networks (24h) | TCP countries (24h) | TCP share: QPS | TCP share: resolvers | TCP share: networks | TCP share: countries
---|---|---|---|---|---|---|---|---|---|---|---|---
off | 171.37 | 47284 | 3102 | 117 | 0.20 | 1000 | 393 | 59 | 0.12% | 2.11% | 12.67% | 50.43%
1000 | 235.33 | 55548 | 3269 | 121 | 0.45 | 4578 | 859 | 75 | 0.19% | 8.24% | 26.28% | 61.98%
100 | 220.60 | 53341 | 3253 | 119 | 2.22 | 12180 | 1766 | 99 | 1.01% | 22.83% | 54.29% | 83.19%
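The "TCP share" columns in Table 1 are simply the TCP figures expressed as a percentage of the corresponding UDP figures. A trivial sketch (the function name is ours):

```python
def tcp_share(tcp: float, udp: float) -> float:
    """TCP count as a percentage of the corresponding UDP count."""
    return 100.0 * tcp / udp

# Reproduce the first row of Table 1 (noudp off):
print(round(tcp_share(0.20, 171.37), 2))   # QPS share      -> 0.12
print(round(tcp_share(1000, 47284), 2))    # resolver share -> 2.11
```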
Even with the more aggressive setting (udp-truncate-rate: 100) in the time slot #2, we were able to observe TCP connections from only 23.87% of resolvers. This can be explained by the fact that there are many low-traffic resolvers whose queries are never selected for a TC response.
In order to get a better picture, we introduced RTT coverage, which takes into consideration the number of queries sent by a resolver. This way, we could evaluate how big a part of DNS traffic originated from DNS resolvers with known RTT. Specifically, we divided DNS traffic into two groups:
1. queries from resolvers for which at least one TCP connection – and thus an RTT estimate – was observed,
2. queries from resolvers with no observed TCP connection and therefore unknown RTT.
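The RTT coverage metric can be sketched in a few lines, assuming we have per-resolver query counts and the set of resolvers seen over TCP (all names here are illustrative):

```python
def rtt_coverage(queries_per_resolver: dict, resolvers_with_rtt: set) -> float:
    """Percentage of total query volume coming from resolvers whose RTT
    is known, i.e. that opened at least one TCP connection."""
    total = sum(queries_per_resolver.values())
    known = sum(q for r, q in queries_per_resolver.items()
                if r in resolvers_with_rtt)
    return 100.0 * known / total

# Hypothetical counts: two heavy resolvers with known RTT, one light without
qpr = {"192.0.2.1": 900, "192.0.2.2": 80, "192.0.2.3": 20}
print(rtt_coverage(qpr, {"192.0.2.1", "192.0.2.2"}))  # -> 98.0
```

Weighting by query volume is what makes the metric robust to the long tail of low-traffic resolvers: a small share of resolvers with known RTT can still cover most of the traffic.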
It turned out that even with the less aggressive setting (udp-truncate-rate: 1000) in the time slot #1, more than 85% of the DNS traffic volume came from resolvers with known RTT, as shown in Figure 4.
From the viewpoint of operational planning, it is often more useful to estimate mean RTT for autonomous systems (AS) rather than individual resolvers because AS is the unit of granularity for the BGP protocol that governs IP anycast routing. Figure 5 demonstrates that the noudp module yields a much better RTT coverage for autonomous systems, too.
To get a more reliable RTT estimate, it is good to have more than one TCP connection from a resolver or AS. Therefore, we also determined how much traffic comes from autonomous systems for which we captured 10 or more TCP connections. Figure 6 shows that with the noudp module a big part of the traffic again originated from autonomous systems for which we have a reliable estimate of RTT.
Another possibility to increase the number of observed resolvers, autonomous systems and countries sending DNS queries over TCP is to extend the measurement period. We analysed natural DNS-over-TCP queries (i.e. with noudp off) captured on another DNS server, also located in Stockholm, in the first two weeks of February 2021. Figure 7 shows the number of distinct resolvers, ASes and countries depending on the length of the measurement time slot that was varied in the range between 1 and 14 days.
Clearly, at a higher level of aggregation (ASes and countries) we are able to increase the coverage by extending the time slot length. However, this is not the case for unaggregated data (individual resolvers).
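The quantity plotted in Figure 7 is simply the cumulative number of distinct sources as the measurement window grows. A minimal sketch over hypothetical daily observations:

```python
def distinct_over_window(daily_sources: list) -> list:
    """Cumulative number of distinct sources (resolvers, ASes, countries)
    as the measurement window is extended one day at a time."""
    seen = set()
    counts = []
    for day_set in daily_sources:
        seen.update(day_set)
        counts.append(len(seen))
    return counts

# Hypothetical three days of observed source ASes:
days = [{"AS1", "AS2"}, {"AS2", "AS3"}, {"AS3", "AS4"}]
print(distinct_over_window(days))  # -> [2, 3, 4]
```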
Given that the use of the noudp module significantly increases RTT coverage, it is reasonable to expect that it also leads to more precise estimates of the round-trip time. In this section, we compare empirical distributions and mean values of RTT obtained from (i) natural TCP traffic (noudp module off), and (ii) augmented TCP traffic resulting from the more aggressive setting of the truncate rate (udp-truncate-rate: 100). We analyse the differences for four autonomous systems (8075, 12322, 13335 and 27357) that generate sufficient TCP traffic on our test server even with noudp off.
Figure 8 shows empirical probability density of RTT for all ASes and both truncate rate settings.
It has to be noted that the samples for each TR setting were taken in disjoint time slots (#2 and #3), so the observations may also be potentially influenced by differences in IP routing, network utilization and other factors.
The following four figures show detailed results for the four selected autonomous systems:
Natural TCP traffic of AS8075 (Figure 9) has sources in the USA, Ireland and the UK. For udp-truncate-rate: 100, we can also see a small amount of Norwegian traffic (around an RTT of 28 ms). However, the main reason for the 30% increase in mean RTT for udp-truncate-rate: 100 is the higher relative abundance of traffic from the USA.
AS12322 (Figure 10) only has traffic from France. The 10% increase in mean RTT for udp-truncate-rate: 100 is mainly due to the presence of an outlier with an RTT of 267 ms.
AS13335 traffic (Figure 11) has geographically diverse sources (Singapore, USA, Sweden and Lithuania). The difference in mean RTT results from contributions with opposite effects for udp-truncate-rate: 100. On the one hand, the higher abundance of traffic from Singapore tends to increase the mean RTT. On the other hand, the larger share of US traffic around 100 ms, as well as more traffic from Northern Europe (Iceland, Latvia, Estonia, Denmark), decreases the mean RTT.
AS27357 (Figure 12) is again relatively homogeneous and has sources only from the USA. The difference of mean RTT between the two settings of the truncate rate is quite small, approximately 1%.
In our study we demonstrated that a significant improvement in RTT coverage can be achieved by replying with truncation (TC) responses to a small fraction of DNS queries. With the truncation rate of 0.01, we were able to achieve 93% RTT coverage compared to 11% that was observed for natural traffic. At the same time, the intensity of DNS-over-TCP traffic remained low (no more than ~1% of the total DNS traffic) and no operational issues were observed on the DNS server.
The impact of TC responses on resolvers sending the queries is less clear. We observed that the number of TCP connections to our DNS servers was slightly lower than the number of TC responses that the test server sent. This means that not all DNS clients obeyed the DNS protocol and resent the DNS query over TCP after getting a truncated response. Our numbers show that the use of the noudp module did not lead to an increase in the percentage of such non-compliant resolvers.
The increased RTT coverage also means more precise estimates of the RTT value. In particular, the additional TCP traffic allows for detecting resolvers and networks with low traffic and poor latency. Quantitatively, the influence of random TC responses on the RTT estimates strongly depends on the character of the autonomous system – the difference varied between 0.75% and 30% with respect to estimates obtained from natural traffic. It is quite likely that for homogeneous ASes with most resolvers having similar latency the natural TCP traffic is enough for obtaining good RTT estimates.
It has to be mentioned that .CZ ccTLD uses elliptic curve keys for DNSSEC, which significantly reduces the size of a DNS message. Therefore, the TCP share in DNS traffic to .CZ servers remains low (less than 1%) even though the .CZ zone is protected with DNSSEC. For a ccTLD that uses RSA keys, the share of TCP traffic may be much higher – on authoritative servers managed by CZ.NIC we often observe as much as 25% for such domains. However, the noudp module can still be useful for obtaining better RTT estimates because, as we noted in the introduction, the subset of DNSSEC-validating resolvers may be biased.
An important question related to the use of the noudp module (or similar approaches) is the optimum setting of the truncate rate: what value is sufficient for obtaining reliable RTT estimates without significantly affecting the operation of the DNS server? This is a topic for further research.