Abstract
This report is a follow-up to our previous research on passive analysis of round-trip time (RTT) for DNS transactions. The new method described herein uses randomly sent truncated DNS replies to increase the volume of DNS-over-TCP traffic available for estimating RTT. The method was experimentally deployed on a production DNS server for the .CZ ccTLD. Quantitative results are compared to normal traffic in terms of TCP traffic volume, RTT coverage, as well as empirical distributions and mean values of RTT.
Introduction
Network latency is an important parameter that influences the end-user experience for most interactive Internet applications, notably web browsing. In virtually all cases, a share of the overall latency budget can be attributed to DNS. Browser vendors have invested a lot of effort into minimizing latency, and DNS latency in particular, on the client side [2].
Operators of top-level DNS domains (TLD), such as CZ.NIC, may improve the latency for their clients – DNS resolvers – either by tuning BGP parameters that control anycast routing, or by deploying new DNS servers in regions with suboptimal latency. The latter is often not a trivial undertaking, in terms of both expenses and logistics. Therefore, it has to be carefully planned based on the assessment of actual latency.
In 2019 we presented a method for passive analysis of DNS round-trip time (RTT) [1] based on estimating RTT from the TCP handshake between a DNS client and server. The advantage of this approach is that RTT estimates can be obtained directly from the existing DNS traffic. On the other hand, a potential drawback is that the prevalent transport protocol for DNS is UDP, and TCP is used only for a small fraction of DNS traffic. Moreover, this “natural” TCP traffic may be biased in the sense that the originating resolvers are not a representative sample of all resolvers in terms of RTT. One can expect, for example, to observe TCP connections mostly from DNSSEC-validating resolvers, and worldwide penetration of DNSSEC is still far from homogeneous [4].
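The handshake-based estimator from [1] boils down to timing two packets on the server side: the interval between the server sending SYN+ACK and receiving the client's final ACK approximates one round trip. A minimal illustration (our own sketch, not the authors' actual tooling):

```python
def handshake_rtt(t_synack_sent: float, t_ack_received: float) -> float:
    """Server-side RTT estimate from the TCP three-way handshake:
    the time between sending SYN+ACK and receiving the client's ACK
    approximates one round trip to the resolver (timestamps in seconds)."""
    return t_ack_received - t_synack_sent

# Example: SYN+ACK sent at t=0.000 s, final ACK observed at t=0.032 s
rtt = handshake_rtt(0.000, 0.032)  # 0.032 s, i.e. a 32 ms round trip
```

In practice the timestamps would come from packet captures on the authoritative server; the point is that no active probing of the resolver is needed.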
One possibility for convincing more resolvers to send DNS queries over TCP is provided by the DNS protocol itself: it is the TC (TrunCation) bit in the DNS header, which is normally used by a DNS server for indicating to the resolver that the response exceeds the maximum size permitted for UDP, and is therefore truncated. An idea of artificially increasing the volume of DNS-over-TCP traffic by replying with random TC responses to a small fraction of DNS queries was proposed as a logical continuation of our study [1], and we first shared it with the CENTR community members interested in doing TCP-based RTT analysis. The same approach was also proposed later by a group of DNS researchers who conducted similar passive RTT analysis for the ccTLD of The Netherlands [3].
Knot DNS has the optional noudp module that was originally intended as a simpler alternative to the RRL mechanism [5], which is supported by all major DNS server implementations. We used the noudp module for replying with a TC response to a preset (small) fraction of received UDP queries. To make this possible, we added, in cooperation with the Knot DNS developers, the udp-truncate-rate configuration parameter. It allows us to set the desired rate of artificial TC responses: for example, udp-truncate-rate: N means that every Nth UDP response is truncated.
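A configuration along the following lines enables this behaviour in Knot DNS; this is a sketch based on Knot's usual module-configuration pattern, and exact option placement may differ between Knot versions:

```yaml
# Truncate every 1000th UDP response, forcing a retry over TCP
mod-noudp:
  - id: tc-sample
    udp-truncate-rate: 1000

# Apply the module to all zones served by this instance
template:
  - id: default
    global-module: mod-noudp/tc-sample
```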
The work described in this report had the following objectives:
1. Test and evaluate the activation of the noudp module on a production TLD DNS server with different settings of the udp-truncate-rate parameter.
2. Analyse the expected increase in TCP traffic, but also RTT coverage – a measure indicating how big a part of DNS traffic originated from DNS resolvers with known RTT estimates.
3. Determine whether RTT estimates obtained with the aid of the noudp module differ significantly from those obtained from natural DNS-over-TCP traffic.
We performed experiments with the noudp module on a minor node running Knot DNS and serving the .CZ domain (located in Stockholm, Sweden). Due to the configuration of anycast routing, this DNS server receives queries only from a subset of the global Internet as shown in Figure 1.
The tests were spread over three consecutive 24-hour time slots in March 2021, each slot with a different setting of the truncate rate (TR):
1. Time slot #1: udp-truncate-rate: 1000
2. Time slot #2: udp-truncate-rate: 100
3. Time slot #3: noudp module off
Hence, in the time slot #1 the noudp module was configured to send a TC response for one out of every 1000 UDP queries, then with a higher rate of 1/100 in the time slot #2 and, finally, the noudp module was turned off in the time slot #3.
As we expected, the activation of the noudp module caused an increase in the number of incoming TCP queries. Figure 2 shows queries per second (QPS) over TCP and UDP during the entire testing period. The overall traffic intensity in the time slot #3 was lower because this slot covers Friday and Saturday.
From the comparison of TCP traffic rates in the time slots #1 and #3 we can estimate that the natural TCP traffic still accounts for more than one third of the TCP traffic observed for udp-truncate-rate: 1000. In contrast, the TCP traffic rate in the time slot #2 (udp-truncate-rate: 100) is almost exactly 1% of UDP, so in this case the vast majority of DNS-over-TCP queries are a consequence of artificial TC responses.
In the time slots #1 and #2 we also observed a significant increase in the number of distinct resolvers that connected to our authoritative server using TCP. Figure 3 shows the number of distinct TCP and UDP sources per hour during the whole testing period.
Similarly, the noudp module helped us increase the number of autonomous systems and countries for which we registered TCP connections. The quantitative results are summarized in Table 1:
Truncate rate | UDP QPS | UDP resolvers (24h) | UDP networks (24h) | UDP countries (24h) | TCP QPS | TCP resolvers (24h) | TCP networks (24h) | TCP countries (24h) | TCP share: QPS | TCP share: resolvers | TCP share: networks | TCP share: countries
---|---|---|---|---|---|---|---|---|---|---|---|---
off | 171.37 | 47284 | 3102 | 117 | 0.20 | 1000 | 393 | 59 | 0.12% | 2.11% | 12.67% | 50.43%
1000 | 235.33 | 55548 | 3269 | 121 | 0.45 | 4578 | 859 | 75 | 0.19% | 8.24% | 26.28% | 61.98%
100 | 220.60 | 53341 | 3253 | 119 | 2.22 | 12180 | 1766 | 99 | 1.01% | 22.83% | 54.29% | 83.19%
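The "TCP share" columns in Table 1 are simply the TCP figures expressed as a percentage of the corresponding UDP figures. A trivial sketch (the function name is ours):

```python
def tcp_share(tcp: float, udp: float) -> float:
    """TCP count as a percentage of the corresponding UDP count."""
    return 100.0 * tcp / udp

# Reproduce the first row of Table 1 (noudp off):
print(round(tcp_share(0.20, 171.37), 2))   # QPS share      -> 0.12
print(round(tcp_share(1000, 47284), 2))    # resolver share -> 2.11
```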
Even with the more aggressive setting (udp-truncate-rate: 100) in the time slot #2, we were able to observe TCP connections from only 23.87% of resolvers. This can be explained by the fact that there are many low-traffic resolvers whose queries are never selected for a TC response.
In order to get a better picture, we introduced RTT coverage, which takes into consideration the number of queries sent by a resolver. This way, we could evaluate how big a part of DNS traffic originated from DNS resolvers with known RTT. Specifically, we divided DNS traffic into two groups:
1. queries from resolvers for which at least one TCP connection – and thus an RTT estimate – was observed,
2. queries from resolvers with no observed TCP connection and therefore unknown RTT.
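The RTT coverage metric can be sketched in a few lines, assuming we have per-resolver query counts and the set of resolvers seen over TCP (all names here are illustrative):

```python
def rtt_coverage(queries_per_resolver: dict, resolvers_with_rtt: set) -> float:
    """Percentage of total query volume coming from resolvers whose RTT
    is known, i.e. that opened at least one TCP connection."""
    total = sum(queries_per_resolver.values())
    known = sum(q for r, q in queries_per_resolver.items()
                if r in resolvers_with_rtt)
    return 100.0 * known / total

# Hypothetical counts: two heavy resolvers with known RTT, one light without
qpr = {"192.0.2.1": 900, "192.0.2.2": 80, "192.0.2.3": 20}
print(rtt_coverage(qpr, {"192.0.2.1", "192.0.2.2"}))  # -> 98.0
```

Weighting by query volume is what makes the metric robust to the long tail of low-traffic resolvers: a small share of resolvers with known RTT can still cover most of the traffic.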
It turned out that even with the less aggressive setting (udp-truncate-rate: 1000) in the time slot #1, more than 85% of the DNS traffic volume came from resolvers with known RTT, as shown in Figure 4.
From the viewpoint of operational planning, it is often more useful to estimate mean RTT for autonomous systems (AS) rather than individual resolvers because AS is the unit of granularity for the BGP protocol that governs IP anycast routing. Figure 5 demonstrates that the noudp module yields a much better RTT coverage for autonomous systems, too.
To get a more reliable RTT estimate, it is good to have more than one TCP connection from a resolver or AS. Therefore, we also determined how much traffic comes from autonomous systems for which we captured 10 or more TCP connections. Figure 6 shows that with the noudp module a big part of the traffic again originated from autonomous systems for which we have a reliable estimate of RTT.
Another possibility to increase the number of observed resolvers, autonomous systems and countries sending DNS queries over TCP is to extend the measurement period. We analysed natural DNS-over-TCP queries (i.e. with noudp off) captured on another DNS server, also located in Stockholm, in the first two weeks of February 2021. Figure 7 shows the number of distinct resolvers, ASes and countries depending on the length of the measurement time slot that was varied in the range between 1 and 14 days.
Clearly, at a higher level of aggregation (ASes and countries) we are able to increase the coverage by extending the time slot length. However, this is not the case for unaggregated data (individual resolvers).
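The quantity plotted in Figure 7 is simply the cumulative number of distinct sources as the measurement window grows. A minimal sketch over hypothetical daily observations:

```python
def distinct_over_window(daily_sources: list) -> list:
    """Cumulative number of distinct sources (resolvers, ASes, countries)
    as the measurement window is extended one day at a time."""
    seen = set()
    counts = []
    for day_set in daily_sources:
        seen.update(day_set)
        counts.append(len(seen))
    return counts

# Hypothetical three days of observed source ASes:
days = [{"AS1", "AS2"}, {"AS2", "AS3"}, {"AS3", "AS4"}]
print(distinct_over_window(days))  # -> [2, 3, 4]
```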
Given that the use of the noudp module significantly increases RTT coverage, it is reasonable to expect that it also leads to more precise estimates of the round-trip time. In this section, we compare empirical distributions and mean values of RTT obtained from (i) natural TCP traffic (noudp module off), and (ii) augmented TCP traffic resulting from the more aggressive setting of the truncate rate (udp-truncate-rate: 100). We analyse the differences for four autonomous systems (8075, 12322, 13335 and 27357) that generate sufficient TCP traffic on our test server even with noudp off.
Figure 8 shows empirical probability density of RTT for all ASes and both truncate rate settings.
It has to be noted that the samples for each TR setting were taken in disjoint time slots (#2 and #3), so the observations may also be potentially influenced by differences in IP routing, network utilization and other factors.
The following four figures show detailed results for the four selected autonomous systems:
Natural TCP traffic of AS8075 (Figure 9) has sources in the USA, Ireland and the UK. For udp-truncate-rate: 100, we can also see a small amount of Norwegian traffic (around an RTT of 28 ms). However, the main reason for the 30% increase in mean RTT for udp-truncate-rate: 100 is the higher relative abundance of traffic from the USA.
AS12322 (Figure 10) only has traffic from France. The 10% increase in mean RTT for udp-truncate-rate: 100 is mainly due to the presence of an outlier with an RTT of 267 ms.
AS13335 traffic (Figure 11) has geographically diverse sources (Singapore, USA, Sweden and Lithuania). The difference in mean RTT results from contributions with opposite effects for udp-truncate-rate: 100. On the one hand, the higher abundance of traffic from Singapore tends to increase the mean RTT. On the other hand, the larger share of US traffic around 100 ms, as well as more traffic from Northern Europe (Iceland, Latvia, Estonia, Denmark), decreases the mean RTT.
AS27357 (Figure 12) is again relatively homogeneous and has sources only from the USA. The difference of mean RTT between the two settings of the truncate rate is quite small, approximately 1%.
In our study we demonstrated that a significant improvement in RTT coverage can be achieved by replying with truncation (TC) responses to a small fraction of DNS queries. With the truncation rate of 0.01, we were able to achieve 93% RTT coverage compared to 11% that was observed for natural traffic. At the same time, the intensity of DNS-over-TCP traffic remained low (no more than ~1% of the total DNS traffic) and no operational issues were observed on the DNS server.
The impact of TC responses on resolvers sending the queries is less clear. We observed that the number of TCP connections to our DNS servers was slightly lower than the number of TC responses that the test server sent. This means that not all DNS clients obeyed the DNS protocol and resent the DNS query over TCP after getting a truncated response. Our numbers show that the use of the noudp module did not lead to an increase in the percentage of such non-compliant resolvers.
The increased RTT coverage also means more precise estimates of the RTT value. In particular, the additional TCP traffic allows for detecting resolvers and networks with low traffic and poor latency. Quantitatively, the influence of random TC responses on the RTT estimates strongly depends on the character of the autonomous system – the difference varied between 0.75% and 30% with respect to estimates obtained from natural traffic. It is quite likely that for homogeneous ASes with most resolvers having similar latency the natural TCP traffic is enough for obtaining good RTT estimates.
It has to be mentioned that .CZ ccTLD uses elliptic curve keys for DNSSEC, which significantly reduces the size of a DNS message. Therefore, the TCP share in DNS traffic to .CZ servers remains low (less than 1%) even though the .CZ zone is protected with DNSSEC. For a ccTLD that uses RSA keys, the share of TCP traffic may be much higher – on authoritative servers managed by CZ.NIC we often observe as much as 25% for such domains. However, the noudp module can still be useful for obtaining better RTT estimates because, as we noted in the introduction, the subset of DNSSEC-validating resolvers may be biased.
An important question related to the use of the noudp module (or similar approaches) is the optimum setting of the truncate rate: what value is sufficient for obtaining reliable RTT estimates without significantly affecting the operation of the DNS server? This is a topic for further research.