There are more than 1.36 million .cz domains, but many of them remain inactive. In this study we attempted to count the inactive ones. To do so, we analysed data delivered by various sources in the ADAM project. We considered a domain active if it hosted a non-parking website or if a mail server was configured for it.

Web content

We used the DNS crawler to download the web content of each .cz domain. We collected data for various combinations of port (80/443), prefix label (empty/www) and IP version (IPv4/IPv6), and we followed all HTTP redirects. As a result, each domain was assigned to one of four classes:

  • No content - if its web server was unreachable or if the web content was empty
  • HTTP error - if its web server answered with a 4xx or 5xx HTTP status code
  • Parking web - if its web content was classified as a parking website
  • Active web - if its web content was classified as a non-parking website
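The four classes above can be sketched as a small decision function. This is only an illustration: the `FetchResult` record, the precedence of an HTTP error over empty content, and the priority order used to merge several port/prefix/IP-version attempts into one class per domain are all assumptions, not details taken from the DNS crawler.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FetchResult:
    """Hypothetical summary of one crawl attempt (not the crawler's real schema)."""
    status: Optional[int]      # final HTTP status after redirects, None if unreachable
    text: str                  # visible text extracted from the page
    is_parking: bool = False   # verdict of the parking classifier

def web_class(result: FetchResult) -> str:
    """Map one crawl attempt to one of the four web content classes."""
    if result.status is None:
        return "No content"            # server unreachable
    if 400 <= result.status <= 599:
        return "HTTP error"            # assumption: error status wins over empty body
    if not result.text.strip():
        return "No content"            # reachable, but nothing to classify
    return "Parking web" if result.is_parking else "Active web"

# Assumed priority for merging several attempts into one class per domain.
PRIORITY = ["Active web", "Parking web", "HTTP error", "No content"]

def domain_class(results) -> str:
    """Pick the highest-priority class seen across all attempts for a domain."""
    classes = {web_class(r) for r in results}
    return next((c for c in PRIORITY if c in classes), "No content")
```

For example, a domain that is unreachable over IPv6 but serves a regular page over IPv4 would end up in Active web under this merging rule.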

To detect the so-called parking web pages, we used machine learning. As input to the classifier we used preprocessed web content (visible text extracted from HTML, no JS rendering) gathered by the DNS crawler. Our model was based on TF-IDF, a text mining concept that helps identify important words. A word's importance is estimated from its frequency in a document relative to its frequency in the whole corpus. In addition to counting single words, we also investigated the frequencies of word sequences (n-grams of one to three words). For example, after removing accents and stop words, a popular sequence on parking web pages was “domena zaregistrovana” (Czech: “domain registered”), while on non-parking websites it was “vsechna prava vyhrazena” (Czech: “all rights reserved”).
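To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF over 1-3 word n-grams. It uses one common weighting (relative term frequency times log(N/df)); the exact weighting, preprocessing and classifier used in the study are not described here, so the function names and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """Return all word sequences of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tf_idf(docs, n_max=3):
    """Compute one {term: weight} dict per document, weight = tf * log(N / df)."""
    term_lists = [ngrams(doc.split(), n_max) for doc in docs]
    df = Counter()                      # number of documents containing each term
    for terms in term_lists:
        df.update(set(terms))
    n_docs = len(docs)
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1         # avoid division by zero for empty documents
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights

# Toy corpus (already lowercased, accents and stop words removed).
docs = [
    "domena zaregistrovana domena prodej",
    "vsechna prava vyhrazena kontakt",
    "kontakt prodej vsechna prava vyhrazena",
]
weights = tf_idf(docs)
```

In this toy corpus, “domena zaregistrovana” occurs in only one document and therefore receives a higher weight there than “prodej”, which also appears in another document.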

Our model was trained on the results of the manual classification that we perform every year for the domain report. We achieved 92% accuracy on the test set (F1 score = 0.92); in other words, web content was classified correctly for about 92 out of every 100 domains.
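For readers less familiar with these metrics, the sketch below shows how accuracy and an F1 score can be computed from true and predicted labels. It treats “parking” as the positive class, which is an assumption: the report does not say whether the F1 score was computed for one class or averaged over classes.

```python
def accuracy_and_f1(y_true, y_pred, positive="parking"):
    """Return (accuracy, F1 for the chosen positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Made-up labels for five domains, for illustration only.
y_true = ["parking", "parking", "active", "active", "active"]
y_pred = ["parking", "active", "active", "active", "parking"]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```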

Mail service

A domain was considered to provide mail service if the DNS crawler successfully connected to port 25, 465 or 587 on a mail server for that domain. The mail server was taken from the mail exchange (MX) record or, if there was no MX record, from the A/AAAA record (see RFC 5321).

Each domain was assigned one of the following classes:

  • Active mail - if there was a reachable mail server for this domain
  • Inactive mail - if there was no mail server for this domain or it was unreachable
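The mail check described above can be sketched as follows. The DNS lookup itself is left out: the function takes already-resolved MX records as `(preference, host)` pairs plus a flag saying whether the domain has an A/AAAA record. These inputs, the function names and the five-second timeout are assumptions made for illustration, not the DNS crawler's actual interface.

```python
import socket

MAIL_PORTS = (25, 465, 587)

def mail_hosts(domain, mx_records, has_address_record):
    """Hosts to probe: MX targets by preference, else the domain itself
    (the fallback to the A/AAAA record described in RFC 5321)."""
    if mx_records:
        return [host for _pref, host in sorted(mx_records)]
    return [domain] if has_address_record else []

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:          # refused, timed out, unresolvable, ...
        return False

def mail_class(domain, mx_records, has_address_record, probe=port_open):
    """Classify a domain as 'Active mail' or 'Inactive mail'."""
    for host in mail_hosts(domain, mx_records, has_address_record):
        if any(probe(host, port) for port in MAIL_PORTS):
            return "Active mail"
    return "Inactive mail"
```

Passing the probe as a parameter keeps the classification logic testable without network access; in a real scan `port_open` would be used.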

Final prediction

Finally, we combined the web content and mail service data to make the final prediction. Each domain was classified as:

  • Active web or mail - if it belonged to the Active web class or the Active mail class (i.e. it hosted a non-parking website or had a working mail server)
  • Inactive web and mail - if there was neither a non-parking website nor an active mail server for this domain
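The combination rule is a simple logical OR over the two earlier classifications. A minimal sketch, with the function names chosen here for illustration:

```python
def final_class(web: str, mail: str) -> str:
    """Combine the web content class and the mail class into the final prediction."""
    if web == "Active web" or mail == "Active mail":
        return "Active web or mail"
    return "Inactive web and mail"

def active_share(pairs) -> float:
    """Fraction of (web, mail) pairs that end up in 'Active web or mail'."""
    pairs = list(pairs)
    active = sum(final_class(w, m) == "Active web or mail" for w, m in pairs)
    return active / len(pairs)
```

Note that a domain with a parking page but a reachable mail server still counts as active under this rule.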

Results

On 23 October 2020 we scanned 1,365,753 .cz domains, gathering 200 GB of data that we used as input for our model. For 61.1% of domains, the web content was classified as a non-parking website, while parking web pages accounted for 20.1%. We were unable to get the web content for 12.0% of domains, and for 6.7% of domains we observed an HTTP error. Around 69.3% of domains had an operating mail server.


79.6% of domains had either a non-parking website or a working mail server.

Domain age

An interesting finding was that domain age (the time between registration and 23 October 2020) is correlated with the web content classification result. Older domains were more likely to host a non-parking website: for domains older than 20 years, this percentage reached 76.7%. In contrast, 34.8% of domains younger than one year hosted a parking web page.


A similar trend was observed for mail service: older domains had a higher percentage of active mail servers.
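A breakdown by domain age like the one described in this section could be computed along these lines. The sketch assumes each domain comes with its registration date and its web content class; grouping by completed years is also an assumption, since the report does not state the bucket boundaries.

```python
from collections import defaultdict
from datetime import date

SCAN_DATE = date(2020, 10, 23)

def age_years(registered: date, scan: date = SCAN_DATE) -> int:
    """Completed years between registration and the scan date."""
    years = scan.year - registered.year
    if (scan.month, scan.day) < (registered.month, registered.day):
        years -= 1
    return years

def share_by_age(rows, target="Active web"):
    """Map age in years to the share of domains whose class equals target.

    rows is an iterable of (registration_date, web_class) pairs.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for registered, web in rows:
        bucket = age_years(registered)
        totals[bucket] += 1
        hits[bucket] += web == target
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

The same function works for mail service by passing pairs of registration date and mail class with `target="Active mail"`.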

Domain label length

We analysed the correlation between the length of a second-level domain label and the web content classification results. It turned out that 3-letter domains were less likely to host a non-parking website (53.0%). We believe this could be explained by the fact that short domains are often registered for purposes other than hosting a website (e.g. for profit or to run a second-level domain registry).


A different trend was observed for mail service: domains with a longer label were less likely to have a mail server.

DNS traffic

We analysed DNS queries to .cz DNS servers in order to evaluate the popularity of .cz domains. For each domain we counted the number of distinct sources (DNS resolvers) that sent DNS queries for that domain on 23 October 2020. Unsurprisingly, the number of distinct sources was highest for domains classified as active (i.e. having a non-parking website or a mail server).

Number of sources of DNS queries per domain

Final prediction        median  mean  q=0.05  q=0.25  q=0.75  q=0.95
Active web or mail          98   289       6      39     261     944
Inactive web and mail       28    80       2      11      54     327
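Statistics like those in the table can be derived from raw query logs roughly as follows. The sketch assumes the log is available as `(source_ip, domain)` pairs and uses the "inclusive" interpolation of Python's `statistics.quantiles`; the report does not say which quantile method was used, so the exact values could differ slightly.

```python
from collections import defaultdict
from statistics import mean, median, quantiles

def distinct_sources(queries):
    """Map each domain to the number of distinct resolver addresses that queried it."""
    seen = defaultdict(set)
    for source_ip, domain in queries:
        seen[domain].add(source_ip)
    return {domain: len(ips) for domain, ips in seen.items()}

def summarise(counts):
    """Median, mean and selected quantiles of per-domain source counts.

    Needs at least two values; n=20 yields 19 cut points in 5% steps.
    """
    cuts = quantiles(counts, n=20, method="inclusive")
    return {
        "median": median(counts),
        "mean": mean(counts),
        "q0.05": cuts[0],
        "q0.25": cuts[4],
        "q0.75": cuts[14],
        "q0.95": cuts[18],
    }
```

Running `summarise` separately on the counts of active and inactive domains would give one table row per final prediction class.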

In the chart below, the cumulative distribution of DNS sources is presented.

Registrars

Classification results varied depending on the registrar, although some trends could be spotted. The percentage of active domains is shown in the chart below. A common tendency can be observed for the big players: the percentage of active domains in their portfolios is around 72%.

Conclusions

Our study revealed that around 80% of .cz domains host a non-parking website or run a mail server. This percentage is higher for older domains.

It should be noted that some observations may have been misclassified, but we believe that an accuracy of 92% is good enough to draw inferences about the web content classification. Moreover, our study focused only on web and mail, as these are the most popular services associated with second-level domains.

© CZ.NIC, z.s.p.o.
ADAM is an R&D project that tries to get the most of the big data generated by DNS and other services operated by CZ.NIC.