There are more than 1.36 million .cz domains, but many of them remain inactive. In our study we made an attempt to count all such domains. In order to do so we analysed data delivered by various sources in the ADAM project. We made an assumption that a domain was active if it hosted a non-parking website or a mail server was configured for this domain.
We used the DNS crawler to download the web content for each .cz domain. We collected data for various combinations of port (80/443), prefix label (empty/www) and IP version (IPv4/IPv6), and we followed all HTTP redirections. As a result, each domain was assigned to one of four classes:
- non-parking website
- parking web page
- no web content obtained
- HTTP error (4xx or 5xx HTTP status code)

To detect so-called parking web pages, we used machine learning. The classifier's input was preprocessed web content (visible text extracted from HTML, without JavaScript rendering) gathered by the DNS crawler. Our model was based on TF-IDF, a text-mining technique for identifying important words: a word's importance is estimated from its frequency in a document relative to its frequency in the corpus. In addition to single words, we also considered word sequences (n-grams of one to three words). For example, after removing accents and stop words, a common sequence on parking web pages was “domena zaregistrovana” (Czech: “domain registered”), while on non-parking websites it was “vsechna prava vyhrazena” (Czech: “all rights reserved”).
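The TF-IDF weighting over word n-grams can be illustrated with a minimal sketch. This is not the model used in the study: the toy corpus is invented, the function names `ngrams` and `tfidf` are hypothetical, and the plain unsmoothed `tf * log(N / df)` formula is only one of several common TF-IDF variants.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(documents):
    """Compute a TF-IDF weight for every n-gram of every tokenised document."""
    term_counts = [Counter(ngrams(doc)) for doc in documents]

    # Document frequency: the number of documents in which each n-gram occurs.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    n_docs = len(documents)
    weights = []
    for counts in term_counts:
        total = sum(counts.values())
        weights.append({
            # Term frequency in the document times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented toy corpus: lower-cased, accents and stop words already removed.
corpus = [
    "domena zaregistrovana prodej domena".split(),
    "vsechna prava vyhrazena kontakt".split(),
    "kontakt prodej obchod".split(),
]
weights = tfidf(corpus)
```

In this toy corpus the bigram "domena zaregistrovana" occurs in a single document and therefore gets a higher weight than "prodej", which occurs in two. An n-gram present in every document would get a weight of zero.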
Our model was trained on the results of the manual classification that we perform every year for the domain report. It reached 92% accuracy on the test set (F1 score = 0.92); in other words, web content was classified correctly for about 92 out of every 100 domains.
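As a reminder of how the two reported metrics are defined, here is a small sketch that computes accuracy and F1 from a binary confusion matrix. The counts are invented for illustration and are not the study's test-set figures; the source also does not say whether the reported F1 was computed for one class or averaged over classes.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for a binary confusion matrix.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts. This sketch assumes tp > 0, so no division by zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts for 100 test domains, with "parking" as the positive class.
acc, f1 = accuracy_and_f1(tp=46, fp=4, fn=4, tn=46)
```

With these balanced counts both metrics come out at 0.92. On an imbalanced test set the two values can differ noticeably.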
A domain was considered to provide mail service if the DNS crawler successfully connected to port 25, 465 or 587 on a mail server for this domain (as indicated by the mail exchange (MX) record, or by the A/AAAA record if there was no MX; see RFC 5321).
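The rule above can be sketched as follows. This is an illustration under assumptions, not the DNS crawler's code: the MX records are passed in as already-resolved `(preference, hostname)` tuples, the helper names are hypothetical, and a bare TCP connection stands in for whatever check the crawler performs.

```python
import socket

MAIL_PORTS = (25, 465, 587)


def mail_hosts(domain, mx_records):
    """Return the hosts to probe, given MX records as (preference, hostname)."""
    if mx_records:
        # Lower preference values are tried first.
        return [host for _, host in sorted(mx_records)]
    # No MX record: RFC 5321 treats the domain itself as an implicit MX,
    # so its A/AAAA addresses are used.
    return [domain]


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def has_mail_service(domain, mx_records, probe=port_is_open):
    """Return True if any mail host accepts a connection on any mail port."""
    return any(
        probe(host, port)
        for host in mail_hosts(domain, mx_records)
        for port in MAIL_PORTS
    )
```

The `probe` parameter lets the logic be exercised without network access, for example `has_mail_service("example.cz", [], probe=lambda host, port: port == 25)`.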
Each domain was accordingly marked as either having or not having a working mail server.
At the end we combined the data about web content and mail service to make the final prediction. Each domain was classified as either active (hosting a non-parking website or having a working mail server) or inactive (neither).
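The combination step follows directly from the definition of an active domain. In the sketch below, the returned labels are the ones used in the results table, while the `web_class` string values and the function name are assumptions made for illustration.

```python
def final_prediction(web_class, mail_ok):
    """Combine the web content class and the mail check into one label.

    web_class: assumed to be one of "non-parking", "parking",
               "no-content" or "http-error" (hypothetical labels).
    mail_ok:   True if a working mail server was found for the domain.
    """
    if web_class == "non-parking" or mail_ok:
        return "Active web or mail"
    return "Inactive web and mail"
```

For example, a domain with a parking page but a working mail server still counts as active under this rule.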
On 23 October 2020 we scanned 1,365,753 .cz domains gathering 200 GB of data which we used as an input for our model. For 61.1% of domains, the web content was classified as a non-parking website, and parking web pages constituted 20.1%. We were not able to get the web content for 12.0% of domains and for 6.7% of domains an HTTP error was observed. Around 69.3% of domains had an operating mail server.
79.6% of domains had either a non-parking website or a working mail server.
An interesting finding was that domain age (the time between registration and 23 October 2020) is correlated with the web content classification result. Older domains were more likely to host a non-parking website; for domains older than 20 years this percentage reached 76.7%. By contrast, 34.8% of domains younger than one year were hosting a parking web page.
A similar trend was observed for mail service: older domains had a higher percentage of active mail servers.
We analysed the correlation between the length of a second-level domain label and the web content classification results. It turned out that 3-letter domains were less likely to host a non-parking website (53.0%). We believe this could be explained by the fact that short domains are often registered for purposes other than hosting a website (e.g. for profit or to run a second-level domain registry).
A different trend was observed for mail service: domains with a longer label were less likely to have a mail server.
We analysed DNS queries to .cz DNS servers in order to evaluate the popularity of .cz domains. For each domain we counted the number of distinct sources (DNS resolvers) which sent DNS queries for that domain on 23 October 2020. It is not a surprise to see that the number of distinct sources is the highest for domains classified as active (i.e. having a non-parking website or a mail server).
Final prediction | median | mean | q=0.05 | q=0.25 | q=0.75 | q=0.95 |
---|---|---|---|---|---|---|
Active web or mail | 98 | 289 | 6 | 39 | 261 | 944 |
Inactive web and mail | 28 | 80 | 2 | 11 | 54 | 327 |
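The statistics in the table can be reproduced in outline as follows. The sketch assumes a hypothetical query log of `(resolver_ip, domain)` pairs, and the interpolation method used for the quantiles is also an assumption, so exact values may differ from those reported.

```python
import statistics
from collections import defaultdict


def distinct_sources(query_log):
    """Count distinct resolvers per domain from (resolver_ip, domain) pairs."""
    sources = defaultdict(set)
    for resolver, domain in query_log:
        sources[domain].add(resolver)
    return {domain: len(ips) for domain, ips in sources.items()}


def summarize(counts):
    """Return the table's summary statistics for a list of per-domain counts.

    Needs at least two values, as required by statistics.quantiles.
    """
    # n=20 yields 19 cut points at 0.05, 0.10, ..., 0.95.
    qs = statistics.quantiles(counts, n=20, method="inclusive")
    return {
        "median": statistics.median(counts),
        "mean": statistics.fmean(counts),
        "q=0.05": qs[0],
        "q=0.25": qs[4],
        "q=0.75": qs[14],
        "q=0.95": qs[18],
    }
```

Calling `summarize` separately on the counts of domains predicted active and inactive would give one table row per class.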
In the chart below, the cumulative distribution of DNS sources is presented.
Classification results varied depending on the registrar, although some trends could be spotted. The percentage of active domains is shown in the chart below. A common tendency can be observed for the big players: the percentage of active domains in their portfolios is around 72%.
Our study revealed that around 80% of .cz domains host a non-parking website or run a mail server. This percentage is higher for older domains.
It has to be mentioned that a small number of observations could be misclassified, but we believe that the accuracy of 92% is good enough to make inferences about the web content classification. Moreover, in our study we only focused on web and mail as these are the most popular services associated with second-level domains.