Exploring the distribution of mail services in the CZ domain

ADAM report 1/2023

Authors

Dan Řezníček

Ladislav Lhotka

Published

July 27, 2023

Code

library(dplyr)
library(ggplot2)
library(plotly)
library(lubridate)
library(tidyr)
library(knitr)
library(kableExtra)
library(formattable)
library(jsonlite)
library(forcats)
library(colorspace)
library(readr)
library(brms)
library(bayesplot)
library(tidybayes)
library(future)
library(countrycode)
library(rnaturalearth)
library(sf)
library(lwgeom)
library(maps)
library(units)
library(s2)
library(corrplot)
library(performance)
library(ggthemes)
library(patchwork)
library(units)
library(stringr)
library(stringi)
library(cmdstanr)
library(scales)
theme_set(theme_minimal())

Code

root_endpoint <- "https://stats.nic.cz"

api_endpoint <- function(endpoint, ...) {
  pars <- list(...)
  if (length(pars) == 0) {
    return(paste0(root_endpoint, endpoint))
  }
  pstr <- mapply(function(x, y) paste(x, URLencode(toString(y)), sep = "="),
                 names(pars), pars)
  paste0(root_endpoint, endpoint, "?", paste(pstr, collapse = "&"))
}

Introduction

During the early years of the Internet, all institutions operated their own mail servers. Their MX (mail exchange) resource records in the DNS thus pointed to domain names within the same domain. However, over the past 10 or 15 years, MX records have undergone a process of outsourcing to mail service providers ranging from local ISPs to large national or global technology companies such as Seznam.cz or Google’s GMail. On the other hand, many second-level domains still have MX records pointing to mail servers within the same domain (see Table 1). In this report, we explore MX records of all second-level domains under CZ with the aim of assessing the distribution and accumulation of mail services across countries and autonomous systems. We provide both exploratory descriptive statistics and also some modeling-based insights.

Code

# These values were taken from a full scan, 2023-07-06.
mx_table <- data.frame(
  "Domains with:" =
    c(" - at least one MX record with its own name.",
      " - at least one MX record without its own name.",
      " - all MX records with its own name."),
  Count  =
    c(132142,
      808737,
      116388),
  check.names = FALSE
  )
mx_table |>
  kable("html",
        format.args = list(big.mark = " ")) |>
  kable_styling(
    bootstrap_options = c("striped",
                          "hover",
                          "condensed",
                          "responsive"),
    full_width = TRUE,
    position   = "left")

Table 1: Outsourcing of MX records.
Domains with:	Count
- at least one MX record with its own name.	132 142
- at least one MX record without its own name.	808 737
- all MX records with its own name.	116 388

We utilized two primary datasets collected by the CZ.NIC’s DNS crawler, aggregating domain counts by (1) countries and (2) autonomous systems from which the mail servers indicated in MX records are operated. In these datasets, domain names appearing in MX records were resolved to IP address(es) that were then assigned to the corresponding countries or autonomous systems via a geolocation database. Note that a second-level domain may have multiple mail servers operating from various countries and/or autonomous systems. If it is the case, such a domain contributes to the counts of multiple countries or autonomous systems.

The data used in this report are static in the sense that the datasets do not provide a dynamic view of CZ mail server trends (e.g., time series) but just a limited peek into their distributions at a certain date or time period:

For the countries, all data correspond to a single day of 2023-06-01.
For the autonomous systems, the data belong to an interval from 2023-04-20 to 2023-06-15.

Sections labeled as ▶ Code that are interspersed in the text below can be expanded to reveal the actual R code used for producing the statistics, graphs, and tables.

Country perspective

First, we explored geolocations of mail servers of all second-level domains under CZ, using data provided by DNS crawler endpoint /crawler_mail_cc, which aggregates the count of domains with MX records by countries.

Code

df_cc  <- fromJSON(api_endpoint("/crawler_mail_cc"))
country_flags <- read.csv("CSV/country_flags.csv", sep = ";")

Code

df_cc <- df_cc |>
  mutate(ts = as.Date(ts)) |>
  group_by(cc) |>
  filter(ts == "2023-06-01") |>
  ungroup() |>
  rename("mx" = "domains") |>
  mutate(perc = mx / sum(mx) *100) |>
  replace_na(list(cc = "Unknown"))
country_flags <- country_flags |>
  rename(cc = country_code)
df_cc <- left_join(df_cc, country_flags, by = "cc") |>
  replace_na(list(country_flags = "?"))

Distribution of domains with MX records

Table 2 describes the distribution of second-level domains that have at least one MX record. The first row shows frequencies for all countries; however, as the global distribution is hugely influenced by the CZ frequencies (the maximum value at 1 056 497, 62.96% of all such domains; see also Figure 1 below and Table 4 in the Appendix), the second row presents frequencies for all countries except for the Czech Republic, alleviating the bias stemming from the CZ dominance.

Code

a <- df_cc |>
  summarize(Subset = "All countries",
            Min    = min(mx),
            Mean   = mean(mx),
            Median = median(mx),
            Max    = max(mx),
            Total  = sum(mx)
  ) 
b <- df_cc |>
  filter(cc != "CZ") |>
  summarize(Subset = "No CZ",
            Min    = min(mx),
            Mean   = mean(mx),
            Median = median(mx),
            Max    = max(mx),
            Total  = sum(mx)
  )

descriptives_table <- bind_rows(a, b)
descriptives_table |>
  kable("html",
        digits = 2,
        format.args = list(big.mark = " ")) |>
  kable_styling(
    bootstrap_options = c("striped",
                          "hover",
                          "condensed",
                          "responsive"),
    full_width = TRUE,
    position   = "left")

Table 2: Distribution of domains with MX records.
Subset	Min	Mean	Median	Max	Total
All countries	1	18 239.73	68	1 056 497	1 678 055
No CZ	1	6 830.31	61	112 651	621 558

To illustrate the absolute count of these domains, we also filtered countries with more than 100 domains and plotted a barchart which, once again, shows the overwhelming and unsurprising dominance of domains with MX records pointing to mail servers within the Czech Republic.

Code

df_cc_100 <- df_cc |>
  filter(mx > 100) |> 
  arrange(mx)
plot_ly(
  data = df_cc_100,
  x    = ~mx,
  type = "bar",
  name = "Domains by country",
  hovertemplate = paste0(
    "<b>", "Country: ", "</b>", df_cc_100$cc,
    " ", df_cc_100$country_flag, "<br>",
    "<b>", "Domain count: ", "</b>", df_cc_100$mx,
    " (", round(df_cc_100$perc, 2), "%)",
    "<extra></extra>"
  ),
  text         = paste0(
    df_cc_100$country_flag, " ", df_cc_100$cc),
  textposition = "auto",
  marker       = list(color = c(rep("#003893", 41), "#F4587A")
  )
) |>
  layout(
    hovermode = "y unified",
    xaxis     = list(title = "Domain count"),
    yaxis = list (
      title          = "Country",
      showticklabels = FALSE
    )
  )

Figure 1: Distribution of domains with MX records by country (>100).

A complete list of all countries with domains that use MX records can be inspected in Table 4 in the Appendix. To provide additional insight into the relative weight of foreign countries, Table 4 also presents a column Domains percentage (no CZ) which specifies the percentage of all domains with MX records excluding those with mail servers located in the Czech Republic.

Spatial distribution

Furthermore, we explored the spatial distribution of mail servers indicated in the domains’ MX records. In Figure 2, we present log-transformed values of the absolute count of domains with mail servers in given countries. To inspect the absolute counts, hover the mouse cursor over the respective countries. Note, however, that the color scale representing log-transformed values is somewhat misleading in that it flattens the differences.

Code

df_cc <- df_cc |>
  mutate(cc_iso = countrycode(cc, origin  = "iso2c",
                              destination = "iso3c")
         ) |>
  replace_na(list(cc = "Unknown"))
df_cc <- df_cc |>
  mutate(mx_log = log(mx)) |>
  replace_na(list(cc = "Unknown"))
plot_ly(
  df_cc,
  type          = "choropleth",
  locations     = df_cc$cc_iso,
  z             = df_cc$mx_log,
  colors        = "Blues",
  hovertemplate = paste0(
    "<b>Count: </b>", df_cc$mx,
    " in ", df_cc$country_flag, " ", df_cc$cc, "<br>",
    "<b>Log: </b>", round(df_cc$mx_log, 2),
    "<extra></extra>"
    )
  ) |>
  colorbar(title = "Log")

Figure 2: Worldwide distribution of domains with MX records (log values).

Modeling mail servers outside of the Czech Republic, part 1

Unsurprisingly, IP addresses of most mail servers listed in MX records are geolocated in the Czech Republic. However, one can ask what factors might lead domain holders to set up mail servers outside of the Czech Republic? To provide a better understanding of such variations and possible influences, we utilized Bayesian regression models to test the associations between the count of domains with MX records that indicate mail servers operated from abroad and five predictors which we hypothesized to associate with the these counts.

What is Bayesian regression modeling?

Linear regression analysis is used to predict the value of a response variable (i.e., dependent variable) based on the value of another predictor variable (i.e., independent variable). In the Bayesian framework, linear regression formulates probability distributions for such predictions (in comparison to point estimates formulated within the frequentist framework). That is to say, the response variable is not estimated as a single value, but is assumed to be drawn from a probability distribution as Bayesian linear regression does not attempt to find one “best” value of the model parameters, but rather to determine the posterior distribution for the model parameters. When doing so, the analyst can also incorporate prior knowledge about what the parameters should be (or use non-informative priors if one has no guesses about the possible values of the parameters).

The result of a Bayesian linear regression model are distributions of possible model parameters (most notably the values by which the response variable changes depending on the predictor variables) based on the data and the priors. Practically, one can express the relationships between the response and predictors variables by describing these posterior distributions. For example, one can report the mean of the parameter’s posterior distribution and its credible interval, which helps to capture the uncertainty about the possible values of the parameter.

Predictors

Geographic distance from the Czech Republic

First, domain holders might prefer mail servers in countries that are geographically closer to the Czech Republic as they usually provide better round-trip times. We thus hypothesized that there should be a higher count of domains that use MX records in countries closer to the Czech Republic.

To calculate the distances between the Czech Republic and other countries, we first calculated centroids for each country (the geographic point at the center of gravity for a polygon that approximates each country’s borders) and then calculated the distance between the Czech Republic’s centroid coordinates and respective foreign countries’ centroids. As the distribution of these values showed great dispersion, we log-transformed them for the purposes of modeling.

Code

df_cc2 <- df_cc |>
  select(mx, cc_iso)

sf::sf_use_s2(FALSE)
world <- ne_countries(returnclass = "sf", scale = "small")
world <- world |>
  mutate(centroids = st_centroid(geometry))
CZ_polygon <- world |>
  filter(iso_a2 == "CZ")
CZ_centroid <- CZ_polygon |>
  st_centroid(CZ_polygon)
world <- world |>
  mutate(CZ_centroid = CZ_centroid)
world <- world |>
  mutate(distances = st_distance(CZ_centroid,
                                 centroids,
                                 by_element = TRUE)
         ) |>
  select(iso_a3, name, distances) |>
  rename(cc_iso = iso_a3) |>
  st_drop_geometry()
world <- world |>
  mutate(cc_iso = countrycode(name,
                              origin      = "country.name",
                              destination = "iso3c")) |>
  drop_na()

tiny_countries <- ne_countries(type        = "tiny_countries",
                               returnclass = "sf")
tiny_countries <- tiny_countries |>
  mutate(CZ_centroid = CZ_centroid)
tiny_countries <- tiny_countries |>
  mutate(distances = st_distance(CZ_centroid,
                                 geometry,
                                 by_element = TRUE)) |>
  select(iso_a3, name, distances) |>
  st_drop_geometry() |>
  mutate(cc_iso = countrycode(name,
                              origin      = "country.name",
                              destination = "iso3c"))
world <- bind_rows(world, tiny_countries)
df_cc2 <- full_join(df_cc2, world, by = "cc_iso")

df_cc2 <- df_cc2 |>
  filter(cc_iso != "CZE") |>
  select(-iso_a3) |>
  drop_units()

Population size

We also hypothesized that domain holders are more likely to have mail servers in countries with higher populations, as mail service providers are more motivated to deploy their servers in such countries.

The values for population sizes were downloaded from The World Bank. We used the most recent values (year 2021) which we log-transformed.

Code

pop <- read.csv("CSV/POP.csv", sep = ",")
pop <- pop |>
  select(cc, X2021) |>
  rename(population = X2021,
         cc_iso     = cc)
df_cc2 <- full_join(df_cc2, pop, by = "cc_iso")

GDP per capita

Furthermore, we also hypothesized that GDP per capita might increase the count of domains with MX records pointing to mail servers located in foreign countries. Again, the idea was that such countries have a better network infrastructure and may thus be more attractive for server deployments.

Again, the values for GDP per capita were downloaded from The World Bank. We used the most recent values (year 2021, in US dollars) which we log-transformed.

Code

gdp <- read.csv("CSV/GDP.csv", sep = ",")
gdp <- gdp |>
  select(cc, X2021) |>
  rename(gdp    = X2021,
         cc_iso = cc)
df_cc2 <- full_join(df_cc2, gdp, by = "cc_iso")

Export and import value

However, the size of Czech export to foreign countries and foreign countries’ import from the Czech Republic might provide a more granular and trading-network-reflective proxies for international connections than GDP per capita, geographic distance, and population size. Therefore, we hypothesized that the value of Czech Republic’s export to foreign countries would associate with the count of domains that use mail servers operated from abroad. Similarly, we hypothesized that the value of foreign countries’ import from the Czech Republic would associate with the count of these domains.

The values for the size of Czech export and import were downloaded from The Observatory of Economic Complexity (OEC). These values are proxies for a direct trading relationship between the Czech Republic and a given country. We used the most recent freely available values (year 2021) which we log-transformed.

Code

export <- read.csv("CSV/EXPORT.csv", sep = ",")
export <- export |>
  mutate(ISO.3 = replace(ISO.3,
                         Country == "Chinese Taipei",
                         "twn")) |>
  rename(cc_iso = ISO.3,
         export = Trade.Value) |>
  select(cc_iso, export) |>
  mutate(cc_iso = toupper(cc_iso))

df_cc2 <- full_join(df_cc2, export, by = "cc_iso")

import <- read.csv("CSV/IMPORT.csv", sep = ",")
import <- import |>
  mutate(ISO.3 = replace(ISO.3,
                         Country == "Chinese Taipei",
                         "twn")) |>
  rename(cc_iso = ISO.3,
         import = Trade.Value) |>
  select(cc_iso, import) |>
  mutate(cc_iso = toupper(cc_iso))

df_cc2 <- full_join(df_cc2, import, by = "cc_iso")

Treatment of missing values

Upon pairing all the predictor data with our original data on domains using MX records into one dataset, we obtained a number of rows with missing values.

First, we manually adjusted the data for the population size, GDP per capita, and geographic distance for larger countries, filling in the missing values.

Second, as our original dataset does not contain data on countries that do not operate any mail servers under CZ, we assumed that such countries have zero domains using MX records and replaced such missing values with zeros. Similarly, in the case of Czech export and foreign import, we assumed that countries not captured in the OEC datasets do not trade with the Czech Republic, hence we replaced such missing values with zeros.

Lastly, for small (mostly island) countries, we did not attempt to fill in the missing values. Therefore, these countries are omitted in the analysis (as regression models cannot handle rows with missing values) and can be inspected in Table 6 in the Appendix.

Code

df_cc2 <- df_cc2 |>
  mutate(population = replace(population,
                              cc_iso == "TWN",
                              23859912),
         gdp        = replace(gdp,
                              cc_iso == "TWN",
                              33186.34),
         gdp        = replace(gdp,
                              cc_iso == "VEN",
                              2071.6),
         gdp        = replace(gdp,
                              cc_iso == "KWT",
                              28884.3),
         gdp        = replace(gdp,
                              cc_iso == "TKM",
                              10111),
         gdp        = replace(gdp,
                              cc_iso == "YEM",
                              549.49),
         gdp        = replace(gdp,
                              cc_iso == "CUB",
                              7291),
         gdp        = replace(gdp,
                              cc_iso == "GRL",
                              54571),
         gdp        = replace(gdp,
                              cc_iso == "ERI",
                              614),
         gdp        = replace(gdp,
                              cc_iso == "PRK",
                              654),
         distances  = replace(distances,
                              cc_iso == "SYC",
                              7165900),
         distances  = replace(distances,
                              cc_iso == "HKG",
                              8750280),
         distances  = replace(distances,
                              cc_iso == "AND",
                              1330730),
         name       = replace(name,
                              cc_iso == "HKG",
                              "Hong Kong"),
         name       = replace(name,
                              cc_iso == "SYC",
                              "Seychelles")) |>
  mutate(mx        = replace_na(mx, 0),
         export    = replace_na(export,0),
         import    = replace_na(import, 0),
         distances = distances/1000) |>
  mutate(distance_log   = log(distances + 1),
         population_log = log(population + 1),
         gdp_log        = log(gdp + 1),
         export_log     = log(export + 1),
         import_log     = log(import +1)) |>
  dplyr::distinct(cc_iso, .keep_all = TRUE)

df_cc_NAs <- 
  df_cc2 |>
  mutate(mx     = replace_na(mx, 0),
         export = replace_na(export, 0))
df_cc_NAs <- df_cc_NAs[!complete.cases(df_cc_NAs),]

df_cc2 <- df_cc2 |> drop_na()

Passive properties or active channels?

While all of the proposed predictors showed some correlations between each other (see Figure 3 below), they theoretically differ in how they function as proxies for domain holders’ tendency for setting up their mail services outside of the Czech Republic. While geographic distance, population size, and GDP per capita are properties describing respective countries, the values of export and import describe the interaction between the Czech Republic and given countries. In other words, while there is no “human design” behind geographic distance or population size of foreign countries in respect to mail servers indicated in MX records of second-level domains under CZ, both of these variables could influence these counts as a passive side-effect of geographic closeness or high-enough population numbers. Similarly, GDP per capita of foreign countries is an index not directly connected to the Czech Republic but may influence the count of domains as a side-effect of foreign countries’ economic power.

In contrast, Czech export to foreign countries and foreign import from the Czech Republic can be conceptualized as intentionally directed and active networking channels. Therefore, the count of domains with MX records pointing to mail servers in given countries could increase as domain holders might spend resources on developing their own CZ infrastructure and operations in countries where trading with the Czech Republic is more intense.

Code

df_cc2 |>
  mutate(mx_log = log(mx + 1)) |>
  select(mx_log,
         distance_log,
         population_log,
         gdp_log,
         export_log,
         import_log) |>
  drop_units() |>
  drop_na() |>
  cor() |>
  corrplot(method      = "color",
           type        = "lower",
           diag        = FALSE,
           addCoef.col = "Black")

Results

Because the distribution of domain counts was over-dispersed and showed high zero-inflation (i.e., there were many countries with no mail servers indicated in the domains’ MX records), we used zero-inflated negative binomial distribution to model both the count of domains with MX records and the proportion of zeros.

What is zero-inflated negative binomial distribution?

Zero-inflated negative binomial regression can be used for modeling count variables with excessive zeros (i.e., zero-inflation) where the distribution of the count response variable also shows overdispersion (i.e., its variability is greater than would be expected in more classical models like Poisson). Furthermore, the excess zeros can be conceptualized as generated by a separate process from the count values and can be, therefore, modeled independently from the count of the response variable.

Models 1-3

Code

m1 <-
  brm(
    bf(
      mx ~ 
        log(distances + 1) +
        log(population + 1) +
        log(gdp + 1) +
        log(export + 1) +
        log(import + 1),
      zi ~
        log(distances + 1) +
        log(population + 1) +
        log(gdp + 1) +
        log(export + 1) +
        log(import + 1)),
    family = zero_inflated_negbinomial(),
    data   = df_cc2,
    prior  = c(prior(normal(0,2), class = Intercept),
               prior(normal(0,2), class = b)),
    chains = 4,
    iter   = 4000,
    warmup = 1000,
    cores  = parallel::detectCores())
saveRDS(m1, file = "m1.rds")

Code

m2 <-
  brm(bf(mx ~ log(distances + 1),
         zi ~ log(distances + 1)),
    family = zero_inflated_negbinomial(),
    data   = df_cc2,
    prior  = c(prior(normal(0,2), class = Intercept),
               prior(normal(0,2), class = b)),
    chains = 4,
    iter   = 4000,
    warmup = 1000,
    cores  = parallel::detectCores())
saveRDS(m2, file = "m2.rds")

Code

m3 <-
  brm(bf(mx ~ log(population + 1),
         zi ~ log(population + 1)),
    family = zero_inflated_negbinomial(),
    data   = df_cc2,
    prior  = c(prior(normal(0,2), class = Intercept),
               prior(normal(0,2), class = b)),
    chains = 4,
    iter   = 4000,
    warmup = 1000,
    cores  = parallel::detectCores())
saveRDS(m3, file = "m3.rds")

In model m1, we used all five predictors to estimate the count of domains with MX records pointing to mail servers located in foreign countries and the probability that no mail servers are located in the respective countries. However, as we observed severe issues with multicollinearity (when predictors correlate with each other highly, the resulting model parameters become unreliable), we do not discuss model m1 (for the code and results, see the Appendix), and focus on modeling each predictor separately.

For the sake of brevity, we also refer the reader to the Appendix for the results of models m2 (geographic distance) and m3 (population size). While they found associations with the count of domains and their zero-inflation, they performed less efficiently than other models (see below) and found weaker effects than model m4 (GDP per capita).

Model 4: GDP per capita

Code

m4 <-
  brm(bf(mx ~ log(gdp + 1),
         zi ~ log(gdp + 1)),
    family = zero_inflated_negbinomial(),
    data   = df_cc2,
    prior  = c(prior(normal(0,2), class = Intercept),
               prior(normal(0,2), class = b)),
    chains = 4,
    iter   = 4000,
    warmup = 1000,
    cores  = parallel::detectCores())
saveRDS(m4, file = "m4.rds")

Code

summary(m4)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(gdp + 1) 
         zi ~ log(gdp + 1)
   Data: df_cc2 (Number of observations: 188) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Regression Coefficients:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      -16.15      2.41   -20.65   -11.13 1.00     6951     6444
zi_Intercept    15.49      5.52     6.00    27.51 1.00     6252     4895
loggdpP1         2.41      0.25     1.89     2.89 1.00     7249     7126
zi_loggdpP1     -1.99      0.69    -3.53    -0.86 1.00     5907     4392

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.11      0.02     0.08     0.15 1.00     7377     6768

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

How to read the results?

The results summary table above presents a great deal of information. However, only a part of the section titled Population-Level Effects is usually of prime interest.

First, the reader should focus on the columns Estimate (mean of the posterior distribution for respective predictors), l-95% CI (lower bound of the 95% credible interval), and u-95% CI (upper bound of the 95% credible interval). As noted above, these values help us interpret the parameters’ posterior distributions.

The mean estimate describes the central tendency of the parameter and can be interpreted as the value where the association between the predictor and the response variable probably lies. Then, it represents how much the values of the response change depending on the values of the predictor. Note that this value is not a single point-estimate, but a mean of many possible regression lines that the model fitted on the data.
Furthermore, we are interested in the 95% credible intervals (95% CIs) which capture where 95% of parameters’ posterior distribution lie. CIs are described by their lower (l-95% CI) and upper (u-95% CI) bound (i.e., 95% of the parameter’s distribution lie somewhere between its lower and upper bound). When looking at credible intervals, we are interested whether they contain zero—if they do, we can conclude that there is no relationship between the variables; if they don’t, there possibly is a relationship. For example, a parameter with mean M = 2.41 and 95% credible interval between 1.90 and 2.88 does not contain zero and therefore suggests a positive relationship between the response and predictor variables. However, a parameter with mean M = 2.41 and 95% credible interval between -1.90 and 2.88 contains zero, suggesting a null effect.

Then, in the same section (Population-Level Effects), we are mostly interested in rows describing the parameters for the investigated predictors (here labelled as loggdpP1 and zi_loggdpP1). These two describe the respective relationships between the response and predictor variables. For loggdpP1, we see that the effect is positive as the 95% CI lies above zero, suggesting that GDP per capita increases the count of domains. For zi_loggdpP1 (the zi stand for zero-inflation), we observe a negative effect as the 95% CI lies below zero, suggesting that GDP per capita decreases the probability of observing zero domains with MX records.

However, the reported parameter values are not intuitively interpretable as the predictors have been log-transformed in the formula of the model, the model utilized a log link function to map the predictor values on the response values, and a logit link function to map the predictor values on the zero-inflation probability. In the case of predictors for the count of domains, we can interpret the results as a percent increase. For example, in model m4, 1% increase in GDP per capita associates with 2.4% increase in the count of domains with MX records. For any x percent increase, one has to calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. For example, a 30% increase in the GDP per capita results in 87.7% increase in the count of domains ((1.30^2.40 - 1) * 100 = 87.7).

Code

m4_plot <- 
  plot(
    conditional_effects(m4,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylim(0,1000000) +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  xlab("GDP per capita (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m4_plot_zi <- 
  plot(
    conditional_effects(m4,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("Zero-inflation probability") +
  xlab("GDP per capita (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m4_plot_zi
m4_plot

How to read the charts?

Naturally, charts provide an important visual aid for understanding the relationships between the predictor and response variables. In a nutshell, the (blue) regression lines help us understand how the values of the response variable on the y-axis change depending on the values of the predictor on the x-axis (i.e., what is their association).

Note that for the domain count (right subplot), both variables are on their original scale. For the zero-inflation (left subplot), the GDP per capita is on its original scale, while the values for the probability of zero-inflation are mapped between 0 and 1, using a logit link.

In the domain count chart (right), we can also inspect a scatter of the countries in which the mail servers are set up. Here, each dot represents one country positioned accordingly to the country’s GDP per capita and the respective count of domains with MX records. Note that not all countries are represented in these charts, as the y-axis has been cut to focus on the regression lines.

Crucially, as we used Bayesian regression analysis, we have not obtained only one regression line, which would be fitted to the data-points, but a whole distribution of lines. Therefore, we may observe a number of lines that have been sampled from the posterior distribution. This sample helps to visualize the distribution of a given parameter—lines get more dense closer to the middle (where the mean is represented by one thick white line). Furthermore, the sample is also bounded by the 95% CI. Together, they illustrate both the relationship between the variables and the model’s uncertainty about this relationship.

Model m4 estimated that GDP per capita associates negatively with the zero-inflation of domains using MX records and positively with their count. Note that the left subplot presents the zero-inflation (descending lines suggest lower probability of observing zero domains using MX records) and the right subplot the count of domains (rising lines suggest higher counts of domains using MX records).

Model 5: Czech export

Code

m5 <-
  brm(bf(mx ~ log(export + 1),
         zi ~ log(export + 1)),
    family = zero_inflated_negbinomial(),
    data   = df_cc2,
    prior  = c(prior(normal(0,2), class = Intercept),
               prior(normal(0,2), class = b)),
    chains = 4,
    iter   = 4000,
    warmup = 1000,
    cores  = parallel::detectCores())
saveRDS(m5, file = "m5.rds")

Code

summary(m5)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(export + 1) 
         zi ~ log(export + 1)
   Data: df_cc2 (Number of observations: 188) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Regression Coefficients:
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -4.02      1.46    -6.71    -0.95 1.00     7745     6234
zi_Intercept      22.21      4.45    14.65    32.04 1.00    10723     7010
logexportP1        0.59      0.07     0.44     0.74 1.00     8162     6406
zi_logexportP1    -1.31      0.26    -1.90    -0.86 1.00    10936     7037

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.16      0.02     0.12     0.21 1.00    10282     8845

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m5_plot <- 
  plot(
    conditional_effects(m5,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylim(0,180000) +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  xlab("Czech export (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m5_plot_zi <- 
  plot(
    conditional_effects(m5,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("Zero-inflation probability") +
  xlab("Czech export (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m5_plot_zi
m5_plot

Model m5 estimated that Czech export associates negatively with the zero-inflation of domains using MX records and positively with their count.

Model 6: Foreign import

Code

m6 <-
  brm(bf(mx ~ log(import + 1),
         zi ~ log(import + 1)),
    family = zero_inflated_negbinomial(),
    data   = df_cc2,
    prior  = c(prior(normal(0,2), class = Intercept),
               prior(normal(0,2), class = b)),
    chains = 4,
    iter   = 4000,
    warmup = 1000,
    cores  = parallel::detectCores())
saveRDS(m6, file = "m6.rds")

Code

summary(m6)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(import + 1) 
         zi ~ log(import + 1)
   Data: df_cc2 (Number of observations: 188) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Regression Coefficients:
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -1.28      1.16    -3.34     1.18 1.00     8022     7237
zi_Intercept      12.84      2.45     8.74    18.38 1.00    10260     6305
logimportP1        0.47      0.06     0.35     0.58 1.00     8994     8070
zi_logimportP1    -0.81      0.16    -1.17    -0.56 1.00     9756     5743

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.15      0.02     0.11     0.20 1.00    10076     8223

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m6_plot <- 
  plot(
    conditional_effects(m6,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  xlab("Foreign import from CZ (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m6_plot_zi <- 
  plot(
    conditional_effects(m6,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("Zero-inflation probability") +
  xlab("Foreign import from CZ (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m6_plot_zi
m6_plot

Model m6 estimated that foreign import from the Czech Republic associates negatively with the zero-inflation of domains using MX records and positively with their count.

Model performance comparison

Code

comp1 <- compare_performance(m2, m3, m4, m5, m6,
                             metrics = c("LOOIC", "WAIC"))
saveRDS(comp1, file = "comp1.rds")

Code

comp1 <- readRDS(file = "comp1.rds")
comp1

# Comparison of Model Performance Indices

Name |   Model |     ELPD | ELPD_SE | LOOIC (weights) | LOOIC_SE | WAIC (weights)
---------------------------------------------------------------------------------
m2   | brmsfit | -704.286 |  60.981 |  1408.6 (<.001) |  121.962 | 1407.8 (<.001)
m3   | brmsfit | -709.622 |  60.623 |  1419.2 (<.001) |  121.246 | 1418.9 (<.001)
m4   | brmsfit | -670.409 |  59.168 |  1340.8 (0.172) |  118.335 | 1339.7 (<.001)
m5   | brmsfit | -650.639 |  59.979 |  1301.3 (0.828) |  119.957 | 1299.7 (>.999)
m6   | brmsfit | -660.227 |  60.028 |  1320.5 (<.001) |  120.057 | 1317.2 (<.001)

How to read the performance comparisons?

ELPD: Expected log predictive density. Larger ELPD values mean better fit.

LOOIC: Leave-one-out cross-validation (LOO) information criterion. Lower LOOIC values mean better fit.

WAIC: Widely applicable information criterion. Lower WAIC values mean better fit.

When compared, model m5 (using the export predictor, logexportP1) seems to be the best fit for the data. The model m6 (using the import predictor, logimportP1) ranked second, followed by the model m4 (using the GDP per capita predictor, loggdpP1). Models m2 (distance, logdistancesP1) and m3 (population, logpopulationP1) fared worse than models m4, m5, and m6.

Discussion

We investigated what factors might motivate domain holders to set up mail servers outside of the Czech Republic. We explored the influence of (1) foreign countries’ geographic distance from the Czech Republic, (2) the foreign countries’ population size, (3) the foreign countries’ GDP per capita, (4) Czech Republic’s export to the respective foreign countries, and (5) foreign countries’ import from the Czech Republic. We found that all of these predictors are associated with the counts of domains that use MX records pointing to foreign mail servers, however, due to multicollinearity issues, we were not able to estimate them within one model.

As our models predicted both the count of domains and also their zero-inflation (in many foreign countries, there are no mail servers for CZ domains), we found that nearly all of the predictors had complementary associations with both of these processes. For example, as foreign countries’ GDP per capita increased, the count of domains using mail servers abroad increased, while the probability of operating zero mail servers from abroad decreased. In other words, economically stronger countries had lower chances of having no mail servers outsourced from the Czech Republic and also higher count of such mail servers. This complementary trend applies to all of the predictors.

Interestingly, GDP per capita found the strongest associations with the count of domains that use foreign mail server services and their zero-inflation; however, models using the export and import values provided a better fit for the data.

Knowing that these modeling results are correlational associations and cannot suggest causal links, we can summarize that domain holders’ interest in foreign mail server services:

Decreases with rising geographic distance from the Czech Republic.
Increases with rising population size of foreign countries (However, the positive association with the count of domains that use foreign mail servers is not certain.).
Increases with rising GDP per capita of foreign countries.
Increases with rising value of Czech export to foreign countries.
Increases with rising value of foreign import from the Czech Republic.

Still, we may speculate about the causal processes behind the associations observed for export and import values. As we noted above, in comparison to the “passive” proxies of geographic distance, population size, and GDP per capita, export and import values reflect “active”, intentionally directed channels in human networks. Then, intense trading between Czech and foreign companies may motivate these actors to increasingly spend resources on developing larger mail server infrastructure. However, to establish these causal effects, further analysis using time series would be needed (given that natural variation in export and import values would provide sufficient data for such a natural experiment).

Autonomous systems perspective

In this section, we explored domains using MX records in respect to autonomous systems, using data provided by CZ.NIC’s DNS crawler endpoint /crawler_mail_asn. To provide further detail, we also paired these data with public information on autonomous systems holders and their respective domicile.

Code

df_asn <- fromJSON(api_endpoint("/crawler_mail_asn"))
df_asn <- df_asn |>
  mutate(ts = as.Date(ts)) |>
  group_by(asn) |>
  filter(ts <= "2023-06-15") |>
  filter(ts == max(ts)) |>
  ungroup() |>
  rename("mx" = "domains")

asn <-
  read_delim("CSV/asn.txt",
             delim         = "_", #intentionally incorrect
             escape_double = FALSE,
             trim_ws       = TRUE)
asn <- asn |>
  rename(col = "0 -Reserved AS-, ZZ") |>
  mutate(col = sub(" ", "___", col))
asn$col <- sub(",([^,]*)$", " 9199999\\1", asn$col)
asn <- asn |>
  separate_wider_delim(
    col,
    delim    = "___",
    names    = c("asn", "col")
    )
asn <- asn |>
  separate_wider_delim(
    col,
    delim = " 9199999 ",
    names = c("holder", "country"),
    too_few = "debug"
    ) |>
  select(asn, holder, country)

df_asn <- left_join(df_asn, asn, by = "asn") |>
  rename("cc" = "country") |>
  mutate(cc_iso = countrycode(cc,
                              origin = "iso2c",
                              destination = "iso3c"))

df_asn <- left_join(df_asn, country_flags, by = "cc")

Distribution of domains with MX records

Table 3 presents the distribution of domains using MX records in all autonomous systems with at least one domain with MX record. The first row shows statistics for all autonomous systems operated by companies with domicile in the Czech Republic, while the second row specifies values for all autonomous systems with foreign holders.

Code

cz <- df_asn |>
  filter(cc == "CZ") |>
  summarize(Subset  = "CZ",
            Min     = min(mx),
            Mean    = mean(mx),
            Median  = median(mx),
            Max     = max(mx),
            Total   = sum(mx),
            Percent = paste0(
                        round(sum(mx) /
                              sum(df_asn$mx) *
                              100, 2),
                        "%")
  )
foreign <- df_asn |>
  filter(cc != "CZ") |>
  summarize(Subset  = "Foreign",
            Min     = min(mx),
            Mean    = mean(mx),
            Median  = median(mx),
            Max     = max(mx),
            Total   = sum(mx),
            Percent = paste0(
                        round(sum(mx) /
                              sum(df_asn$mx) *
                              100, 2),
                        "%")
    )

descriptives_table <- bind_rows(cz, foreign)
descriptives_table |>
  kable("html",
        digits = 2,
        format.args = list(big.mark = " ")) |>
  kable_styling(
    bootstrap_options = c("striped",
                          "hover",
                          "condensed",
                          "responsive"),
    full_width = TRUE,
    position   = "left")

Table 3: Distribution of domains with MX records across autonomous systems.
Subset	Min	Mean	Median	Max	Total	Percent
CZ	1	2 557.87	39	215 911	1 222 660	75.26%
Foreign	1	126.89	2	58 605	365 050	22.47%

Next, to illustrate the absolute count of domains using MX records across autonomous systems, we filtered autonomous systems with more than 1000 domains with MX records and plotted a barchart (see Figure 7). Note that red-colored bars represent autonomous systems with domicile in the Czech Republic, while the blue-colored bars all autonomous systems with domicile outside of the Czech Republic (with one teal-colored bar for unknown autonomous systems). Similarly as above, the chart illustrates the unsurprising dominance of autonomous systems with domicile in the Czech Republic.

Figure 7: Domains with MX records by ASN (>1000).

A longer list of autonomous systems with more than 500 domains using MX records can be inspected in Table 5 in the Appendix. To provide a separate insight into the representation of foreign domains using MX records, Table 5 also presents a column Domains' percentage (no CZ) which specifies the percentage of all domains with MX records excluding the autonomous systems with domicile within the Czech Republic. Similarly, the table also presents a column Domains' percentage (CZ only) specifying the percentage of all domains using MX records with domicile within the Czech Republic only.

Modeling mail servers outside of the Czech Republic, part 2

As in the country-focused section, we tested the associations between the count of domains with MX records pointing to mail servers outside of the Czech Republic and the same set of predictors. However, the fact that autonomous systems often span across the borders of many countries makes the assumption of domains with mail servers outside of the Czech Republic somewhat complicated. As a workaround, we used the autonomous systems’ holders domicile as a key by which we paired the counts of domains with MX records to the predictor values. While problematic—the location of autonomous systems’ holders domicile may easily differ from the actual geolocations of mail servers’ IP addresses—, this approach enabled us to use the country-specific predictors to model the count of domains.

Code

df_asn2 <- df_asn |>
  select(asn, mx, holder, cc_iso)
df_asn2 <- full_join(df_asn2, world,
                     by           = "cc_iso",
                     multiple     = "all",
                     relationship = "many-to-many",
                     na_matches   = "na")
df_asn2 <- full_join(df_asn2, pop,
                     by           = "cc_iso",
                     multiple     = "all",
                     relationship = "many-to-many",
                     na_matches   = "na")
df_asn2 <- full_join(df_asn2, gdp,
                     by           = "cc_iso",
                     multiple     = "all",
                     relationship = "many-to-many",
                     na_matches   = "na")
df_asn2 <- full_join(df_asn2, export,
                     by           = "cc_iso",
                     multiple     = "all",
                     relationship = "many-to-many",
                     na_matches   = "na")
df_asn2 <- full_join(df_asn2, import,
                     by           = "cc_iso",
                     multiple     = "all",
                     relationship = "many-to-many",
                     na_matches   = "na")
df_asn2 <- df_asn2 |>
  select(-iso_a3) |>
  mutate(mx = replace_na(mx, 0),
         export  = replace_na(export, 0)) |>
  filter(cc_iso != "CZE") |>
  mutate(population = replace(population,
                              cc_iso == "TWN",
                              23859912),
         gdp        = replace(gdp,
                              cc_iso == "TWN",
                              33186.34),
         gdp        = replace(gdp,
                              cc_iso == "VEN",
                              2071.6),
         gdp        = replace(gdp,
                              cc_iso == "KWT",
                              28884.3),
         gdp        = replace(gdp,
                              cc_iso == "TKM",
                              10111),
         gdp        = replace(gdp,
                              cc_iso == "YEM",
                              549.49),
         gdp        = replace(gdp,
                              cc_iso == "CUB",
                              7291),
         gdp        = replace(gdp,
                              cc_iso == "GRL",
                              54571),
         gdp        = replace(gdp,
                              cc_iso == "ERI",
                              614),
         gdp        = replace(gdp,
                              cc_iso == "PRK",
                              654),
         distances  = replace(distances,
                              cc_iso == "SYC",
                              7165900),
         distances  = replace(distances,
                              cc_iso == "HKG",
                              8750280),
         distances  = replace(distances,
                              cc_iso == "AND",
                              1330730),
         name       = replace(name,
                              cc_iso == "HKG",
                              "Hong Kong"),
         name       = replace(name,
                              cc_iso == "SYC",
                              "Seychelles")) |>
  drop_units() |>
  mutate(distances = distances/1000) |>
  mutate(distance_log   = log(distances + 1),
         population_log = log(population + 1),
         gdp_log        = log(gdp + 1),
         export_log     = log(export + 1),
         import_log     = log(import + 1),
         cc_iso         = as.factor(cc_iso)) |>
  drop_na(c(distance_log,
            population_log,
            gdp_log,
            export_log,
            import_log))

Results

Model 7: All predictors

Initially, we ran model m7 (see Appendix) using the negative binomial distribution to predict the count of domains using all five predictors with varying intercepts for the domicile of the autonomous systems’ holders. However, the posterior predictive check revealed that the model struggled to capture the inflation of autonomous systems with only one domain and simultaneously overestimated the count of zeros (see Figure 20 in the Appendix). Therefore, we opted for the use of zero-inflated negative binomial distribution to predict the count of domains in another model m8. However, as the inflation was not observed for zeros but for ones, we discarded zero observations from the dataset (there were 110 observations out of a total of 2898) and subtracted 1 from the count of domains, which enabled us to model the one-inflation of the domains’ count and obtain a better fit for the data. Furthermore, high multicollinearity was detected for the export predictor (and medium multicollinearity values for distance, population, and import predictors); therefore, we omitted the export and also import (as it was highly corelated with the export) predictors in model m8, and modeled them separately in m9 and m10.

What is posterior predictive check?

A posterior predictive check is a method for comparing the values that the model predicts and the actual values upon which the model was estimated. The goal is to inspect whether the fitted model adequately describes the observed data. Put simply, if the data predicted by the posterior predictive check deviate from the actual observed data, we should worry and try to find a better approach for modeling the relationships.

Model 8: Distance, population, and GDP per capita

Code

df_asn3 <- df_asn2 |>
  mutate(mx = mx - 1) |>
  filter(mx>=0)
m8 <- brm(
  bf(mx ~
       log(distances + 1) +
       log(population + 1) +
       log(gdp + 1) +
       (1|cc_iso),
     zi ~ 
       log(distances + 1) +
       log(population + 1) +
       log(gdp + 1) +
       (1|cc_iso)),
  family  = zero_inflated_negbinomial(),
  data    = df_asn3,
  prior   = c(prior(normal(0,2), class = Intercept),
              prior(normal(0,2), class = b)),
  chains  = 4,
  iter    = 4000,
  warmup  = 1000,
  cores   = parallel::detectCores(),
  backend = "cmdstanr",
  threads = threading(2),
  control = list(adapt_delta = 0.99))
saveRDS(m8, file = "m8.rds")

Code

summary(m8)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(distances + 1) + log(population + 1) + log(gdp + 1) + (1 | cc_iso) 
         zi ~ log(distances + 1) + log(population + 1) + log(gdp + 1) + (1 | cc_iso)
   Data: df_asn3 (Number of observations: 2860) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Multilevel Hyperparameters:
~cc_iso (Number of levels: 79) 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)        1.44      0.18     1.13     1.83 1.00     3438     6048
sd(zi_Intercept)     1.09      0.95     0.04     3.55 1.00     4592     4870

Regression Coefficients:
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept             -9.94      3.72   -17.25    -2.43 1.00     3144     5441
zi_Intercept         -65.81     31.89  -143.80   -21.43 1.00     3996     2776
logdistancesP1         0.20      0.20    -0.19     0.59 1.00     2653     4703
logpopulationP1        0.09      0.13    -0.16     0.34 1.00     3175     5477
loggdpP1               0.95      0.23     0.49     1.41 1.00     2985     5593
zi_logdistancesP1      1.26      1.19    -1.30     3.39 1.00     3667     3168
zi_logpopulationP1     0.75      0.95    -0.33     3.21 1.00     3484     2541
zi_loggdpP1            3.38      2.31     0.11     8.92 1.00     5275     3869

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.14      0.00     0.13     0.15 1.00    11450     7302

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m8_plot_1 <- 
  plot(
    conditional_effects(m8,
                        effect    = "distances",
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  ylim(0,500) +
  xlab("Geographic distance (km)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m8_plot_2 <- 
  plot(
    conditional_effects(m8,
                        effect    = "population",
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  ylim(0,500) +
  xlab("Population size") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m8_plot_3 <- 
  plot(
    conditional_effects(m8,
                        effect    = "gdp",
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  ylim(0,500) +
  xlab("GDP per capita (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))


m8_plot_1_zi <- 
  plot(
    conditional_effects(m8,
                        effect    = "distances",
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("One-inflation probability") +
  xlab("Geographic distance (km)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m8_plot_2_zi <- 
  plot(
    conditional_effects(m8,
                        effect    = "population",
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("One-inflation probability") +
  xlab("Population size") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m8_plot_3_zi <- 
  plot(
    conditional_effects(m8,
                        effect    = "gdp",
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("One-inflation probability") +
  xlab("GDP per capita (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m8_plot_1_zi
m8_plot_1
m8_plot_2_zi
m8_plot_2
m8_plot_3_zi
m8_plot_3

(a) Geographic distance (one-inflation).

Model m8 estimated that GDP per capita positively associates with the count of domains and—strangely—also with their “zero”-inflation (which, in fact, was a recoded one-inflation). Other predictors estimated null associations.

The relationships between the predictors and the count of domains are plotted in Figure 8 above. Note that the left column plots the one-inflation (rising lines suggest higher probability of observing one domain using MX records) and the right column the count of domains (rising lines suggest higher counts of domains using MX records, for all values above one).

Model 9: Czech export

Code

m9 <- brm(
  bf(mx ~ log(export + 1) + (1|cc_iso),
     zi ~ log(export + 1) + (1|cc_iso)),
  family  = zero_inflated_negbinomial(),
  data    = df_asn3,
  prior   = c(prior(normal(0,2), class = Intercept),
              prior(normal(0,2), class = b)),
  chains  = 4,
  iter    = 4000,
  warmup  = 1000,
  cores   = parallel::detectCores(),
  backend = "cmdstanr",
  threads = threading(2),
  control = list(adapt_delta = 0.999))
saveRDS(m9, file = "m9.rds")

Code

summary(m9)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(export + 1) + (1 | cc_iso) 
         zi ~ log(export + 1) + (1 | cc_iso)
   Data: df_asn3 (Number of observations: 2860) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Multilevel Hyperparameters:
~cc_iso (Number of levels: 79) 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)        1.58      0.19     1.24     2.00 1.00     2705     5662
sd(zi_Intercept)     2.03      1.17     0.13     4.68 1.00     1525     2819

Regression Coefficients:
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -2.05      2.38    -6.73     2.69 1.00     2841     4948
zi_Intercept     -18.28     19.56   -67.18     7.09 1.00     4848     3907
logexportP1        0.23      0.11     0.00     0.45 1.00     2596     4667
zi_logexportP1     0.50      0.85    -0.64     2.59 1.00     4657     3836

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.14      0.00     0.13     0.15 1.00     2543     6299

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m9_plot <- 
  plot(
    conditional_effects(m9,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  ylim(0,500) +
  xlab("Czech export (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m9_plot_zi <- 
  plot(
    conditional_effects(m9,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("One-inflation probability") +
  xlab("Czech export (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m9_plot_zi
m9_plot

Model m9 estimated a positive relationship between the Czech export and the count of domains and a null relationship with their one-inflation.

Model 10: Foreign import

Code

m10 <- brm(
  bf(mx ~ log(import + 1) + (1|cc_iso),
     zi ~ log(import + 1) + (1|cc_iso)),
  family  = zero_inflated_negbinomial(),
  data    = df_asn3,
  prior   = c(prior(normal(0,2), class = Intercept),
              prior(normal(0,2), class = b)),
  chains  = 4,
  iter    = 4000,
  warmup  = 1000,
  cores   = parallel::detectCores(),
  backend = "cmdstanr",
  threads = threading(2),
  control = list(adapt_delta = 0.999))
saveRDS(m10, file = "m10.rds")

Code

summary(m10)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(import + 1) + (1 | cc_iso) 
         zi ~ log(import + 1) + (1 | cc_iso)
   Data: df_asn3 (Number of observations: 2860) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Multilevel Hyperparameters:
~cc_iso (Number of levels: 79) 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)        1.61      0.20     1.26     2.04 1.00     3035     5462
sd(zi_Intercept)     1.96      1.15     0.12     4.53 1.00     2096     4619

Regression Coefficients:
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept          0.01      2.01    -3.96     4.03 1.00     3609     5885
zi_Intercept     -16.09     15.98   -57.07     3.92 1.00     6319     4758
logimportP1        0.13      0.10    -0.06     0.32 1.00     3143     5742
zi_logimportP1     0.41      0.70    -0.51     2.20 1.00     6152     4762

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.13      0.00     0.13     0.14 1.00     2986     7052

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m10_plot <- 
  plot(
    conditional_effects(m10,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  ylim(0,500) +
  xlab("Foreign import from CZ (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m10_plot_zi <- 
  plot(
    conditional_effects(m10,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("One-inflation probability") +
  xlab("Foreign import from CZ (USD)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))

m10_plot_zi
m10_plot

Model m10 estimated null relationship between the foreign import from the Czech Republic and the count of domains and also their one-inflation.

Model performance comparison

Code

comp2 <- compare_performance(m8, m9, m10,
                             metrics = c("LOOIC", "WAIC"))
saveRDS(comp2, file = "comp2.rds")

Code

comp2 <- readRDS(file = "comp2.rds")
comp2

# Comparison of Model Performance Indices

Name |   Model |      ELPD | ELPD_SE | LOOIC (weights) | LOOIC_SE |  WAIC (weights)
-----------------------------------------------------------------------------------
m8   | brmsfit | -8812.221 | 154.646 | 17624.4 (0.971) |  309.291 | 17614.6 (0.993)
m9   | brmsfit | -8818.597 | 154.450 | 17637.2 (0.029) |  308.900 | 17625.2 (0.005)
m10  | brmsfit | -8819.024 | 154.894 | 17638.0 (<.001) |  309.789 | 17627.8 (0.001)

The models performed similarly well on the data, although m8 fared the best.

Discussion

Similarly as above, we investigated what might motivate domain holders to set up mail servers “outside” of the Czech Republic. However, in comparison to the models reported in the country-focused section (where the count of domains with MX records was aggregated by the respective countries where the mail server services were located), data used in this analysis aggregated the count of domains with MX records by their respective autonomous systems. Furthermore, the predictor variables used in this analysis were paired to the domain counts by the domicile of the holders that operate the autonomous systems.

We found that foreign countries’ GDP per capita and the value of Czech export were associated with the count of domains with MX records pointing to mail servers operated by autonomous systems’ holders with domicile outside of the Czech Republic. Crucially, as the models focusing on autonomous systems did not observe severe multicollinearity issues as models focusing on countries, they provided us with better control over the investigated predictors. Then, we observed that while GDP per capita remained positively associated with the count of domains, the associations with geographic distance and population size disappeared. This difference from previous models seems to make sense as the geographic distance and population size predictors were not paired by the actual geolocation of autonomous systems but by the domicile of the autonomous systems’ holders, making them disconnected from the count of domains in these data.

Although the export and import values had to modeled separately due to multicollinearity issues, we found a positive relationship between the value of Czech export and the count of domains. Therefore, we may suggest that domain holder’s tendency to set up their mail servers “abroad” is positively associated only with economic proxies—at least when using data where the count of domains is aggregated by autonomous systems and the predictors are paired by the domicile of the autonomous systems’ holders.

Modeling domains with MX records among Czech companies

To complement the focus on domains with mail server services set up outside of the Czech Republic, we also modeled the variations in the count of domains within autonomous systems for which their holders have domicile in the Czech Republic. Here, we focused on modeling the relationship between the count of domains with MX records and the financial turnover of autonomous systems’ holders.

To do so, we surveyed the public register—searching for financial statements of respective companies—and created a dataset with values for their turnover. As there were 475 autonomous systems operated by Czech companies, we set an arbitrary limit where we collected turnover data only for companies with more than 100 domains with MX records within their autonomous systems. While we were not able to find turnover values for all of these companies (as the public register does not always provide such data), our final dataset contained turnover values for 117 Czech companies (out of 155 companies with more than 100 domains; ignoring 320 companies with less than 100 domains). The resulting dataset accounted for 975 267 domains with MX records (79.77% of all domains within autonomous systems operated by Czech holders), ignoring 247 321 domains (20.2%) out of which 8 566 domains were operated in autonomous systems with less than 100 domains. Note that such imperfections may bias the results below.

Code

df_companies <- read.csv("CSV/COMPANIES.csv", sep = ",")
df_asn_cz <- df_asn |>
  filter(cc == "CZ")
df_asn_cz |> summarize(sum(mx))
df_asn_cz |> filter(mx>=100) |> summarize(sum(mx))
df_asn_cz |> filter(mx<=100) |> summarize(sum(mx))

df_asn_cz <- left_join(df_companies, df_asn_cz, by = "holder")
df_asn_cz <- df_asn_cz |>
  select(holder, turnover, year, asn, mx) |>
  mutate(holder = sub(" ", "|", holder)) |>
  separate_wider_delim(
    holder,
    delim = "|",
    names = c("AS", "holder")
    )
df_asn_cz |> summarize(sum(mx))

Model 11

To model the relationship between the count of domains with MX records pointing to mail servers operated by holders with domicile in the Czech Republic and the their financial turnover, we used the negative binomial distribution and log-transformed the turnover values. Furthermore, as some companies operated more than one autonomous system, we also specified a varying intercept for the holders of autonomous systems to account for such nested variations.

Code

m11 <-
  brm(mx ~ log(turnover) + (1|holder),
      family = negbinomial(),
      data   = df_asn_cz,
      chains = 4,
      iter   = 4000,
      warmup = 1000)
saveRDS(m11, file = "m11.rds")

Code

summary(m11)

 Family: negbinomial 
  Links: mu = log; shape = identity 
Formula: mx ~ log(turnover) + (1 | holder) 
   Data: df_asn_cz (Number of observations: 117) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Multilevel Hyperparameters:
~holder (Number of levels: 101) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.57      0.16     1.29     1.90 1.00     4600     7368

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept       1.72      1.76    -1.70     5.21 1.00     5639     7696
logturnover     0.30      0.10     0.11     0.49 1.00     5497     7494

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.88      0.15     0.62     1.20 1.00     4903     6997

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m11_plot <- 
  plot(
    conditional_effects(m11,
                        ndraws    = 100,
                        spaghetti = TRUE),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  ylim(0,40000) +
  xlab("CZ companies' turnover (CZK)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale()))
m11_plot

Figure 11: Associations between CZ companies’ turnover and domains with MX records.

Model m11 estimated a positive association between the count of domains and the financial turnover of the Czech companies that operate the respective autonomous systems. The relationship is plotted above in Figure 11.

General discussion

In this ADAM report, we explored the distribution of MX records of all second-level domains under CZ across the world, looking into datasets which aggregated the count of domains with MX records by countries and by autonomous systems. Unsurprisingly, we observed that most domains using MX records use mail servers located within the Czech Republic.

We also explored what lies behind the domain holders’ tendency for setting up their mail servers outside of the Czech Republic. When using data which aggregated the count of domains by countries in which the IP addresses of mail servers were geolocated, geographic distance from the Czech Republic, foreign countries’ population size, foreign countries’ GDP per capita, Czech export value, and foreign import value from the Czech Republic all associated with the count of domains.

However, when we switched the perspective and used data where the count of domains was aggregated by autonomous systems, only GDP per capita and the value of Czech export were estimated to associate with the count of domains. On the basis of observed effect sizes and model performance indices, we may summarize that GDP per capita and the value of Czech export best explain the variances in the domain holders’ tendency to set up their mail servers abroad.

Finally, we also modeled whether the count of domains with MX records within autonomous systems that are operated by companies with domicile in the Czech Republic are connected to the financial turnover of such companies. We found out that they are indeed positively associated.

Appendix

Models

Here, we present posterior predictive checks, multicollinearity checks, and model summaries of models skipped in the main text.

Model 1

Code

color_scheme_set("brightblue")
pp_check(m1, type = "bars", ndraws = 20)
pp_check(m1, type = "bars", ndraws = 20) +
  xlim(-1,20)

Code

check_collinearity(m1)

# Check for Multicollinearity

Moderate Correlation

         Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 log(gdp + 1) 5.63 [ 4.46,  7.19]         2.37      0.18     [0.14, 0.22]

High Correlation

                Term   VIF     VIF 95% CI Increased SE Tolerance
  log(distances + 1) 10.77 [ 8.41, 13.87]         3.28      0.09
 log(population + 1) 28.65 [22.17, 37.12]         5.35      0.03
     log(export + 1) 60.39 [46.58, 78.39]         7.77      0.02
     log(import + 1) 29.20 [22.59, 37.83]         5.40      0.03
 Tolerance 95% CI
     [0.07, 0.12]
     [0.03, 0.05]
     [0.01, 0.02]
     [0.03, 0.04]

Code

summary(m1)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(distances + 1) + log(population + 1) + log(gdp + 1) + log(export + 1) + log(import + 1) 
         zi ~ log(distances + 1) + log(population + 1) + log(gdp + 1) + log(export + 1) + log(import + 1)
   Data: df_cc2 (Number of observations: 188) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Regression Coefficients:
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            -28.10      5.40   -38.81   -17.65 1.00     5986     7296
zi_Intercept           1.97     45.83   -58.23   127.47 1.00     2406      679
logdistancesP1         1.70      0.62     0.50     2.90 1.00     3554     5334
logpopulationP1       -1.09      0.38    -1.84    -0.37 1.00     3392     5068
loggdpP1               0.71      0.49    -0.26     1.64 1.00     3615     4968
logexportP1            1.72      0.50     0.71     2.70 1.00     3803     4800
logimportP1           -0.09      0.28    -0.62     0.46 1.00     5556     6708
zi_logdistancesP1      0.27      5.32   -17.66     5.62 1.00     1130      418
zi_logpopulationP1     3.01      3.20     0.09    12.12 1.00      862      489
zi_loggdpP1            2.15      4.51    -1.31    15.44 1.00     1059      562
zi_logexportP1        -3.24      3.86   -15.15    -0.20 1.00      823      406
zi_logimportP1        -1.24      0.92    -3.46    -0.15 1.00     1564      780

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.20      0.03     0.15     0.27 1.00     2303     3129

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Model 2

Code

pp_check(m2, type = "bars", ndraws = 20)
pp_check(m2, type = "bars", ndraws = 20) +
  xlim(-1,20)

Code

summary(m2)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(distances + 1) 
         zi ~ log(distances + 1)
   Data: df_cc2 (Number of observations: 188) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Regression Coefficients:
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            12.58      1.87     9.22    16.48 1.00    10781     7353
zi_Intercept        -12.89      3.14   -19.92    -7.59 1.00     8591     4960
logdistancesP1       -0.54      0.23    -1.01    -0.13 1.00     9834     7468
zi_logdistancesP1     1.44      0.35     0.82     2.20 1.00     9251     5369

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.10      0.02     0.06     0.14 1.00     6430     7416

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m2_plot <- 
  plot(
    conditional_effects(m2,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  xlab("Geographic distance (km)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m2_plot_zi <- 
  plot(
    conditional_effects(m2,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("Zero-inflation probability") +
  xlab("Geographic distance (km)") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m2_plot_zi
m2_plot

(a) Geographic distance (zero-inflation).

Model m2 estimated that geographic distance associates positively with domains’ zero-inflation and negatively with the count of domains using MX records.

Model 3

Code

pp_check(m3, type = "bars", ndraws = 20)
pp_check(m3, type = "bars", ndraws = 20) +
  xlim(-1,20)

Code

summary(m3)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: mx ~ log(population + 1) 
         zi ~ log(population + 1)
   Data: df_cc2 (Number of observations: 188) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Regression Coefficients:
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              1.49      3.49    -5.15     8.42 1.00     6849     7259
zi_Intercept          10.61      5.82     0.16    23.09 1.00     5122     4766
logpopulationP1        0.40      0.22    -0.02     0.82 1.00     7076     7370
zi_logpopulationP1    -0.86      0.42    -1.80    -0.16 1.00     4752     4214

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.06      0.01     0.04     0.08 1.00     6114     6092

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

m3_plot <- 
  plot(
    conditional_effects(m3,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "mu"),
    points = TRUE,
    plot   = FALSE) [[1]] +
  ylab("Domain count") +
  scale_y_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  xlab("Population size") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m3_plot_zi <- 
  plot(
    conditional_effects(m3,
                        ndraws    = 100,
                        spaghetti = TRUE,
                        dpar      = "zi"),
    plot   = FALSE) [[1]] +
  ylim(0,1) +
  ylab("Zero-inflation probability") +
  xlab("Population size") +
  scale_x_continuous(labels = scales::label_number(scale_cut = cut_short_scale())) +
  theme(text = element_text(size = 20))
m3_plot_zi
m3_plot

Model m3 estimated that population size associates negatively with domains’ zero-inflation and positively with the count of domains using MX records (albeit the 95% credible interval contained zero, suggesting uncertainty over the existence of such effect).

Model 4

Code

pp_check(m4, type = "bars", ndraws = 20)
pp_check(m4, type = "bars", ndraws = 20) +
  xlim(-1,20)

Model 5

Code

pp_check(m5, type = "bars", ndraws = 20)
pp_check(m5, type = "bars", ndraws = 20) +
  xlim(-1,20)

Model 6

Code

pp_check(m6, type = "bars", ndraws = 20)
pp_check(m6, type = "bars", ndraws = 20) +
  xlim(-1,20)

Model 7

Code

m7 <-
  brm(
    mx ~ 
      distance_log +
      population_log +
      gdp_log +
      export_log +
      import_log +
      (1|cc_iso),
    family  = negbinomial(),
    data    = df_asn2,
    prior   = c(prior(normal(0,2), class = Intercept),
                prior(normal(0,2), class = b)),
    chains  = 4,
    iter    = 4000,
    warmup  = 1000,
    cores   = parallel::detectCores(),
    backend = "cmdstanr",
    threads = threading(2))
saveRDS(m7, file = "m7.rds")

Code

pp_check(m7, type = "bars", ndraws = 20)
pp_check(m7, type = "bars", ndraws = 20) +
  xlim(-1,20)

Code

check_collinearity(m7)

# Check for Multicollinearity

Low Correlation

         Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 distance_log 4.81 [ 4.53,  5.11]         2.19      0.21     [0.20, 0.22]
      gdp_log 3.82 [ 3.61,  4.05]         1.95      0.26     [0.25, 0.28]

Moderate Correlation

           Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 population_log 7.19 [ 6.75,  7.65]         2.68      0.14     [0.13, 0.15]
     import_log 5.87 [ 5.52,  6.24]         2.42      0.17     [0.16, 0.18]

High Correlation

       Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 export_log 13.73 [12.87, 14.65]         3.71      0.07     [0.07, 0.08]

Code

summary(m7)

 Family: negbinomial 
  Links: mu = log; shape = identity 
Formula: mx ~ distance_log + population_log + gdp_log + export_log + import_log + (1 | cc_iso) 
   Data: df_asn2 (Number of observations: 2970) 
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000

Multilevel Hyperparameters:
~cc_iso (Number of levels: 187) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.30      0.16     1.03     1.64 1.00     2350     4334

Regression Coefficients:
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        -20.83      2.93   -26.70   -15.20 1.00     3066     5128
distance_log       1.47      0.34     0.81     2.13 1.00     1591     3017
population_log    -0.89      0.24    -1.36    -0.42 1.00     1637     3183
gdp_log            0.03      0.28    -0.53     0.59 1.00     1731     3140
export_log         1.13      0.26     0.63     1.64 1.00     2020     4053
import_log         0.15      0.14    -0.12     0.43 1.00     3231     6047

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.27      0.01     0.26     0.28 1.00    15909     8435

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Model 8

Code

pp_check(m8, type = "bars", ndraws = 20)
pp_check(m8, type = "bars", ndraws = 20) +
  xlim(-1,20)

Code

check_collinearity(m8)

# Check for Multicollinearity

Low Correlation

                Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
  log(distances + 1) 1.10 [1.07, 1.16]         1.05      0.91     [0.87, 0.93]
 log(population + 1) 1.15 [1.11, 1.20]         1.07      0.87     [0.83, 0.90]
        log(gdp + 1) 1.10 [1.07, 1.15]         1.05      0.91     [0.87, 0.94]

Model 9

Code

pp_check(m9, type = "bars", ndraws = 20)
pp_check(m9, type = "bars", ndraws = 20) +
  xlim(-1,20)

Model 10

Code

pp_check(m10, type = "bars", ndraws = 20)
pp_check(m10, type = "bars", ndraws = 20) +
  xlim(-1,20)

Model 11

Code

pp_check(m11, type = "ecdf_overlay", ndraws = 20)

Figure 24: Model 11 diagnostics: Poterior predictive check.

Domains using MX records across countries

Code

df_cc <- df_cc |>
  mutate(perc_nocz = mx /
                     (sum(mx) - 1056497)
                     * 100) |>
  mutate(perc_nocz =
           if_else(perc_nocz > 100, NA, perc_nocz))
df_cc |>
  select(country_flag, cc, mx, perc, perc_nocz) |>
  arrange(desc(mx)) |>
  rename("Flag"                        = "country_flag",
         "Country"                     = "cc",
         "Domains' count"              = "mx",
         "Domains' percentage"         = "perc",
         "Domains' percentage (no CZ)" = "perc_nocz") |>
  kable("html",
        digits = 6,
        format.args = list(big.mark = " ")) |>
  kable_styling(
    bootstrap_options = c("striped",
                          "hover",
                          "condensed",
                          "responsive"),
    full_width = TRUE,
    position   = "left")

Table 4: Domains with MX records (complete country list).
Flag	Country	Domains' count	Domains' percentage	Domains' percentage (no CZ)
🇨🇿	CZ	1 056 497	62.959617	NA
🇸🇰	SK	112 651	6.713189	18.123972
🇺🇸	US	95 224	5.674665	15.320211
🇳🇱	NL	71 401	4.254986	11.487424
🇩🇪	DE	49 552	2.952943	7.972225
🇫🇮	FI	47 021	2.802113	7.565022
🇵🇱	PL	43 726	2.605755	7.034903
NA	Unknown	43 220	2.575601	6.953494
🇸🇬	SG	29 561	1.761623	4.755952
🇮🇪	IE	28 049	1.671519	4.512692
🇹🇼	TW	27 327	1.628493	4.396533
🇦🇹	AT	21 870	1.303295	3.518578
🇫🇷	FR	15 342	0.914273	2.468313
🇭🇰	HK	7 887	0.470008	1.268908
🇬🇧	GB	5 661	0.337355	0.910776
🇨🇭	CH	2 135	0.127231	0.343492
🇨🇳	CN	2 049	0.122106	0.329655
🇨🇦	CA	2 021	0.120437	0.325151
🇮🇹	IT	1 835	0.109353	0.295226
🇸🇨	SC	1 758	0.104764	0.282838
🇭🇺	HU	1 516	0.090343	0.243903
🇦🇺	AU	1 223	0.072882	0.196764
🇸🇪	SE	1 204	0.071750	0.193707
🇱🇺	LU	1 143	0.068115	0.183893
🇩🇰	DK	1 077	0.064181	0.173274
🇷🇺	RU	1 025	0.061083	0.164908
🇯🇵	JP	683	0.040702	0.109885
🇰🇷	KR	621	0.037007	0.099910
🇧🇪	BE	554	0.033014	0.089131
🇲🇾	MY	495	0.029498	0.079639
🇺🇦	UA	458	0.027294	0.073686
🇪🇸	ES	368	0.021930	0.059206
🇦🇪	AE	364	0.021692	0.058563
🇿🇦	ZA	269	0.016030	0.043278
🇹🇷	TR	266	0.015852	0.042796
🇱🇹	LT	242	0.014421	0.038934
🇮🇳	IN	218	0.012991	0.035073
🇪🇪	EE	196	0.011680	0.031534
🇸🇮	SI	185	0.011025	0.029764
🇧🇬	BG	171	0.010190	0.027512
🇷🇴	RO	155	0.009237	0.024937
🇳🇴	NO	122	0.007270	0.019628
🇻🇬	VG	97	0.005781	0.015606
🇲🇺	MU	92	0.005483	0.014802
🇰🇿	KZ	87	0.005185	0.013997
🇨🇾	CY	75	0.004469	0.012066
🇱🇻	LV	61	0.003635	0.009814
🇭🇷	HR	24	0.001430	0.003861
🇵🇹	PT	23	0.001371	0.003700
🇮🇱	IL	22	0.001311	0.003539
🇬🇪	GE	21	0.001251	0.003379
🇮🇲	IM	21	0.001251	0.003379
🇻🇳	VN	21	0.001251	0.003379
🇮🇸	IS	19	0.001132	0.003057
🇧🇾	BY	17	0.001013	0.002735
🇷🇸	RS	14	0.000834	0.002252
🇱🇮	LI	13	0.000775	0.002092
🇬🇷	GR	9	0.000536	0.001448
🇺🇿	UZ	8	0.000477	0.001287
🇺🇬	UG	8	0.000477	0.001287
🇳🇿	NZ	7	0.000417	0.001126
🇸🇦	SA	7	0.000417	0.001126
🇲🇩	MD	6	0.000358	0.000965
🇱🇧	LB	5	0.000298	0.000804
🇧🇿	BZ	5	0.000298	0.000804
🇮🇩	ID	4	0.000238	0.000644
🇬🇮	GI	4	0.000238	0.000644
🇵🇭	PH	4	0.000238	0.000644
🇦🇲	AM	3	0.000179	0.000483
🇩🇲	DM	3	0.000179	0.000483
🇧🇷	BR	3	0.000179	0.000483
🇸🇲	SM	2	0.000119	0.000322
🇲🇹	MT	2	0.000119	0.000322
🇵🇦	PA	2	0.000119	0.000322
🇮🇶	IQ	2	0.000119	0.000322
🇧🇦	BA	2	0.000119	0.000322
🇨🇷	CR	2	0.000119	0.000322
🇲🇰	MK	2	0.000119	0.000322
🇧🇩	BD	2	0.000119	0.000322
🇹🇭	TH	2	0.000119	0.000322
🇲🇳	MN	1	0.000060	0.000161
🇲🇽	MX	1	0.000060	0.000161
🇦🇱	AL	1	0.000060	0.000161
🇵🇰	PK	1	0.000060	0.000161
🇲🇪	ME	1	0.000060	0.000161
🇹🇳	TN	1	0.000060	0.000161
🇲🇨	MC	1	0.000060	0.000161
🇨🇱	CL	1	0.000060	0.000161
🇮🇷	IR	1	0.000060	0.000161
🇦🇷	AR	1	0.000060	0.000161
🇰🇼	KW	1	0.000060	0.000161
🇻🇨	VC	1	0.000060	0.000161

Domains using MX records across autonomous systems

Code

df_asn <- df_asn |>
  mutate(perc      = mx / sum(mx) * 100,
         perc_nocz = mx / (sum(mx) - 1222588) * 100,
         perc_cz   = mx / (sum(mx) - 364809) * 100) |>
  mutate(perc_nocz =
           if_else(cc == "CZ", NA, perc_nocz),
         perc_cz   =
           if_else(cc != "CZ", NA, perc_cz)) |>
  mutate(perc      = round(perc, 2),
         perc_nocz = round(perc_nocz, 2),
         perc_cz   = round(perc_cz, 2)) |>
  arrange(desc(mx))
df_asn |>
  mutate(holder       = replace_na(holder, "Unknown"),
         country_flag = replace(country_flag,
                                holder == "Unknown",
                                " ")) |>
  select(country_flag,
         cc,
         asn,
         holder,
         mx,
         perc,
         perc_nocz,
         perc_cz) |>
  arrange(desc(mx)) |>
  filter(mx > 500) |>
  rename("Flag"                 = "country_flag",
         "Country"              = "cc",
         "ASN"                  = "asn",
         "AS & holder"          = "holder",
         "Domains"              = "mx",
         "Percentage"           = "perc",
         "Percentage (no CZ)"   = "perc_nocz",
         "Percentage (CZ only)" = "perc_cz") |>
  kable("html",
        digits = 2,
        format.args = list(big.mark = " ")) |>
  kable_styling(
    bootstrap_options = c("striped",
                          "hover",
                          "condensed",
                          "responsive"),
    full_width = TRUE,
    position   = "left")

Table 5: Autonomous systems with >500 domains.
Flag	Country	ASN	AS & holder	Domains	Percentage	Percentage (no CZ)	Percentage (CZ only)
🇨🇿	CZ	197019	WEDOS WEDOS Internet, a.s.	215 911	13.29	NA	17.14
🇨🇿	CZ	15685	CASABLANCA-AS CASABLANCA INT a.s.	145 462	8.95	NA	11.55
🇨🇿	CZ	24806	INTERNET-CZ INTERNET CZ, a.s.	140 435	8.64	NA	11.15
🇨🇿	CZ	43541	VSHOSTING VSHosting s.r.o.	117 684	7.24	NA	9.34
🇨🇿	CZ	25234	GLOBE-AS ACTIVE 24, s.r.o.	97 072	5.98	NA	7.71
🇨🇿	CZ	29134	IGNUM-AS Webglobe, s.r.o.	91 286	5.62	NA	7.25
🇨🇿	CZ	24971	MASTER-AS Master Internet s.r.o.	70 591	4.35	NA	5.60
🇸🇰	SK	48689	WEBGLOBE-SK-AS Webglobe, a.s.	58 605	3.61	14.58	NA
🇺🇸	US	15169	GOOGLE	43 248	2.66	10.76	NA
🇺🇸	US	8075	MICROSOFT-CORP-MSN-AS-BLOCK	39 911	2.46	9.93	NA
🇸🇰	SK	51013	WEBSUPPORT-SRO-SK-AS WebSupport s.r.o.	37 964	2.34	9.44	NA
🇨🇿	CZ	43037	SEZNAM-CZ Seznam.cz, a.s.	36 811	2.27	NA	2.92
	NA	NA	Unknown	36 669	2.26	NA	NA
🇨🇿	CZ	34222	ZONER-AS ZONER a.s.	34 736	2.14	NA	2.76
🇨🇿	CZ	39392	SuperNetwork SH.cz s.r.o.	33 013	2.03	NA	2.62
🇨🇿	CZ	39790	WEB4U Web4U s.r.o.	28 826	1.77	NA	2.29
🇩🇪	DE	24940	HETZNER-AS Hetzner Online GmbH	26 846	1.65	6.68	NA
🇨🇿	CZ	29208	QUANTCOM-AS Quantcom, a.s.	26 143	1.61	NA	2.08
🇺🇸	US	14061	DIGITALOCEAN-ASN	20 620	1.27	5.13	NA
🇨🇿	CZ	60592	GRANSY Gransy s.r.o.	16 904	1.04	NA	1.34
🇨🇿	CZ	59925	GIGASERVER Seonet Multimedia s.r.o.	16 527	1.02	NA	1.31
🇺🇸	US	13335	CLOUDFLARENET	15 257	0.94	3.80	NA
🇨🇿	CZ	16019	VODAFONE-CZ-AS Vodafone Czech Republic a.s.	14 996	0.92	NA	1.19
🇨🇿	CZ	35592	COOLHOUSING-AS Coolhousing s.r.o.	13 679	0.84	NA	1.09
🇺🇸	US	16509	AMAZON-02	10 961	0.67	2.73	NA
🇫🇷	FR	16276	OVH OVH SAS	10 483	0.65	2.61	NA
🇨🇿	CZ	13036	TMOBILE-CZ T-Mobile Czech Republic a.s.	9 297	0.57	NA	0.74
🇨🇿	CZ	25248	BlueTone-AS RADIOKOMUNIKACE a.s.	8 715	0.54	NA	0.69
🇨🇿	CZ	5610	O2-CZECH-REPUBLIC O2 Czech Republic, a.s.	7 718	0.48	NA	0.61
🇺🇸	US	396982	GOOGLE-CLOUD-PLATFORM	7 526	0.46	1.87	NA
🇸🇰	SK	29405	VNET-AS VNET a.s.	5 470	0.34	1.36	NA
🇸🇬	SG	64050	BCPL-SG BGPNET Global ASN	5 076	0.31	1.26	NA
🇨🇿	CZ	42422	SECURITYNET-AS SecurityNet.cz s.r.o.	5 066	0.31	NA	0.40
🇨🇿	CZ	197013	SPRINTEL-SRO Sprintel s.r.o.	5 008	0.31	NA	0.40
🇨🇿	CZ	24641	FASTER-AS FASTER CZ spol. s r.o.	4 578	0.28	NA	0.36
🇨🇿	CZ	51210	KRAXNET-AS KRAXNET s.r.o.	4 128	0.25	NA	0.33
🇮🇱	IL	58182	wix_com Wix.com Ltd.	4 050	0.25	1.01	NA
🇨🇿	CZ	47317	WEB4CE Web4ce, s.r.o.	3 999	0.25	NA	0.32
🇨🇿	CZ	44984	Fortion Fortion Networks, s.r.o.	3 124	0.19	NA	0.25
🇨🇿	CZ	30764	PODA-AS PODA a.s.	3 108	0.19	NA	0.25
🇸🇬	SG	63949	AKAMAI-LINODE-AP Akamai Connected Cloud	2 551	0.16	0.63	NA
🇩🇪	DE	51167	CONTABO Contabo GmbH	2 469	0.15	0.61	NA
🇨🇿	CZ	21430	WIA-AS WIA spol. s.r.o.	2 006	0.12	NA	0.16
🇩🇪	DE	15598	IPX-AS15598 NorthC Deutschland GmbH	2 004	0.12	0.50	NA
🇨🇿	CZ	12570	ITSELF Nej.cz s.r.o.	1 926	0.12	NA	0.15
🇨🇿	CZ	35236	AS35236 2 connect a.s.	1 792	0.11	NA	0.14
🇨🇿	CZ	196653	ASBESTNETCZ BEST-NET s.r.o.	1 743	0.11	NA	0.14
🇫🇷	FR	29169	GANDI-AS GANDI SAS	1 719	0.11	0.43	NA
🇸🇰	SK	5578	AS-BENESTRA SWAN, a.s.	1 714	0.11	0.43	NA
🇨🇿	CZ	2852	CESNET2 CESNET z.s.p.o.	1 626	0.10	NA	0.13
🇨🇿	CZ	25512	CDT-AS CD-Telematika a.s.	1 550	0.10	NA	0.12
🇺🇸	US	2639	ZOHO-AS	1 456	0.09	0.36	NA
🇨🇿	CZ	198167	AppToCloud Apptc.me s.r.o.	1 453	0.09	NA	0.12
🇩🇪	DE	61969	TEAMINTERNET-AS Team Internet AG	1 326	0.08	0.33	NA
🇨🇿	CZ	206566	SAVANACZ Webglobe, s.r.o.	1 265	0.08	NA	0.10
🇨🇿	CZ	43542	OptoNet-AS OptoNet Communication, spol. s.r.o.	1 249	0.08	NA	0.10
🇨🇿	CZ	56566	SATT-AS SATT a.s.	1 243	0.08	NA	0.10
🇨🇿	CZ	8251	NFX_ZSPO FreeTel, s.r.o.	1 232	0.08	NA	0.10
🇬🇧	GB	48254	TWENTYI 20i Limited	1 203	0.07	0.30	NA
🇨🇿	CZ	52092	ALFSERVIS-AS Alf servis, s.r.o.	1 198	0.07	NA	0.10
🇨🇿	CZ	200828	THOSTING-AS TERMS a.s.	1 194	0.07	NA	0.09
🇨🇿	CZ	202682	g2server Geetoo CZ s.r.o.	1 145	0.07	NA	0.09
🇩🇪	DE	21499	GODADDY-SXB Host Europe GmbH	1 128	0.07	0.28	NA
🇨🇿	CZ	49985	IP4ISP IP4ISP z.s.p.o	1 089	0.07	NA	0.09
🇨🇿	CZ	47232	ISPAlliance ISP Alliance a.s.	1 072	0.07	NA	0.09
🇨🇿	CZ	28725	CETIN-AS CETIN a.s.	1 070	0.07	NA	0.08
🇺🇸	US	14618	AMAZON-AES	1 057	0.07	0.26	NA
🇨🇿	CZ	44770	SAVVY-AS Savvy s.r.o.	1 038	0.06	NA	0.08
🇨🇿	CZ	44489	Starnet STARNET, s.r.o.	1 023	0.06	NA	0.08
🇨🇿	CZ	42306	EDERA_Group Edera Group a.s.	995	0.06	NA	0.08
🇺🇸	US	54113	FASTLY	989	0.06	0.25	NA
🇨🇿	CZ	20723	MGI AVONET, s.r.o.	967	0.06	NA	0.08
🇭🇰	HK	140227	HKCICL-AS-AP Hong Kong Communications International Co., Limited	963	0.06	0.24	NA
🇫🇷	FR	200185	XANDMAIL-ASN Aruba SAS	958	0.06	0.24	NA
🇺🇸	US	714	APPLE-ENGINEERING	948	0.06	0.24	NA
🇨🇿	CZ	206548	TLAP-ZCOM ZCOM.cz s.r.o	894	0.06	NA	0.07
🇭🇰	HK	132325	LEMON-AS-AP LEMON TELECOMMUNICATIONS LIMITED	889	0.05	0.22	NA
🇨🇿	CZ	50695	VLP-AS VLTAVA LABE MEDIA a.s.	881	0.05	NA	0.07
🇭🇰	HK	136950	HIITL-AS-AP Hong Kong FireLine Network LTD	871	0.05	0.22	NA
🇨🇿	CZ	57707	GREENDATA Greendata s.r.o.	852	0.05	NA	0.07
🇺🇸	US	40065	CNSERVERS	845	0.05	0.21	NA
🇸🇰	SK	42005	LIGHTSTORM-COMMUNICATIONS-SRO-SK-AS LightStorm Services s.r.o.	819	0.05	0.20	NA
🇺🇸	US	46606	UNIFIEDLAYER-AS-1	813	0.05	0.20	NA
🇩🇰	DK	51468	ONECOM One.com A/S	811	0.05	0.20	NA
🇨🇿	CZ	15935	HA-VEL-LOCAL-AS ha-vel internet s.r.o.	806	0.05	NA	0.06
🇨🇭	CH	41913	COMPUTERLINE Computerline GmbH	791	0.05	0.20	NA
🇨🇿	CZ	16246	AS16246 Nej.cz s.r.o.	771	0.05	NA	0.06
🇵🇱	PL	12824	HOMEPL-AS home.pl S.A.	770	0.05	0.19	NA
🇲🇾	MY	55720	GIGABIT-MY Gigabit Hosting Sdn Bhd	764	0.05	0.19	NA
🇭🇰	HK	59371	DNC-AS Dimension Network & Communication Limited	742	0.05	0.18	NA
🇱🇺	LU	24611	DCLUX-AS Datacenter Luxembourg S.A.	722	0.04	0.18	NA
🇨🇿	CZ	41453	TRESTEL-CZ-ASN SITEL spol. s r.o.	714	0.04	NA	0.06
🇨🇿	CZ	207167	amccomp-master-brno amccomp s.r.o.	704	0.04	NA	0.06
🇨🇭	CH	62371	PROTON Proton AG	700	0.04	0.17	NA
🇭🇰	HK	136970	YISUCLOUDLTD-AS-AP YISU CLOUD LTD	673	0.04	0.17	NA
🇨🇿	CZ	43070	JAW-AS JAW.cz s.r.o.	665	0.04	NA	0.05
🇦🇺	AU	133618	TRELLIAN-AS-AP Trellian Pty. Limited	626	0.04	0.16	NA
🇺🇸	US	174	COGENT-174	616	0.04	0.15	NA
🇨🇿	CZ	12767	PRAGONET-AS T-Mobile Czech Republic a.s.	605	0.04	NA	0.05
🇨🇿	CZ	25424	INEXT-CZ InterneXt 2000, s.r.o.	605	0.04	NA	0.05
🇫🇷	FR	24935	ATE-AS AVENIR TELEMATIQUE SAS	602	0.04	0.15	NA
🇺🇸	US	19574	CSC	599	0.04	0.15	NA
🇨🇿	CZ	31246	NETBOX-AS Nej.cz s.r.o.	592	0.04	NA	0.05
🇬🇧	GB	34922	NETNAMES CSC Administrative Services Limited	586	0.04	0.15	NA
🇨🇿	CZ	49767	INTERNETPB-AS Nej.cz s.r.o.	584	0.04	NA	0.05
🇨🇿	CZ	15614	Dragon Dragon Internet a.s.	560	0.03	NA	0.04
🇨🇿	CZ	39817	OVANET OvaNet, a.s.	558	0.03	NA	0.04
🇩🇪	DE	3320	DTAG Deutsche Telekom AG	551	0.03	0.14	NA
🇨🇿	CZ	43614	ECONOMIA-CZ Economia a.s.	543	0.03	NA	0.04
🇭🇰	HK	136778	AIJIASU-AS-AP HONGKONG AI JIA SU NETWORK CO.,LIMITED	523	0.03	0.13	NA
🇺🇸	US	19871	NETWORK-SOLUTIONS-HOSTING	522	0.03	0.13	NA

Countries with missing values

Code

df_cc_NAs |>
  arrange(cc_iso) |>
  mutate(name = countrycode(
    cc_iso, origin = "iso3c",
    destination    = "country.name")) |>
  select(mx,
         cc_iso,
         name,
         distances,
         population,
         gdp,
         export,
         import) |>
  rename("Domains"        = "mx",
         "cc"             = "cc_iso",
         "Country"        = "name",
         "Distance"       = "distances",
         "Population"     = "population",
         "GDP per capita" = "gdp",
         "CZ export"      = "export",
         "Foreign import" = "import") |>
  kable("html",
        digits = 2,
        format.args = list(big.mark = " ")) |>
  kable_styling(
    bootstrap_options = c("striped",
                          "hover",
                          "condensed",
                          "responsive"),
    full_width = TRUE,
    position   = "left")

Table 6: Countries with missing values.
Domains	cc	Country	Distance	Population	GDP per capita	CZ export	Foreign import
0	ABW	Aruba	NA	106 537	29 342.10	193 538	5 515
0	AFE	NA	NA	702 976 832	1 549.77	0	0
0	AFW	NA	NA	478 185 907	1 757.03	0	0
0	AIA	Anguilla	NA	NA	NA	2 554	84 322
0	AND	Andorra	1 330.73	79 034	42 137.33	1 893 288	802 964
0	ARB	NA	NA	456 520 777	6 271.32	0	0
0	ASM	American Samoa	NA	45 035	15 743.31	0	6 298
0	ATA	Antarctica	14 459.66	NA	NA	0	0
0	ATF	French Southern Territories	NA	NA	NA	161 063	210 451
0	ATG	Antigua & Barbuda	NA	93 219	15 781.40	159 262	392 345
0	BLM	St. Barthélemy	NA	NA	NA	52 121	70 599
0	CCK	Cocos (Keeling) Islands	NA	NA	NA	0	24 608
0	CEB	NA	NA	101 430 997	18 751.03	0	0
0	CHI	NA	NA	172 683	NA	0	0
0	COK	Cook Islands	16 806.80	NA	NA	0	89 407
0	CPV	Cape Verde	NA	587 925	3 293.23	314 433	6 961
0	CSS	NA	NA	7 481 877	10 063.69	0	0
0	CUW	Curaçao	NA	152 369	17 717.60	335 455	139
0	CXR	Christmas Island	NA	NA	NA	0	29 606
0	CYM	Cayman Islands	NA	68 136	86 568.77	186 368	53 520
0	CZE	Czechia	NA	10 505 772	26 821.25	0	0
3	DMA	Dominica	NA	72 412	7 653.17	299 631	541 919
0	EAP	NA	NA	2 123 673 456	9 771.65	0	0
0	EAR	NA	NA	3 411 889 059	3 703.90	0	0
0	EAS	NA	NA	2 370 204 347	13 041.78	0	0
0	ECA	NA	NA	401 575 218	8 758.94	0	0
0	ECS	NA	NA	923 777 214	27 152.47	0	0
0	EMU	NA	NA	343 067 846	42 450.15	0	0
0	EUU	NA	NA	447 199 800	38 411.06	0	0
0	FCS	NA	NA	1 001 462 452	1 796.40	0	0
0	FLK	Falkland Islands	13 274.65	NA	NA	0	3 687 603
0	FSM	Micronesia (Federated States of)	NA	113 131	3 571.34	373 213	1 601
4	GIB	Gibraltar	NA	32 669	NA	617 982	641 292
0	GRD	Grenada	NA	124 610	9 010.57	51 303	342 465
0	HIC	NA	NA	1 240 629 858	48 225.24	0	0
0	HPC	NA	NA	861 156 745	1 035.70	0	0
0	IBD	NA	NA	4 895 364 297	7 235.96	0	0
0	IBT	NA	NA	6 695 397 735	5 684.93	0	0
0	IDA	NA	NA	1 800 033 438	1 465.40	0	0
0	IDB	NA	NA	595 016 031	1 842.77	0	0
0	IDX	NA	NA	1 205 017 407	1 281.63	0	0
21	IMN	Isle of Man	NA	84 263	NA	0	0
0	INX	NA	NA	NA	NA	0	0
0	IOT	British Indian Ocean Territory	NA	NA	NA	0	63 856
0	KNA	St. Kitts & Nevis	NA	47 606	18 082.61	2 014	1 232 314
0	LAC	NA	NA	593 308 001	7 728.03	0	0
0	LCA	St. Lucia	NA	179 651	9 414.23	19 514	73 362
0	LCN	NA	NA	654 981 699	8 327.61	0	0
0	LDC	NA	NA	1 099 568 566	1 164.92	0	0
0	LIC	NA	NA	709 088 502	793.95	0	0
13	LIE	Liechtenstein	NA	39 039	NA	0	0
0	LMC	NA	NA	3 398 187 527	2 572.71	0	0
0	LMY	NA	NA	6 619 578 961	5 494.40	0	0
0	LTE	NA	NA	2 321 862 159	11 465.86	0	0
0	MAC	Macao SAR China	NA	686 607	43 873.59	2 410 625	6 102 867
0	MAF	Saint Martin (French part)	NA	31 948	NA	295 045	1 425
1	MCO	Monaco	NA	36 686	234 315.46	0	0
0	MEA	NA	NA	486 167 363	7 569.09	0	0
0	MIC	NA	NA	5 901 323 889	6 074.08	0	0
0	MNA	NA	NA	418 047 201	3 573.17	0	0
0	MNP	Northern Mariana Islands	11 332.89	49 481	NA	0	450
0	MSR	Montserrat	NA	NA	NA	1 002	8 026
0	NAC	NA	NA	370 203 720	68 369.66	0	0
0	NFK	Norfolk Island	NA	NA	NA	0	6 165
0	NIU	Niue	NA	NA	NA	0	229 788
0	OED	NA	NA	1 372 736 513	42 446.86	0	0
0	OSS	NA	NA	32 708 003	13 878.81	0	0
0	PCN	Pitcairn Islands	15 795.68	NA	NA	0	4 579 011
0	PRE	NA	NA	1 011 043 183	1 418.59	0	0
0	PSS	NA	NA	2 602 173	3 643.85	0	0
0	PST	NA	NA	1 116 796 245	49 223.60	0	0
0	PYF	French Polynesia	NA	304 032	19 914.60	4 106 106	29 908
0	SAS	NA	NA	1 901 911 604	2 149.82	0	0
0	SHN	St. Helena	7 570.16	NA	NA	1 462 045	11 852
2	SMR	San Marino	NA	33 745	NA	3 290 768	2 546 717
0	SPM	St. Pierre & Miquelon	5 117.12	NA	NA	0	50 748
0	SSA	NA	NA	1 181 063 481	1 632.09	0	0
0	SSD	South Sudan	NA	10 748 272	NA	179 826	168 629
0	SSF	NA	NA	1 181 162 739	1 633.18	0	0
0	SST	NA	NA	42 792 053	12 588.93	0	0
0	SXM	Sint Maarten	NA	42 846	NA	0	0
0	SYR	Syria	2 499.45	21 324 367	NA	3 311 362	237 638
0	TEA	NA	NA	2 097 669 023	9 880.10	0	0
0	TEC	NA	NA	462 341 222	9 842.05	0	0
0	TKL	Tokelau	NA	NA	NA	0	36 733
0	TLA	NA	NA	639 188 695	8 108.92	0	0
0	TMN	NA	NA	413 124 452	3 572.07	0	0
0	TSA	NA	NA	1 901 911 604	2 149.82	0	0
0	TSS	NA	NA	1 181 162 739	1 633.18	0	0
0	UMC	NA	NA	2 503 136 362	10 828.05	0	0
1	VCT	St. Vincent & Grenadines	NA	104 332	8 666.39	38 485	50 047
97	VGB	British Virgin Islands	NA	31 122	NA	2 228 954	673 373
0	VIR	U.S. Virgin Islands	NA	105 870	NA	0	0
0	WLD	NA	NA	7 888 408 686	12 236.62	0	0
0	WLF	Wallis & Futuna	NA	NA	NA	0	4 386
0	XKX	NA	NA	1 786 038	5 269.78	0	0