During the early years of the Internet, all institutions operated their own mail servers. Their MX (mail exchange) resource records in the DNS thus pointed to domain names within the same domain. However, over the past 10 to 15 years, mail services have increasingly been outsourced to mail service providers ranging from local ISPs to large national or global technology companies such as Seznam.cz or Google’s Gmail. Nevertheless, many second-level domains still have MX records pointing to mail servers within the same domain (see Table 1). In this report, we explore the MX records of all second-level domains under CZ with the aim of assessing the distribution and accumulation of mail services across countries and autonomous systems. We provide both exploratory descriptive statistics and some modeling-based insights.
Code
# These values were taken from a full scan, 2023-07-06.
mx_table <- data.frame(
  "Domains with:" = c(
    " - at least one MX record with its own name.",
    " - at least one MX record without its own name.",
    " - all MX records with its own name."
  ),
  Count = c(132142, 808737, 116388),
  check.names = FALSE
)
mx_table |>
  kable("html", format.args = list(big.mark = " ")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = TRUE,
    position = "left"
  )
Table 1: Outsourcing of MX records.
| Domains with: | Count |
| --- | --- |
| - at least one MX record with its own name. | 132 142 |
| - at least one MX record without its own name. | 808 737 |
| - all MX records with its own name. | 116 388 |
We utilized two primary datasets collected by CZ.NIC’s DNS crawler, which aggregate domain counts by (1) the countries and (2) the autonomous systems from which the mail servers indicated in MX records are operated. In these datasets, domain names appearing in MX records were resolved to IP addresses, which were then assigned to the corresponding countries or autonomous systems via a geolocation database. Note that a second-level domain may have multiple mail servers operating from various countries and/or autonomous systems. In that case, the domain contributes to the counts of each of these countries or autonomous systems.
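The counting rule described above can be sketched in a few lines of R. The toy data frame, its column names (`domain`, `country`), and the domain names are hypothetical and are not the crawler's actual schema.

```r
# Hypothetical toy data: one row per (domain, mail-server location) pair.
mx_geo <- data.frame(
  domain  = c("example-a.cz", "example-a.cz", "example-a.cz",
              "example-b.cz", "example-c.cz"),
  country = c("CZ", "CZ", "DE", "CZ", "US"),
  stringsAsFactors = FALSE
)

# A domain counts once per country, no matter how many of its mail servers
# are located there, so duplicated (domain, country) pairs are dropped first.
pairs <- unique(mx_geo[, c("domain", "country")])

# Number of domains per country; example-a.cz contributes to both CZ and DE.
domains_by_country <- table(pairs$country)
print(domains_by_country)

# The per-country counts can therefore sum to more than the number of domains.
n_domains <- length(unique(mx_geo$domain))
sum(domains_by_country) > n_domains
```

Because of this rule, percentages computed from the per-country counts describe shares of domain-country pairs rather than shares of distinct domains.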
The data used in this report are static in the sense that the datasets do not provide a dynamic view of CZ mail server trends (e.g., time series) but just a limited peek into their distributions at a certain date or time period:
For the countries, all data correspond to a single day, 2023-06-01.
For the autonomous systems, the data cover the interval from 2023-04-20 to 2023-06-15.
Sections labeled as ▶ Code that are interspersed in the text below can be expanded to reveal the actual R code used for producing the statistics, graphs, and tables.
Country perspective
First, we explored geolocations of mail servers of all second-level domains under CZ, using data provided by DNS crawler endpoint /crawler_mail_cc, which aggregates the count of domains with MX records by countries.
Table 2 describes the distribution of second-level domains that have at least one MX record. The first row shows frequencies for all countries; however, as the global distribution is hugely influenced by the CZ frequencies (the maximum value at 1 056 497, 62.96% of all such domains; see also Figure 1 below and Table 4 in the Appendix), the second row presents frequencies for all countries except for the Czech Republic, alleviating the bias stemming from the CZ dominance.
To illustrate the absolute count of these domains, we also selected countries with more than 100 domains and plotted a bar chart which, once again, shows the overwhelming and unsurprising dominance of domains with MX records pointing to mail servers within the Czech Republic.
A complete list of all countries with domains that use MX records can be inspected in Table 4 in the Appendix. To provide additional insight into the relative weight of foreign countries, Table 4 also presents a column Domains percentage (no CZ) which specifies the percentage of all domains with MX records excluding those with mail servers located in the Czech Republic.
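As a rough illustration of how the two percentage columns differ, the sketch below computes both shares from a small vector of counts. The counts are invented for illustration, and reading "no CZ" as "share among foreign countries only" is an assumption about how the column was derived.

```r
# Hypothetical domain counts per country (not the report's data).
counts <- c(CZ = 700, DE = 150, US = 100, NL = 50)

# Share of each country among all countries.
pct_all <- 100 * counts / sum(counts)

# Share of each foreign country once CZ is excluded from the denominator.
foreign <- counts[names(counts) != "CZ"]
pct_no_cz <- 100 * foreign / sum(foreign)

round(pct_all, 1)
round(pct_no_cz, 1)
```

Excluding the dominant country rescales the remaining shares so that differences among foreign countries become easier to read.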
Spatial distribution
Furthermore, we explored the spatial distribution of mail servers indicated in the domains’ MX records. In Figure 2, we present log-transformed values of the absolute count of domains with mail servers in given countries. To inspect the absolute counts, hover the mouse cursor over the respective countries. Note, however, that the color scale representing log-transformed values is somewhat misleading in that it flattens the differences.
Modeling mail servers outside of the Czech Republic, part 1
Unsurprisingly, the IP addresses of most mail servers listed in MX records are geolocated in the Czech Republic. However, one can ask what factors might lead domain holders to set up mail servers outside of the Czech Republic. To better understand such variation and its possible influences, we used Bayesian regression models to test the associations between the count of domains with MX records that indicate mail servers operated from abroad and five predictors that we hypothesized to be associated with these counts.
What is Bayesian regression modeling?
Linear regression analysis is used to predict the value of a response variable (i.e., the dependent variable) based on the values of one or more predictor variables (i.e., independent variables). In the Bayesian framework, linear regression formulates probability distributions for such predictions (in contrast to the point estimates formulated within the frequentist framework). That is, the response variable is not estimated as a single value but is assumed to be drawn from a probability distribution. Bayesian linear regression does not attempt to find one “best” value of the model parameters; rather, it determines the posterior distribution of the model parameters. When doing so, the analyst can also incorporate prior knowledge about what the parameters should be (or use non-informative priors if there are no guesses about the possible values of the parameters).
The results of a Bayesian linear regression model are distributions of possible model parameters (most notably the values by which the response variable changes depending on the predictor variables) based on the data and the priors. In practice, one can express the relationships between the response and predictor variables by describing these posterior distributions. For example, one can report the mean of a parameter’s posterior distribution and its credible interval, which helps to capture the uncertainty about the possible values of the parameter.
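To make the idea of summarizing a posterior distribution concrete, here is a minimal sketch in base R. The draws are simulated from a normal distribution purely for illustration; in a real analysis they would be extracted from the fitted model.

```r
set.seed(42)

# Stand-in for posterior draws of one regression coefficient (assumption:
# simulated values, not draws from any model in this report).
draws <- rnorm(12000, mean = 2.4, sd = 0.25)

# Posterior mean and equal-tailed 95% credible interval.
post_mean <- mean(draws)
ci_95 <- quantile(draws, probs = c(0.025, 0.975))

round(post_mean, 2)
round(ci_95, 2)

# Posterior probability that the coefficient is positive.
mean(draws > 0)
```

The interval is read directly off the draws, which is why no distributional formula is needed once the sampler has run.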
Predictors
Geographic distance from the Czech Republic
First, domain holders might prefer mail servers in countries that are geographically closer to the Czech Republic as they usually provide better round-trip times. We thus hypothesized that there should be a higher count of domains that use MX records in countries closer to the Czech Republic.
To calculate the distances between the Czech Republic and other countries, we first calculated centroids for each country (the geographic point at the center of gravity for a polygon that approximates each country’s borders) and then calculated the distance between the Czech Republic’s centroid coordinates and respective foreign countries’ centroids. As the distribution of these values showed great dispersion, we log-transformed them for the purposes of modeling.
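The distance step can be sketched as follows. The report does not state which distance formula or package was used, so the great-circle (haversine) formula here is an assumption, and the centroid coordinates are approximate values chosen only for illustration.

```r
# Great-circle (haversine) distance in kilometres between points given as
# longitude/latitude in decimal degrees.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Approximate, illustrative centroid coordinates (assumed values).
centroids <- data.frame(
  cc_iso = c("CZ", "DE", "US"),
  lon    = c(15.3, 10.4, -98.6),
  lat    = c(49.7, 51.1, 39.8)
)

# Distance from the Czech centroid to every centroid (0 for CZ itself).
cz <- centroids[centroids$cc_iso == "CZ", ]
centroids$distances <- haversine_km(cz$lon, cz$lat, centroids$lon, centroids$lat)

# Log-transform as in the model formulas; the + 1 keeps a distance of 0 finite.
centroids$log_distance <- log(centroids$distances + 1)
centroids
```

Centroid-to-centroid distance is a coarse proxy: for large countries the centroid can lie far from the data centers that actually host the mail servers.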
Population size
We also hypothesized that domain holders are more likely to have mail servers in countries with higher populations, as mail service providers are more motivated to deploy their servers in such countries.
The values for population sizes were downloaded from The World Bank. We used the most recent values (year 2021) which we log-transformed.
Code
pop <- read.csv("CSV/POP.csv", sep = ",")
pop <- pop |>
  select(cc, X2021) |>
  rename(population = X2021, cc_iso = cc)
df_cc2 <- full_join(df_cc2, pop, by = "cc_iso")
GDP per capita
Furthermore, we also hypothesized that GDP per capita might increase the count of domains with MX records pointing to mail servers located in foreign countries. Again, the idea was that such countries have a better network infrastructure and may thus be more attractive for server deployments.
As with population size, the values for GDP per capita were downloaded from The World Bank. We used the most recent values (year 2021, in US dollars), which we log-transformed.
Code
gdp <- read.csv("CSV/GDP.csv", sep = ",")
gdp <- gdp |>
  select(cc, X2021) |>
  rename(gdp = X2021, cc_iso = cc)
df_cc2 <- full_join(df_cc2, gdp, by = "cc_iso")
Export and import value
However, the value of Czech export to foreign countries and of foreign countries’ import from the Czech Republic might provide more granular proxies for international connections, ones that better reflect trading networks, than GDP per capita, geographic distance, and population size. Therefore, we hypothesized that the value of the Czech Republic’s export to foreign countries would be associated with the count of domains that use mail servers operated from abroad. Similarly, we hypothesized that the value of foreign countries’ import from the Czech Republic would be associated with the count of these domains.
The values for the size of Czech export and import were downloaded from The Observatory of Economic Complexity (OEC). These values are proxies for a direct trading relationship between the Czech Republic and a given country. We used the most recent freely available values (year 2021) which we log-transformed.
Upon pairing all the predictor data with our original data on domains using MX records into one dataset, we obtained a number of rows with missing values.
First, we manually adjusted the data for the population size, GDP per capita, and geographic distance for larger countries, filling in the missing values.
Second, as our original dataset does not contain data on countries that do not operate any mail servers under CZ, we assumed that such countries have zero domains using MX records and replaced such missing values with zeros. Similarly, in the case of Czech export and foreign import, we assumed that countries not captured in the OEC datasets do not trade with the Czech Republic, hence we replaced such missing values with zeros.
Lastly, for small (mostly island) countries, we did not attempt to fill in the missing values. Therefore, these countries are omitted in the analysis (as regression models cannot handle rows with missing values) and can be inspected in Table 6 in the Appendix.
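The three rules for missing values can be illustrated with a small sketch. The data frame, its values, and the placeholder country codes are hypothetical; only the logic (zeros for counts and trade values, omission of rows that remain incomplete) follows the description above.

```r
# Hypothetical merged dataset after the joins (values are illustrative only).
df <- data.frame(
  cc_iso     = c("DE", "US", "XX", "YY"),
  mx         = c(120, 80, NA, NA),       # NA: no mail servers seen for CZ domains
  export     = c(5e9, 1e9, NA, 2e6),     # NA: country absent from the trade data
  population = c(8.3e7, 3.3e8, 5e6, NA)  # NA: no population value available
)

# Domain counts and trade values: a missing value is treated as zero.
df$mx[is.na(df$mx)] <- 0
df$export[is.na(df$export)] <- 0

# Rows that are still incomplete are dropped, as the models cannot use them.
df_model <- df[complete.cases(df), ]
dropped <- setdiff(df$cc_iso, df_model$cc_iso)

df_model
dropped
```

Treating missing counts as zeros is what produces the many zero observations that the zero-inflated models later have to account for.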
While all of the proposed predictors were correlated with one another to some degree (see Figure 3 below), they differ theoretically in how they function as proxies for domain holders’ tendency to set up their mail services outside of the Czech Republic. Geographic distance, population size, and GDP per capita are properties of the respective countries, whereas the values of export and import describe the interaction between the Czech Republic and the given countries. In other words, while there is no “human design” behind the geographic distance or population size of foreign countries with respect to mail servers indicated in MX records of second-level domains under CZ, both of these variables could influence the counts as a passive side-effect of geographic closeness or sufficiently large populations. Similarly, GDP per capita of foreign countries is an index not directly connected to the Czech Republic but may influence the count of domains as a side-effect of foreign countries’ economic power.
In contrast, Czech export to foreign countries and foreign import from the Czech Republic can be conceptualized as intentionally directed and active networking channels. Therefore, the count of domains with MX records pointing to mail servers in given countries could increase as domain holders might spend resources on developing their own CZ infrastructure and operations in countries where trading with the Czech Republic is more intense.
Because the distribution of domain counts was over-dispersed and showed high zero-inflation (i.e., there were many countries with no mail servers indicated in the domains’ MX records), we used a zero-inflated negative binomial distribution to model both the count of domains with MX records and the proportion of zeros.
What is zero-inflated negative binomial distribution?
Zero-inflated negative binomial regression can be used for modeling count variables with excessive zeros (i.e., zero-inflation) where the distribution of the count response variable also shows overdispersion (i.e., its variability is greater than would be expected in more classical models like Poisson). Furthermore, the excess zeros can be conceptualized as generated by a separate process from the count values and can be, therefore, modeled independently from the count of the response variable.
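The two-process view can be made tangible by simulating from such a distribution in base R. All parameter values below are arbitrary assumptions chosen for illustration, not estimates from the report.

```r
set.seed(1)
n     <- 10000
zi    <- 0.4   # probability of a structural zero (assumed)
mu    <- 50    # mean of the count process (assumed)
shape <- 0.5   # negative binomial shape; smaller means more overdispersion (assumed)

# Process 1: decides whether an observation is a structural zero.
structural_zero <- rbinom(n, size = 1, prob = zi) == 1

# Process 2: negative binomial counts, which can also produce zeros.
counts <- rnbinom(n, size = shape, mu = mu)

y <- ifelse(structural_zero, 0, counts)

# Observed share of zeros versus the theoretical share from both processes.
zero_share_sim <- mean(y == 0)
zero_share_theory <- zi + (1 - zi) * dnbinom(0, size = shape, mu = mu)
c(simulated = zero_share_sim, theoretical = zero_share_theory)

# Variance-to-mean ratio; a Poisson distribution would give a value near 1.
dispersion_ratio <- var(y) / mean(y)
dispersion_ratio
```

Zeros thus arise in two ways, which is why the model estimates a separate set of coefficients (prefixed with `zi_`) for the zero-inflation part.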
m2 <- brm(
  bf(mx ~ log(distances + 1), zi ~ log(distances + 1)),
  family = zero_inflated_negbinomial(),
  data = df_cc2,
  prior = c(
    prior(normal(0, 2), class = Intercept),
    prior(normal(0, 2), class = b)
  ),
  chains = 4,
  iter = 4000,
  warmup = 1000,
  cores = parallel::detectCores()
)
saveRDS(m2, file = "m2.rds")
Code
m3 <- brm(
  bf(mx ~ log(population + 1), zi ~ log(population + 1)),
  family = zero_inflated_negbinomial(),
  data = df_cc2,
  prior = c(
    prior(normal(0, 2), class = Intercept),
    prior(normal(0, 2), class = b)
  ),
  chains = 4,
  iter = 4000,
  warmup = 1000,
  cores = parallel::detectCores()
)
saveRDS(m3, file = "m3.rds")
In model m1, we used all five predictors to estimate the count of domains with MX records pointing to mail servers located in foreign countries and the probability that no mail servers are located in the respective countries. However, as we observed severe issues with multicollinearity (when predictors correlate with each other highly, the resulting model parameters become unreliable), we do not discuss model m1 (for the code and results, see the Appendix), and focus on modeling each predictor separately.
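One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below computes it in base R on simulated predictors; the report does not state which diagnostic was used, so both the method and the data are assumptions for illustration.

```r
set.seed(7)
n <- 188

# Simulated predictors: export and import are built to be strongly correlated.
export <- rnorm(n)
import <- export + rnorm(n, sd = 0.2)
gdp    <- 0.5 * export + rnorm(n)
X <- data.frame(export, import, gdp)

round(cor(X), 2)

# VIF of a predictor: 1 / (1 - R^2), where R^2 comes from regressing that
# predictor on all remaining predictors.
vif <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 1)
```

Large values indicate that a predictor is nearly a linear combination of the others, which inflates the uncertainty of its coefficient; the thresholds at which this becomes a problem are a matter of convention.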
For the sake of brevity, we also refer the reader to the Appendix for the results of models m2 (geographic distance) and m3 (population size). While they found associations with the count of domains and their zero-inflation, they performed worse than the other models in the model comparison (see below) and found weaker effects than model m4 (GDP per capita).
Model 4: GDP per capita
Code
m4 <- brm(
  bf(mx ~ log(gdp + 1), zi ~ log(gdp + 1)),
  family = zero_inflated_negbinomial(),
  data = df_cc2,
  prior = c(
    prior(normal(0, 2), class = Intercept),
    prior(normal(0, 2), class = b)
  ),
  chains = 4,
  iter = 4000,
  warmup = 1000,
  cores = parallel::detectCores()
)
saveRDS(m4, file = "m4.rds")
Code
summary(m4)
 Family: zero_inflated_negbinomial
  Links: mu = log; shape = identity; zi = logit
Formula: mx ~ log(gdp + 1)
         zi ~ log(gdp + 1)
   Data: df_cc2 (Number of observations: 188)
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000
Regression Coefficients:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      -16.15      2.41   -20.65   -11.13 1.00     6951     6444
zi_Intercept    15.49      5.52     6.00    27.51 1.00     6252     4895
loggdpP1         2.41      0.25     1.89     2.89 1.00     7249     7126
zi_loggdpP1     -1.99      0.69    -3.53    -0.86 1.00     5907     4392
Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.11      0.02     0.08     0.15 1.00     7377     6768
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
How to read the results?
The results summary table above presents a great deal of information. However, only a part of the section titled Regression Coefficients (labelled Population-Level Effects in older versions of brms) is usually of prime interest.
First, the reader should focus on the columns Estimate (mean of the posterior distribution for respective predictors), l-95% CI (lower bound of the 95% credible interval), and u-95% CI (upper bound of the 95% credible interval). As noted above, these values help us interpret the parameters’ posterior distributions.
The mean estimate describes the central tendency of the parameter and can be interpreted as the value around which the association between the predictor and the response variable most probably lies. It thus represents how much the values of the response change depending on the values of the predictor. Note that this value is not a single point estimate but the mean of many possible regression lines that the model fitted to the data.
Furthermore, we are interested in the 95% credible intervals (95% CIs), which capture where 95% of a parameter’s posterior distribution lies. CIs are described by their lower (l-95% CI) and upper (u-95% CI) bounds (i.e., 95% of the parameter’s distribution lies between these bounds). When looking at credible intervals, we are interested in whether they contain zero: if they do, the data are compatible with no relationship between the variables; if they do not, there probably is a relationship. For example, a parameter with mean M = 2.41 whose 95% credible interval runs from 1.90 to 2.88 excludes zero and therefore suggests a positive relationship between the response and predictor variables. However, a parameter with mean M = 2.41 whose 95% credible interval runs from -1.90 to 2.88 contains zero, so a null effect cannot be ruled out.
Then, in the same section (Regression Coefficients), we are mostly interested in the rows describing the parameters for the investigated predictors (here labelled loggdpP1 and zi_loggdpP1). These two describe the respective relationships between the response and predictor variables. For loggdpP1, we see that the effect is positive as the 95% CI lies above zero, suggesting that higher GDP per capita is associated with a higher count of domains. For zi_loggdpP1 (zi stands for zero-inflation), we observe a negative effect as the 95% CI lies below zero, suggesting that higher GDP per capita is associated with a lower probability of observing zero domains with MX records.
However, the reported parameter values are not intuitively interpretable because the predictors were log-transformed in the model formula, and the model used a log link for the count of domains and a logit link for the zero-inflation probability. For the count part, the results can be interpreted as percent changes. For example, in model m4, a 1% increase in GDP per capita is associated with a roughly 2.4% increase in the count of domains with MX records. For any x percent increase, one calculates (1 + x/100) to the power of the coefficient, subtracts 1, and multiplies by 100. For example, a 30% increase in GDP per capita corresponds to an 87.7% increase in the count of domains ((1.30^2.40 - 1) * 100 = 87.7).
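The percent-change rule can be written as a small helper function. The function name `pct_change` is introduced here for illustration only. The second part applies the inverse logit to the posterior means reported for m4; the GDP values used there are arbitrary.

```r
# Percent change in the expected count for an x% increase in the predictor,
# given a coefficient on the log-transformed predictor under a log link.
pct_change <- function(x_pct, coef) {
  ((1 + x_pct / 100)^coef - 1) * 100
}

pct_change(1, 2.40)   # roughly 2.4
pct_change(30, 2.40)  # roughly 87.7, matching the worked example above

# Zero-inflation part: the logit link is inverted with plogis().
# Coefficients are the posterior means reported for m4 (zi_Intercept and
# zi_loggdpP1); the GDP per capita values are arbitrary examples.
gdp <- c(1000, 10000, 50000)
p_zero <- plogis(15.49 - 1.99 * log(gdp + 1))
round(p_zero, 3)
```

Because the model uses log(x + 1) rather than log(x), the percent interpretation is an approximation that is very close for large predictor values such as GDP per capita.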
Naturally, charts provide an important visual aid for understanding the relationships between the predictor and response variables. In a nutshell, the (blue) regression lines help us understand how the values of the response variable on the y-axis change depending on the values of the predictor on the x-axis (i.e., what is their association).
Note that for the domain count (right subplot), both variables are on their original scale. For the zero-inflation (left subplot), GDP per capita is on its original scale, while the zero-inflation is shown as a probability between 0 and 1, obtained by inverting the logit link.
In the domain count chart (right), we can also inspect a scatter of the countries in which the mail servers are set up. Here, each dot represents one country positioned according to its GDP per capita and its count of domains with MX records. Note that not all countries are shown in these charts, as the y-axis has been truncated to focus on the regression lines.
Crucially, as we used Bayesian regression analysis, we obtained not a single regression line fitted to the data points but a whole distribution of lines. The charts therefore show a number of lines sampled from the posterior distribution. This sample helps to visualize the distribution of a given parameter: the lines become denser toward the middle, where the mean is represented by one thick white line. Furthermore, the sample is bounded by the 95% CI. Together, they illustrate both the relationship between the variables and the model’s uncertainty about this relationship.
Model m4 estimated that GDP per capita associates negatively with the zero-inflation of domains using MX records and positively with their count. Note that the left subplot presents the zero-inflation (descending lines suggest lower probability of observing zero domains using MX records) and the right subplot the count of domains (rising lines suggest higher counts of domains using MX records).
Model 5: Czech export
Code
m5 <- brm(
  bf(mx ~ log(export + 1), zi ~ log(export + 1)),
  family = zero_inflated_negbinomial(),
  data = df_cc2,
  prior = c(
    prior(normal(0, 2), class = Intercept),
    prior(normal(0, 2), class = b)
  ),
  chains = 4,
  iter = 4000,
  warmup = 1000,
  cores = parallel::detectCores()
)
saveRDS(m5, file = "m5.rds")
Code
summary(m5)
 Family: zero_inflated_negbinomial
  Links: mu = log; shape = identity; zi = logit
Formula: mx ~ log(export + 1)
         zi ~ log(export + 1)
   Data: df_cc2 (Number of observations: 188)
  Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 12000
Regression Coefficients:
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -4.02      1.46    -6.71    -0.95 1.00     7745     6234
zi_Intercept      22.21      4.45    14.65    32.04 1.00    10723     7010
logexportP1        0.59      0.07     0.44     0.74 1.00     8162     6406
zi_logexportP1    -1.31      0.26    -1.90    -0.86 1.00    10936     7037
Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.16      0.02     0.12     0.21 1.00    10282     8845
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Figure 6: Associations between foreign import and domains with MX records.
Model m6 estimated that foreign import from the Czech Republic associates negatively with the zero-inflation of domains using MX records and positively with their count.
ELPD: Expected log predictive density. Larger ELPD values mean better fit.
LOOIC: Leave-one-out cross-validation (LOO) information criterion. Lower LOOIC values mean better fit.
WAIC: Widely applicable information criterion. Lower WAIC values mean better fit.
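To show how such criteria relate to one another, the sketch below computes WAIC by hand from a matrix of pointwise log-likelihoods using the standard textbook formula. It is not the routine used in the report: the data and the "posterior draws" are simulated for a toy normal model, and the same sign convention links LOOIC to its ELPD (LOOIC = -2 × ELPD).

```r
set.seed(1)

# Toy observations and stand-in posterior draws for their mean (assumptions).
y <- rnorm(50, mean = 1, sd = 1)
mu_draws <- rnorm(2000, mean = mean(y), sd = 1 / sqrt(length(y)))

# log_lik[s, i] = log density of observation i under posterior draw s.
log_lik <- sapply(y, function(yi) dnorm(yi, mean = mu_draws, sd = 1, log = TRUE))

# Log pointwise predictive density and the effective number of parameters.
lppd   <- sum(log(colMeans(exp(log_lik))))
p_waic <- sum(apply(log_lik, 2, var))

# ELPD is on the log-density scale (larger is better); WAIC is on the
# deviance scale (smaller is better).
elpd_waic <- lppd - p_waic
waic <- -2 * elpd_waic

c(elpd_waic = elpd_waic, p_waic = p_waic, waic = waic)
```

Since the information criteria are just ELPD estimates multiplied by -2, rankings based on ELPD and on the corresponding criterion always agree.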
When compared, model m5 (using the export predictor, logexportP1) seems to be the best fit for the data. The model m6 (using the import predictor, logimportP1) ranked second, followed by the model m4 (using the GDP per capita predictor, loggdpP1). Models m2 (distance, logdistancesP1) and m3 (population, logpopulationP1) fared worse than models m4, m5, and m6.
Discussion
We investigated what factors might motivate domain holders to set up mail servers outside of the Czech Republic. We explored the influence of (1) foreign countries’ geographic distance from the Czech Republic, (2) the foreign countries’ population size, (3) the foreign countries’ GDP per capita, (4) the Czech Republic’s export to the respective foreign countries, and (5) foreign countries’ import from the Czech Republic. We found that all of these predictors are associated with the counts of domains that use MX records pointing to foreign mail servers; however, due to multicollinearity issues, we were not able to estimate them within one model.
As our models predicted both the count of domains and their zero-inflation (in many foreign countries, there are no mail servers for CZ domains), we found that nearly all of the predictors had complementary associations with both of these processes. For example, as foreign countries’ GDP per capita increased, the count of domains using mail servers abroad increased, while the probability of operating zero mail servers from abroad decreased. In other words, economically stronger countries had lower chances of having no mail servers outsourced from the Czech Republic and also a higher count of such mail servers. This complementary trend applies to all of the predictors.
Interestingly, GDP per capita showed the strongest associations with the count of domains that use foreign mail server services and with their zero-inflation; however, models using the export and import values provided a better fit for the data.
Knowing that these modeling results are correlational associations and cannot suggest causal links, we can summarize that domain holders’ interest in foreign mail server services:
Decreases with rising geographic distance from the Czech Republic.
Increases with rising population size of foreign countries (however, the positive association with the count of domains that use foreign mail servers is not certain).
Increases with rising GDP per capita of foreign countries.
Increases with rising value of Czech export to foreign countries.
Increases with rising value of foreign import from the Czech Republic.
Still, we may speculate about the causal processes behind the associations observed for export and import values. As we noted above, in comparison to the “passive” proxies of geographic distance, population size, and GDP per capita, export and import values reflect “active”, intentionally directed channels in human networks. Then, intense trading between Czech and foreign companies may motivate these actors to increasingly spend resources on developing larger mail server infrastructure. However, to establish these causal effects, further analysis using time series would be needed (given that natural variation in export and import values would provide sufficient data for such a natural experiment).
Autonomous systems perspective
In this section, we explored domains using MX records with respect to autonomous systems, using data provided by CZ.NIC’s DNS crawler endpoint /crawler_mail_asn. To provide further detail, we also paired these data with public information on the holders of autonomous systems and their respective domiciles.
Table 3 presents the distribution of domains using MX records in all autonomous systems with at least one domain with an MX record. The first row shows statistics for all autonomous systems operated by companies with domicile in the Czech Republic, while the second row specifies values for all autonomous systems with foreign holders.
Table 3: Distribution of domains with MX records across autonomous systems.
| Subset | Min | Mean | Median | Max | Total | Percent |
| --- | --- | --- | --- | --- | --- | --- |
| CZ | 1 | 2 557.87 | 39 | 215 911 | 1 222 660 | 75.26% |
| Foreign | 1 | 126.89 | 2 | 58 605 | 365 050 | 22.47% |
Next, to illustrate the absolute count of domains using MX records across autonomous systems, we selected autonomous systems with more than 1000 domains with MX records and plotted a bar chart (see Figure 7). Note that red bars represent autonomous systems with domicile in the Czech Republic, while blue bars represent autonomous systems with domicile outside of the Czech Republic (with one teal bar for unknown autonomous systems). As above, the chart illustrates the unsurprising dominance of autonomous systems with domicile in the Czech Republic.
Figure 7: Domains with MX records by ASN (>1000).
A longer list of autonomous systems with more than 500 domains using MX records can be inspected in Table 5 in the Appendix. To provide a separate insight into the representation of foreign domains using MX records, Table 5 also presents a column Domains' percentage (no CZ) which specifies the percentage of all domains with MX records excluding the autonomous systems with domicile within the Czech Republic. Similarly, the table also presents a column Domains' percentage (CZ only) specifying the percentage of all domains using MX records with domicile within the Czech Republic only.
Modeling mail servers outside of the Czech Republic, part 2
As in the country-focused section, we tested the associations between the count of domains with MX records pointing to mail servers outside of the Czech Republic and the same set of predictors. However, the fact that autonomous systems often span the borders of many countries complicates the notion of mail servers located outside of the Czech Republic. As a workaround, we used the domicile of the autonomous systems’ holders as a key by which we paired the counts of domains with MX records to the predictor values. This approach is problematic, as the domicile of an autonomous system’s holder may easily differ from the actual geolocation of the mail servers’ IP addresses, but it enabled us to use the country-specific predictors to model the count of domains.
Initially, we ran model m7 (see Appendix) using the negative binomial distribution to predict the count of domains from all five predictors, with varying intercepts for the domicile of the autonomous systems’ holders. However, the posterior predictive check revealed that the model struggled to capture the inflation of autonomous systems with only one domain and simultaneously overestimated the count of zeros (see Figure 20 in the Appendix). Therefore, we opted for a zero-inflated negative binomial distribution to predict the count of domains in another model, m8. However, as the inflation was observed not for zeros but for ones, we discarded zero observations from the dataset (there were 110 such observations out of a total of 2898) and subtracted 1 from the count of domains, which enabled us to model the one-inflation of the domain counts and obtain a better fit for the data. Furthermore, high multicollinearity was detected for the export predictor (and medium multicollinearity for the distance, population, and import predictors); therefore, we omitted the export predictor and also the import predictor (as it was highly correlated with export) from model m8, and modeled them separately in m9 and m10.
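The recoding used to turn one-inflation into zero-inflation can be sketched as follows. The data frame and its column names (`asn`, `mx`, `mx_shifted`, `mx_back`) are hypothetical and only illustrate the transformation described above.

```r
# Hypothetical counts of domains per autonomous system.
as_counts <- data.frame(
  asn = c("AS-A", "AS-B", "AS-C", "AS-D", "AS-E", "AS-F"),
  mx  = c(0, 1, 1, 1, 5, 240)
)

# Drop autonomous systems with zero domains, then shift the counts down by
# one so that the spike at one becomes a spike at zero.
as_model <- as_counts[as_counts$mx > 0, ]
as_model$mx_shifted <- as_model$mx - 1

# How many observations now sit at zero (originally exactly one domain)?
table(as_model$mx_shifted == 0)

# Values on the shifted scale are mapped back by adding one.
as_model$mx_back <- as_model$mx_shifted + 1
```

After this shift, the zero-inflation component of the model describes the probability of an autonomous system serving exactly one domain, and the count component describes the number of domains beyond the first.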
What is posterior predictive check?
A posterior predictive check is a method for comparing the values that the model predicts with the actual values on which the model was estimated. The goal is to inspect whether the fitted model adequately describes the observed data. Put simply, if the data simulated from the fitted model deviate from the observed data, we should worry and try to find a better approach for modeling the relationships.
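A minimal posterior predictive check can be carried out in base R for a deliberately too simple model. The sketch below fits a Poisson model with a conjugate Gamma prior to simulated zero-heavy counts and compares the share of zeros; the data, the prior parameters (0.01, 0.01), and the choice of test statistic are all assumptions for illustration.

```r
set.seed(3)

# "Observed" counts with many zeros (simulated here, not the report's data).
n <- 500
y_obs <- ifelse(rbinom(n, 1, 0.5) == 1, 0, rnbinom(n, size = 1, mu = 20))

# Poisson model with a Gamma(0.01, 0.01) prior on the rate: the posterior is
# Gamma(0.01 + sum(y), 0.01 + n), so it can be sampled directly.
lambda_draws <- rgamma(1000, shape = 0.01 + sum(y_obs), rate = 0.01 + n)

# For each posterior draw, simulate a replicated dataset and record the
# share of zeros in it.
zero_share_rep <- sapply(lambda_draws, function(l) mean(rpois(n, l) == 0))

zero_share_obs <- mean(y_obs == 0)
zero_share_obs
range(zero_share_rep)

# Share of replicated datasets with at least as many zeros as observed.
mean(zero_share_rep >= zero_share_obs)
```

When the observed statistic lies far outside the range of the replicated ones, as it does here for the Poisson model, the model fails to reproduce that feature of the data and a richer distribution is warranted.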
Figure 8: Associations between predictors and domains with MX records.
Model m8 estimated that GDP per capita is positively associated with the count of domains and, strangely, also with their “zero”-inflation (which, in fact, was a recoded one-inflation). The other predictors showed null associations.
The relationships between the predictors and the count of domains are plotted in Figure 8 above. Note that the left column plots the one-inflation (rising lines suggest higher probability of observing one domain using MX records) and the right column the count of domains (rising lines suggest higher counts of domains using MX records, for all values above one).
The models performed similarly well on the data, although m8 fared the best.
Discussion
As above, we investigated what might motivate domain holders to set up mail servers “outside” of the Czech Republic. However, in comparison to the models reported in the country-focused section (where the count of domains with MX records was aggregated by the countries where the mail servers were located), the data used in this analysis aggregated the count of domains with MX records by their respective autonomous systems. Furthermore, the predictor variables used in this analysis were paired to the domain counts by the domicile of the holders that operate the autonomous systems.
We found that foreign countries’ GDP per capita and the value of Czech export were associated with the count of domains with MX records pointing to mail servers operated by autonomous systems’ holders with domicile outside of the Czech Republic. Crucially, as the models focusing on autonomous systems did not suffer from multicollinearity issues as severe as those in the country-focused models, they provided us with better control over the investigated predictors. We observed that while GDP per capita remained positively associated with the count of domains, the associations with geographic distance and population size disappeared. This difference from the previous models seems to make sense, as the geographic distance and population size predictors were paired not by the actual geolocation of autonomous systems but by the domicile of the autonomous systems’ holders, making them disconnected from the count of domains in these data.
Although the export and import values had to be modeled separately due to multicollinearity issues, we found a positive relationship between the value of Czech export and the count of domains. Therefore, we may suggest that domain holders’ tendency to set up their mail servers “abroad” is positively associated only with economic proxies, at least when using data where the count of domains is aggregated by autonomous systems and the predictors are paired by the domicile of the autonomous systems’ holders.
Modeling domains with MX records among Czech companies
To complement the focus on domains with mail server services set up outside of the Czech Republic, we also modeled the variation in the count of domains within autonomous systems whose holders are domiciled in the Czech Republic. Here, we focused on the relationship between the count of domains with MX records and the financial turnover of the autonomous systems’ holders.
To do so, we surveyed the public register, searching for the financial statements of the respective companies, and created a dataset with values for their turnover. Because 475 autonomous systems are operated by Czech companies, we set an arbitrary limit and collected turnover data only for companies with more than 100 domains with MX records within their autonomous systems. We could not find turnover values for all of these companies (the public register does not always provide such data), so our final dataset contained turnover values for 117 Czech companies (out of 155 companies with more than 100 domains; the 320 companies with fewer than 100 domains were ignored). The resulting dataset accounted for 975 267 domains with MX records (79.77% of all domains within autonomous systems operated by Czech holders) and left out 247 321 domains (20.23%), of which 8 566 were operated in autonomous systems with fewer than 100 domains. Note that this incomplete coverage may bias the results below.
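The coverage figures quoted above can be rechecked with a few lines of arithmetic. The sketch assumes that the covered and left-out domains together make up all domains within autonomous systems operated by Czech holders; the variable names are ours.

```r
# Figures quoted in the paragraph above.
covered   <- 975267   # domains in autonomous systems with turnover data
uncovered <- 247321   # domains left out of the turnover dataset
total     <- covered + uncovered

# Shares of covered and left-out domains, in percent.
share_covered   <- round(100 * covered / total, 2)
share_uncovered <- round(100 * uncovered / total, 2)

# 155 companies above the 100-domain limit and 320 below it,
# which adds up to the 475 quoted above.
companies_total <- 155 + 320
```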
To model the relationship between the count of domains with MX records pointing to mail servers operated by holders with domicile in the Czech Republic and their financial turnover, we used the negative binomial distribution and log-transformed the turnover values. Furthermore, as some companies operated more than one autonomous system, we also specified a varying intercept for the holders of autonomous systems to account for this nesting.
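The model structure described above can be sketched as follows. This is an illustration under stated assumptions, not the report's actual code: the data are simulated, and the names `df_cz`, `holder`, `turnover` and `f_sketch` are hypothetical. Only the general shape (negative binomial counts, log-transformed turnover, a varying intercept per holder) follows the text. The fitting call is guarded because it needs the brms package and a Stan toolchain.

```r
set.seed(1)

# Simulated stand-in data; holder names, turnover and counts are invented.
n_holders <- 30
holders <- data.frame(
  holder   = paste0("holder_", seq_len(n_holders)),
  turnover = exp(rnorm(n_holders, mean = 17, sd = 2)),  # arbitrary scale
  u        = rnorm(n_holders, sd = 0.5)                 # holder-level intercept shift
)

# Some holders operate more than one autonomous system (rows repeat holders).
df_cz <- holders[sample(n_holders, 45, replace = TRUE), ]
rownames(df_cz) <- NULL

# Negative binomial counts with a log link:
# log(mu) = intercept + slope * log(turnover) + holder shift
mu <- exp(-4 + 0.6 * log(df_cz$turnover) + df_cz$u)
df_cz$mx <- rnbinom(nrow(df_cz), mu = mu, size = 0.5)

# Formula: log-transformed turnover plus a varying intercept per holder.
f_sketch <- mx ~ log(turnover) + (1 | holder)

# Fitting is switched off by default; set run_fit to TRUE to try it.
run_fit <- FALSE
if (run_fit && requireNamespace("brms", quietly = TRUE)) {
  fit <- brms::brm(f_sketch, data = df_cz, family = brms::negbinomial())
}
```

The varying intercept lets autonomous systems that share a holder share a baseline level, so that a company with several autonomous systems is not treated as several independent observations.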
Figure 11: Associations between CZ companies’ turnover and domains with MX records.
Model m11 estimated a positive association between the count of domains and the financial turnover of the Czech companies that operate the respective autonomous systems. The relationship is plotted above in Figure 11.
General discussion
In this ADAM report, we explored the distribution of MX records of all second-level domains under CZ across the world, looking into datasets which aggregated the count of domains with MX records by countries and by autonomous systems. Unsurprisingly, we observed that most domains using MX records use mail servers located within the Czech Republic.
We also explored what lies behind the domain holders’ tendency to set up their mail servers outside of the Czech Republic. When using data which aggregated the count of domains by countries in which the IP addresses of mail servers were geolocated, geographic distance from the Czech Republic, foreign countries’ population size, foreign countries’ GDP per capita, Czech export value, and foreign import value from the Czech Republic were all associated with the count of domains.
However, when we switched the perspective and used data where the count of domains was aggregated by autonomous systems, only GDP per capita and the value of Czech export were estimated to be associated with the count of domains. On the basis of the observed effect sizes and model performance indices, we may summarize that GDP per capita and the value of Czech export best explain the variance in the domain holders’ tendency to set up their mail servers abroad.
Finally, we also modeled whether the count of domains with MX records within autonomous systems operated by companies with domicile in the Czech Republic is connected to the financial turnover of such companies. We found that the two are indeed positively associated.
Appendix
Models
Here, we present posterior predictive checks, multicollinearity checks, and model summaries of models skipped in the main text.
Model 1
Code
color_scheme_set("brightblue")
pp_check(m1, type = "bars", ndraws = 20)
pp_check(m1, type = "bars", ndraws = 20) + xlim(-1, 20)
(a) Posterior predictive check.
(b) Posterior predictive check (detail).
Figure 12: Model 1 diagnostics.
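The idea behind the bar-type posterior predictive checks shown in this appendix can be sketched in base R. The sketch is a simplification under stated assumptions: the "observed" counts are simulated, and the replicated datasets all come from one fixed, invented parameter setting, whereas a real check draws each replicate from a different posterior draw of the fitted model. It only shows the comparison being made, namely how often each count value occurs in the data versus in the replicates.

```r
set.seed(42)

# Stand-in for the observed counts (188 observations, as in df_cc2).
y <- rnbinom(188, mu = 50, size = 0.2)

# Stand-in for 20 predictive draws; each column is one replicated dataset.
yrep <- replicate(20, rnbinom(length(y), mu = 50, size = 0.2))

# Frequency of each count value in the detail range 0..20.
count_freq <- function(x, values = 0:20) {
  as.vector(table(factor(x, levels = values)))
}

obs_freq <- count_freq(y)                 # observed frequencies (the bars)
rep_freq <- apply(yrep, 2, count_freq)    # 21 x 20 matrix of replicate frequencies

# Per-value summary of the replicates, to be compared against obs_freq.
rep_median <- apply(rep_freq, 1, median)
rep_range  <- apply(rep_freq, 1, range)
```

A model fits the low counts well when the observed frequencies fall within the spread of the replicated frequencies, which is what the detail panels restricted to counts below 20 make visible.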
Code
check_collinearity(m1)
# Check for Multicollinearity
Moderate Correlation
                Term   VIF     VIF 95% CI  Increased SE  Tolerance  Tolerance 95% CI
        log(gdp + 1)  5.63 [ 4.46,  7.19]          2.37       0.18      [0.14, 0.22]
High Correlation
                Term   VIF     VIF 95% CI  Increased SE  Tolerance  Tolerance 95% CI
  log(distances + 1) 10.77 [ 8.41, 13.87]          3.28       0.09      [0.07, 0.12]
 log(population + 1) 28.65 [22.17, 37.12]          5.35       0.03      [0.03, 0.05]
     log(export + 1) 60.39 [46.58, 78.39]          7.77       0.02      [0.01, 0.02]
     log(import + 1) 29.20 [22.59, 37.83]          5.40       0.03      [0.03, 0.04]
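The columns of the collinearity output are linked by simple identities: the increased standard error is the square root of the VIF, and the tolerance is its reciprocal. The sketch below recomputes both from the reported VIF values. The grouping thresholds (below 5 low, 5 to 10 moderate, 10 and above high) are the conventional ones and are stated here as an assumption about how the output was grouped.

```r
# VIF values reported by check_collinearity(m1) above.
vif <- c(
  "log(gdp + 1)"        = 5.63,
  "log(distances + 1)"  = 10.77,
  "log(population + 1)" = 28.65,
  "log(export + 1)"     = 60.39,
  "log(import + 1)"     = 29.20
)

collinearity <- data.frame(
  VIF          = vif,
  Increased_SE = round(sqrt(vif), 2),  # factor by which the standard error is inflated
  Tolerance    = round(1 / vif, 2)     # reciprocal of the VIF
)

# Assumed grouping thresholds: [0, 5) low, [5, 10) moderate, [10, Inf) high.
collinearity$Correlation <- cut(
  vif,
  breaks = c(0, 5, 10, Inf),
  labels = c("low", "moderate", "high"),
  right  = FALSE
)
```

A VIF of 60.39 for the export predictor means that its standard error is roughly 7.8 times larger than it would be without correlated predictors, which is why the predictors were also modeled separately.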
Code
summary(m1)
Family: zero_inflated_negbinomial
Links: mu = log; shape = identity; zi = logit
Formula: mx ~ log(distances + 1) + log(population + 1) + log(gdp + 1) + log(export + 1) + log(import + 1)
zi ~ log(distances + 1) + log(population + 1) + log(gdp + 1) + log(export + 1) + log(import + 1)
Data: df_cc2 (Number of observations: 188)
Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
total post-warmup draws = 12000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -28.10 5.40 -38.81 -17.65 1.00 5986 7296
zi_Intercept 1.97 45.83 -58.23 127.47 1.00 2406 679
logdistancesP1 1.70 0.62 0.50 2.90 1.00 3554 5334
logpopulationP1 -1.09 0.38 -1.84 -0.37 1.00 3392 5068
loggdpP1 0.71 0.49 -0.26 1.64 1.00 3615 4968
logexportP1 1.72 0.50 0.71 2.70 1.00 3803 4800
logimportP1 -0.09 0.28 -0.62 0.46 1.00 5556 6708
zi_logdistancesP1 0.27 5.32 -17.66 5.62 1.00 1130 418
zi_logpopulationP1 3.01 3.20 0.09 12.12 1.00 862 489
zi_loggdpP1 2.15 4.51 -1.31 15.44 1.00 1059 562
zi_logexportP1 -3.24 3.86 -15.15 -0.20 1.00 823 406
zi_logimportP1 -1.24 0.92 -3.46 -0.15 1.00 1564 780
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 0.20 0.03 0.15 0.27 1.00 2303 3129
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Model 2
Code
pp_check(m2, type = "bars", ndraws = 20)
pp_check(m2, type = "bars", ndraws = 20) + xlim(-1, 20)
(a) Posterior predictive check.
(b) Posterior predictive check (detail).
Figure 13: Model 2 diagnostics.
Code
summary(m2)
Family: zero_inflated_negbinomial
Links: mu = log; shape = identity; zi = logit
Formula: mx ~ log(distances + 1)
zi ~ log(distances + 1)
Data: df_cc2 (Number of observations: 188)
Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
total post-warmup draws = 12000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 12.58 1.87 9.22 16.48 1.00 10781 7353
zi_Intercept -12.89 3.14 -19.92 -7.59 1.00 8591 4960
logdistancesP1 -0.54 0.23 -1.01 -0.13 1.00 9834 7468
zi_logdistancesP1 1.44 0.35 0.82 2.20 1.00 9251 5369
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 0.10 0.02 0.06 0.14 1.00 6430 7416
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Figure 14: Associations between geographic distance and domains with MX records.
Model m2 estimated that geographic distance associates positively with domains’ zero-inflation and negatively with the count of domains using MX records.
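To make the two link functions in the summary of m2 concrete, the following sketch converts the posterior means into more interpretable quantities. It uses point estimates only and so ignores posterior uncertainty. The example distances are hypothetical, and their unit is an assumption because it is not stated in this section.

```r
# Posterior means from summary(m2) above.
b_count <- -0.54    # logdistancesP1     (count part, log link)
zi_int  <- -12.89   # zi_Intercept       (zero-inflation part, logit link)
b_zi    <-   1.44   # zi_logdistancesP1  (zero-inflation part, logit link)

# Log link with a log-transformed predictor: multiplying (distance + 1) by k
# multiplies the expected count by k^b_count. Here k = 2 (doubling).
count_ratio_doubling <- 2^b_count

# Logit link: implied zero-inflation probability at a few example distances.
distances <- c(500, 2000, 8000)   # hypothetical values, unit assumed
zi_prob <- plogis(zi_int + b_zi * log(distances + 1))
```

A ratio below one for `count_ratio_doubling` corresponds to the negative association with the count of domains, and the increase of `zi_prob` with distance corresponds to the positive association with zero-inflation.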
Model 3
Code
pp_check(m3, type = "bars", ndraws = 20)
pp_check(m3, type = "bars", ndraws = 20) + xlim(-1, 20)
(a) Posterior predictive check.
(b) Posterior predictive check (detail).
Figure 15: Model 3 diagnostics.
Code
summary(m3)
Family: zero_inflated_negbinomial
Links: mu = log; shape = identity; zi = logit
Formula: mx ~ log(population + 1)
zi ~ log(population + 1)
Data: df_cc2 (Number of observations: 188)
Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
total post-warmup draws = 12000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.49 3.49 -5.15 8.42 1.00 6849 7259
zi_Intercept 10.61 5.82 0.16 23.09 1.00 5122 4766
logpopulationP1 0.40 0.22 -0.02 0.82 1.00 7076 7370
zi_logpopulationP1 -0.86 0.42 -1.80 -0.16 1.00 4752 4214
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 0.06 0.01 0.04 0.08 1.00 6114 6092
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Figure 16: Associations between population size and domains with MX records.
Model m3 estimated that population size associates negatively with domains’ zero-inflation and positively with the count of domains using MX records (although the 95% credible interval for the latter contains zero, so the existence of this effect is uncertain).
Model 4
Code
pp_check(m4, type = "bars", ndraws = 20)
pp_check(m4, type = "bars", ndraws = 20) + xlim(-1, 20)
(a) Posterior predictive check.
(b) Posterior predictive check (detail).
Figure 17: Model 4 diagnostics.
Model 5
Code
pp_check(m5, type = "bars", ndraws = 20)
pp_check(m5, type = "bars", ndraws = 20) + xlim(-1, 20)
(a) Posterior predictive check.
(b) Posterior predictive check (detail).
Figure 18: Model 5 diagnostics.
Model 6
Code
pp_check(m6, type = "bars", ndraws = 20)
pp_check(m6, type = "bars", ndraws = 20) + xlim(-1, 20)