Introduction

CZ.NIC uses the Hadoop stack to store traffic data from our authoritative DNS servers for .cz domain, traffic data from our ODVR public resolvers and other statistical data. Traffic data is collected on the servers by our own dns-probe tool and then sent to the Hadoop stack. For this purpose we run our own small cluster of 7-8 servers. The data processing pipeline is depicted in Figure 1.

CZ.NIC’s pipeline for processing DNS traffic data

Until this year, we had been a satisfied user of Cloudera Express, which was a free distribution of Hadoop provided by Cloudera. This distribution contained packages for most common components of the Hadoop stack and tooling, such as Cloudera Manager, to make deployment, configuration and monitoring of Hadoop easy. Unfortunately, Cloudera decided to stop offering this free distribution of Hadoop, including its already released versions. From 1st February 2021, only their subscription-based Cloudera Enterprise distribution is available.

Coincidentally, around this time we were also preparing to upgrade the server hardware of our Hadoop cluster and preparing to deploy a fresh installation of Hadoop to this new hardware. Cloudera’s changes led us to explore alternative ways of deploying Hadoop to our locally managed servers, with a strong preference for free and open-source solutions.

Hadoop ecosystem

The Hadoop ecosystem consists of the base Hadoop component and many other independently developed tools built on top of it. This makes the Hadoop stack very flexible in terms of putting together a specific set of tools for a given use case. Our Hadoop stack consisted of the following components:

Apache Hadoop (HDFS + MapReduce + YARN)
Apache Zookeeper
Apache Spark
Apache Hive
Apache Impala (replaced with Trino in our new cluster)
Apache Parquet

All these components were deployed and monitored by Cloudera Manager, which is Cloudera’s tool for provisioning Hadoop clusters.

The base Hadoop component provides a distributed HDFS filesystem, the YARN framework for job scheduling and cluster resource management, and MapReduce – a YARN-based system for parallel processing of large datasets.

Apache Zookeeper is a centralized service for maintaining configuration information, naming, taking care of distributed synchronization, and providing group services. Services like HDFS or Hive can use it to synchronize configuration data among several servers, or elect active and stand-by nodes in case of high-availability setup.

Apache Spark is an analytics engine for large-scale data processing. We leverage its R language integration for interactive operations on our data from R.

Apache Hive is a data warehouse software that facilitates reading, writing and managing large datasets in distributed HDFS storage using SQL.

Apache Impala provides low latency and high concurrency SQL queries over Hadoop data. It integrates with Hive’s data warehouse and uses its metadata to quickly query over data in HDFS.

Apache Parquet is a columnar storage format based on Google’s Dremel paper designed for efficient storage of data in HDFS and fast and efficient access to it. We store all our DNS traffic data in Hadoop in this format.

Options for Hadoop deployment

We looked into open-source alternatives of deploying Hadoop and ended up with three possible solutions:

Most of the Hadoop components are developed under Apache Software Foundation. These projects typically don’t offer pre-built packages for common Linux distributions but only release source code tarballs and sometimes tarballs with pre-built binaries (for projects written in Java). The first option of deploying plain Hadoop would then involve a lot of manual drudgery: building the components from sources and installing them on each server. Then we would also have to manually set up all configuration files for each component. This could be made easier with some automation tools like Ansible but would still require a lot of work to set everything up properly. Updating components in a cluster deployed this way would also be complicated and quite time consuming. For these reasons we considered this option only as a last resort in case we wouldn’t be able to find a better solution.

Second option we explored was Apache Ambari, which is a Hadoop provisioning tool similar to Cloudera Manager. This tool provides a web interface to deploy, configure and monitor Hadoop and its many components from a central location. Ambari was used by Hortonworks for their distribution of Hadoop before the company was aquired by Cloudera. This tool however doesn’t work on its own. It needs to be supplemented with a package repository containing Linux packages of Hadoop components and a Stack Definition – a collection of configuration files and scripts that describes in detail how exactly should Ambari install, configure and monitor each package in the Hadoop package repository. Firstly, we began searching for Hadoop package repositories, as attempting to package Hadoop components ourselves seemed like too much effort. Cloudera provides one for their Hadoop distribution, but access to it now requires a paid subscription. The only viable free repository we could find is Apache Bigtop project that aims at providing a comprehensive platform for packaging, testing and virtualization of the Hadoop stack. Most importantly for us, it offers package repositories for some of the most common Linux distributions including Debian 10, which is our current distribution of choice. Bigtop’s repositories contain packages for all Hadoop components we use except Impala. We decided that having to deploy just one Hadoop component manually was acceptable and proceeded with testing Ambari.

Unfortunately, we hit a roadblock with Ambari’s stack definition for Bigtop and then even with Ambari itself. At the time of writing, there isn’t a complete stack definition for Bigtop repositories available for Ambari. There are some efforts, e.g. here and here, to put together a stack definition for Bigtop, but none of them in a production-ready state. These years-lasting efforts together with lacking documentation of Ambari itself convinced us that it wasn’t realistic to quickly develop a stack definition on our own. On top of that, we also discovered that Ambari (even in its latest release 2.7.5) doesn’t really support Debian 10 yet, and the development activity is quite low. With some effort, we were able to compile Ambari’s source code for Debian 10, but then we weren’t able to successfully deploy even base Hadoop with it.

Discovery of Bigtop led us to consider another deployment option that we eventually adopted. This option consisted of deploying individual Hadoop packages from Bigtop repository ourselves using some automation tool like Ansible, and then a separate monitoring solution such as Icinga. Bigtop itself comes with a collection of Puppet classes for deployment. We tested these classes and even though they work quite well, we ultimately decided not to use them. We would have to make a lot of changes to adjust them to our needs and, moreover, Ansible is the tool of choice at CZ.NIC when it comes to similar automation tasks. So we ended up with creating a set of Ansible playbooks for installing and configuring Hadoop packages from Bigtop Debian 10 repository and setting up Icinga to monitor the resulting server cluster.

Deploying Hadoop stack with Apache Bigtop

For our new Hadoop stack we used a Debian 10 repository of Apache Bigtop 1.5.0 release. As mentioned earlier, this repository contains all Hadoop components we use except Impala, which will be discussed later. Our goal was to create a separate Ansible playbook for deploying each Hadoop component on all servers. Each playbook had to handle several steps:

install and set up prerequisites for the Hadoop component
install component’s package from Bigtop repository
deploy component’s configuration files prepared in advance for our use case
enable and start component’s systemd or init.d service

These steps were pretty much the same as in Bigtop’s Puppet classes that we heavily used as inspiration. The most important step was to prepare configuration files for each Hadoop component included in our cluster topology. This usually entailed specifying CPU and memory limits and enabling specific features of the given component. We again used Bigtop’s Puppet classes as an inspiration for creating our configuration files in tandem with official documentation for each Hadoop component, which usually contains minimal configuration examples. For some of the playbooks we also had to guarantee that they are idempotent, because some steps can be done only once. For example, formating an HDFS file system has to be done only during initial installation.

Impala/Trino deployment

The last Hadoop component we tried to deploy was Impala, which wasn’t packaged in Bigtop’s repositories. There aren’t even any pre-built binaries available as Impala is mostly written in C++. Our only option was to try to build Impala from source. This however proved to be beyond our capabilities. The main reason was, as with Ambari, that the latest released version of Impala at the time didn’t officially support Debian 10. We even tried to build the latest code from master branch of Impala’s repository, but the Debian 10 support issue was still present there. The build process of Impala is quite complicated and the documentation for it is very sparse, so our attempts to add Debian 10 support failed. Moreover, Impala’s build process pulls a lot of dependencies from Cloudera’s online repositories and some of them became unavailable during our build efforts. This is because even though Impala is now officially under Apache Foundation, it was originally a Cloudera project and is still very tied to the Cloudera’s ecosystem.

As our Hadoop workflow is very dependent on a fast SQL on Hadoop tool, we started to look for possible alternatives and came up with this list of open-source projects:

Our requirements for a good Impala alternative were mainly a comparable speed of SQL queries over large HDFS datasets, and a good integration with Hive Metastore’s metadata.

First option we tested was Drill, which is another Apache project in the Hadoop ecosystem. Drill’s big advantage was that it could query directly over Parquet files in HDFS without having to specify schema and table definitions in Hive. This method however proved to be quite slow, so we moved on to test its Hive integration. The speed of SQL queries over Hive tables was quite close to Impala’s performance, but a crucial deficiency turned out to be Drill’s lack of support for table views in Hive. Tested Drill version 1.18 completely lacked even read-only capabilities for table views created in Hive, which we use quite extensively.

The second and third options we tried were Presto and Trino. Presto is an SQL engine originally developed at Facebook. Trino is Presto’s fork that is currently developed by Presto’s original team independently of Facebook. Because of this, the speed of query execution in Presto and Trino was practically identical, but, more importantly, the performance was on par with Impala. Hive integration was again very similar in both and very decent, and, unlike Drill, also included the ability to read table views. Ultimately, our decision to choose Trino over Presto was mostly based on our perception of a more active community around it and the involvement of people behind the creation of Presto.

In order to facilitate Trino deployment, we again put together an Ansible playbook. Trino doesn’t come with a pre-built package in a repository, but offers tarballs with pre-built Java binaries. Our playbook downloads these pre-built binaries and deploys them to the servers, then copies our Trino configuration files and a simple systemd unit file to the servers and starts the Trino service. It also downloads and deploys a separate JAR file that serves as Trino’s command line client.

Migrating data between clusters

Once we deployed our Hadoop environment, we had to migrate our HDFS data from the old cluster to the new one. As both clusters were in the same server rack connected to a switch with 10 Gb links, we decided to simply use Hadoop’s DistCp tool to copy roughly 50 TB of HDFS data over the network. Once the HDFS data was migrated to the new cluster, we also needed to rebuild our Hive tables created on this data, which was done with a few simple CREATE TABLE and CREATE VIEW statements.

Impala to Trino migration

In our testing environment, no big differences in Trino and Impala performance have been observed. The biggest change is metadata management. Compared to Impala, Trino isn’t so tightly integrated with Hive, and it is sometimes needed to execute commands directly in Hive (e.g. creating/dropping partitions, etc.). There are also some differences in SQL syntax:

Trino has a slightly different type coercion policy, and it is sometimes needed to use CAST(... AS type) in order to get the desired result. For example, we would use count(*)/86400 AS qps to compute daily mean QPS, but in Trino count(*) is of Integer type and division by 84600 still yields Integer instead of Float. This issue was fixed by changing SQL select to cast(count(*) as double)/86400 AS qps.
Some functions are missing in Trino, but usually a suitable replacement can be found. For example, Impala’s APP_MEDIAN(...) function doesn’t exist in Trino, but we were able to get the same result with APPX_PERCENTILE(..., 0.5).
Trino uses ' instead of " for strings
PARTITION statement in SQL insert is not used in Trino

For data analysis we often use the R language. In order to access Trino in R we switched from implyr to the RPresto library.

Managing and monitoring the cluster

The only remaining features of Cloudera Express that we needed to substitute were real-time monitoring of the Hadoop cluster and automated notifications in case of an issue. The solution that we originally preferred was to use a Hadoop-specific service like Apache Ambari, but without being able to deploy it along with Hadoop we had to look into more general services. We briefly considered services like Prometheus and Nagios but ultimately decided to use Icinga, mainly because we have already been using it internally for other tasks and had some experience with it.

We deployed Icinga directly to our cluster servers. The HDFS namenodes of our cluster served as master instances of Icinga configured in high-availability setup and the remaining servers were configured as simple Icinga agents. With this setup we are able to monitor basic metrics (CPU, memory, network) for all servers as well as the state of systemd/init.d services of individual Hadoop components. For more specific monitoring of Hadoop services, there’s an extensive collection of Hadoop plugins on Icinga Exchange.

To manage individual Hadoop components (start/stop/restart services, deploy new configuration) in case of an issue, we put together a few simple scripts that again leverage Ansible to perform these operations on all or a few selected servers in the cluster.

Conclusions

We were able to successfully migrate our Hadoop cluster from Cloudera Express to a free and open-source solution consisting of four components, namely Apache Bigtop, Trino, Icinga and Ansible. We’ve been using this new cluster in production since June 2021 and are quite happy with how it works for our use case. It is quite clear, however, that deploying and managing this setup is somewhat harder than using a production-ready commercial solution, such as Cloudera’s Hadoop distribution. Our current Hadoop cluster consists of only 8 servers, which is a number that can still be reasonably managed manually, should we ever run into problems with our scripts and playbooks. Arguably, managing larger clusters in the same way might become difficult and brittle, and a larger upfront time investment might be needed in order to prepare a more robust set of Ansible playbooks or, for that matter, another automation platform of choice.

Other ADAM reports » Další reporty »

Migrating Hadoop from Cloudera to Apache Bigtop

ADAM Report 2/2021

P. Doležal, M. Andziński

2 September 2021