Hadoop: what it is and how it works
Hadoop is a platform designed to handle and analyze large amounts of data from a business-oriented perspective. It was one of the first frameworks for Big Data and is still one of the most reliable.
Hadoop is an open-source implementation of MapReduce, a programming model introduced by Google. Its operation follows that of MapReduce, which splits data processing into two distinct phases named after two functions of the Lisp language: Map and Reduce. In the Map phase each record is processed individually, while in the Reduce phase the records that share an "affinity" (a common key, depending on the computation you want to perform) are processed together.
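The two phases can be sketched with the classic word-count example. This is a toy illustration of the model in plain Python, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: process each record individually, emitting (key, value) pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group pairs that share a key, then combine their values."""
    groups = defaultdict(list)
    for key, value in pairs:  # the shuffle step: bring equal keys together
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data", "big deal"]))
# counts == {"big": 2, "data": 1, "deal": 1}
```

In a real Hadoop job the Map and Reduce functions run in parallel on many machines, with the framework handling the shuffle between them.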
What are the components of Hadoop?
Hadoop has three main components:
- HDFS: a distributed file system, designed to run across machines connected to each other over a network;
- MapReduce: the processing framework itself;
- YARN: which manages cluster resources and schedules job execution.
What analytics tools does Hadoop offer?
Hadoop offers many analytics tools, and which ones you use depends on the specific features required. Two worth mentioning are Hive and Pig. Hive is a data warehouse infrastructure that supports data summarization, queries and analysis. Pig, on the other hand, is a platform that offers a high-level language for querying data, together with the infrastructure for running those programs.
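To give a flavor of the abstraction these tools provide, here is a hypothetical Pig-style pipeline (load, filter, group, count) mimicked in plain Python; the real tools compile such high-level pipelines down to MapReduce jobs, so the user never writes Map and Reduce functions by hand:

```python
from itertools import groupby

# Toy records: (user, action) pairs, playing the role of a loaded relation.
records = [("ann", "click"), ("bob", "view"), ("ann", "view"), ("ann", "click")]

# FILTER: keep only the "click" events.
clicks = [r for r in records if r[1] == "click"]

# GROUP BY user, then COUNT each group.
clicks.sort(key=lambda r: r[0])
counts = {user: len(list(group)) for user, group in groupby(clicks, key=lambda r: r[0])}
# counts == {"ann": 2}
```

Hive offers a similar abstraction with a SQL-like syntax instead of a dataflow language.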
Commercial distributions of Hadoop: which tools are included?
The distributions include a large number of tools. Among them:
- Spark: a large-scale data processing engine, an alternative to Hadoop's MapReduce;
- Kafka: a real-time streaming platform;
- Impala: a native analytical database for Hadoop;
- Flume: used to collect, aggregate and move log data.
From Hadoop to Spark: what is the evolution?
Spark is a large-scale data processing engine and an alternative to Hadoop's MapReduce. Spark processes data held in memory, and it has been shown that, for some types of tasks, it can be up to 100 times faster than Hadoop.
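A minimal sketch of the idea behind that speed-up, in plain Python rather than Spark's own API: transformations describe a pipeline lazily, and an intermediate result can be cached in memory and reused by several subsequent computations, instead of being written to disk and re-read between jobs as a chain of MapReduce steps would require:

```python
data = range(10)  # pretend this dataset was loaded from storage once

# "Transformation": lazy, nothing is computed yet.
squares = (x * x for x in data)

# Materialize and cache the intermediate result in memory.
cached = [s for s in squares if s % 2 == 0]

# Several "actions" now reuse the in-memory cache with no re-read from disk.
total = sum(cached)    # 0 + 4 + 16 + 36 + 64 == 120
biggest = max(cached)  # 64
```

In real Spark the cached collection (an RDD or DataFrame) is partitioned across the cluster's memory, which is what makes iterative workloads so much faster than disk-based MapReduce chains.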