31 JUL 2017

How do you build an efficient Big Data architecture?

The quantity and formats of data on the network, especially gradually that more devices are connected to the Internet (devices of all kinds according to the Internet of Things – IoT – paradigm), is growing exponentially and this Big Data cannot be processed using traditional analysis techniques, such as structured databases.

Big Data is very large and complex data sets, both structured and unstructured, that needs different and adequate tools for its storage, research, transfer, analysis and visualization. In fact, data can be divided into:

  • structured data: data has a structure, like the tables of a database or, more simply, the phonebook of a mobile phone in which at each name is associated one or more telephone numbers;
  • unstructured data: like texts, documents in specific formats (excel, powerpoint, etc.), images and videos.

But what does it mean more specifically for Big Data? Generally, the term Big Data means a collection of data so large, complex and variable that it cannot be treated and managed with traditional tools. The main features of Big Data can be summarized in the famous four “V”:

  • Volume: we are talking about the amount of data not manageable in traditional databases;
  • Velocity: data is generated quickly and requires to be processed at near real-time or even real-time rhythm;
  • Variety: we are speaking about elements of different nature and formats, not necessarily structured;
  • Veracity: data quality becomes a fundamental requirement so that it can effectively have a value to allow to make predictive and preventive decisions.

All these features make that Big Data does not lend itself to a traditional type of analysis, such as through relational databases, but it requires the use of massively parallel software running in a distributed way.

Generally, computer operating systems are designed to work with data stored locally: if data has a big size and for this reason it cannot be stored on a single computer, as in the case of Big Data, then you need an ad hoc architecture to process this data. For example, if we have a few books, we can put them in one or more bookstores inside our house. If books become tens of thousands, the solution is not to have hundreds of houses each with their own bookstores: we need to build a library – an ad hoc structure, designed for this purpose, capable of managing tens of thousands of books in an appropriate manner.

In 2004, Google published an article where it described a new architecture called MapReduce, a framework designed to provide a parallel processing and distributed implementation model for processing large amounts of data. Later MapReduce algorithm was an inspiration for the creation of new solutions, such as Hadoop, an open source project by Apache, which even now is the most widespread tool for processing Big Data.

Over time, new solutions have emerged, each with its own special features: Apache Hive, which allows you to query different databases stored in a distributed manner on different machines using SQL; Apache Pig, a platform that allows the analysis of Big Data using a high-level language together with an infrastructure capable of interpreting this language, exploiting, if possible, the opportunity to parallelize the calculation. The most used is still Apache Hadoop, an open source implementation of Google MapReduce platform, which owes its name to two language commands, Map and Reduce. In Hadoop/MapReduce it is very easy to specify tasks in parallel (through the first phase, that of Map) and then how data must be aggregated (through the Reduce phase). Hadoop/MapReduce is appropriate for batch computations on large data (Volume), while for streaming data (Velocity) there are other more suitable platforms, such as Apache Spark and Apache Storm.

Usually all the architecture, software and storage for Big Data are hosted on Cloud Computing platforms, because they offer computational resources in a flexible way. Thanks to virtualization, a mechanism that allows you to use a single server to host different virtual servers, configurable and manageable in a very flexible way, cloud can offer to users resources based on their needs, in real time. If a small computer company launches a new app, which needs to interact with a server, and the server is a single machine, as soon as the number of users grows beyond the limit of those manageable by the machine, the machine crashes and the app stops working. Thanks to cloud, companies can use different virtual servers, and in case of need, they can start other servers to be able to resist peaks in users’ requests. This mechanism, that is the possibility to configure the machines in a very flexible way, is fundamental for the treatment of Big Data, where it can make comfortable experimenting with different architectural solutions before finding the one that suits your needs.

All these tools allow to convert the large data set into useful information, to catalog and manage it in the most efficient way, with the aim, in particular for companies, to extrapolate new knowledge that can guide their strategies and their business choices.

Elaborated by Lucia D’Adamo, in collaboration with Luigi Laura, supervised by Marco Pirrone