17 JAN 2018

What is a decision tree?

A decision tree is a model that allows one to make a decision based on a classification system.

Decision trees are a technique for solving certain types of classification problems. In a classification problem we generally start from a set of data called “historic”, because it is already in our possession and experts of the application domain have already classified it; the goal is then to classify new data instances.

The big advantage of decision trees is their interpretability: unlike many data analysis techniques, which are often based on weighted sums of values, the criteria used by a decision tree are very clear. Each node of the tree contains a criterion that determines which of the underlying branches to follow, until a decision is reached.

To understand how a decision tree works, we will use a classic example of a classification problem: distinguishing “authentic” e-mail messages from spam.
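To make this concrete, here is a minimal hand-written sketch of such a tree in Python. The features (number of links, number of grammatical errors, whether the sender is known) and the thresholds are hypothetical, chosen only to show how each node’s criterion routes an e-mail down a branch until a leaf, i.e. a decision, is reached.

    # A hand-written decision tree for spam filtering; the features and
    # thresholds are hypothetical, for illustration only.
    def classify_email(num_links: int, grammar_errors: int, sender_known: bool) -> str:
        if num_links > 5:                      # root node: test one criterion
            if grammar_errors > 3:             # internal node: another criterion
                return "spam"                  # leaf: final decision
            return "authentic" if sender_known else "spam"
        return "authentic"                     # leaf: final decision

    print(classify_email(num_links=8, grammar_errors=4, sender_known=False))  # -> spam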

A classification algorithm looks for regularities in the data that can be exploited to recognize new data. Returning to the problem of spam, we look for characteristic traits, such as the presence of many links or of grammatical errors, that allow us to identify an e-mail as spam with confidence.
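As a sketch of how such regularities can be learned automatically rather than hand-coded, the following example uses scikit-learn’s DecisionTreeClassifier (assuming scikit-learn is installed); the two features and the tiny “historic” data set are invented for illustration.

    # Learning a decision tree from "historic", already-classified data.
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[1, 0], [0, 1], [7, 5], [9, 2], [2, 0], [8, 6]]  # [links, errors] per e-mail
    y = ["authentic", "authentic", "spam", "spam", "authentic", "spam"]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

    # The learned rules can be printed and read, which is exactly the
    # interpretability advantage discussed above.
    print(export_text(tree, feature_names=["links", "errors"]))
    print(tree.predict([[6, 4]]))  # classify a new e-mail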

What is the relationship between data mining and decision trees?

Decision trees are one of the many techniques used in data mining, that is, the statistical analysis whose purpose is the semi-automatic extraction of knowledge hidden in voluminous databases, in order to make it available and directly usable.

What is the difference between classification and regression?

Depending on the nature of the response variable, decision trees are called classification trees or regression trees. In the first case the response variable is qualitative (e.g. Yes/No), while in the second case it is quantitative, for example a number. In a classification problem I have to assign a category to each new data instance I receive: for example, I have to decide whether an e-mail is spam or not. In a regression problem, on the other hand, I have to estimate a numerical value: for example, I have to assign a value to a used car on the basis of a database of the prices at which I sold used cars in the past.
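The difference is easy to see in code. The minimal sketch below again assumes scikit-learn; the spam features and the used-car figures (age in years, mileage in thousands of kilometres, past sale prices) are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # Classification tree: the response is qualitative (a category).
    clf = DecisionTreeClassifier().fit(
        [[8, 5], [1, 0], [9, 3], [0, 1]],          # [links, errors]
        ["spam", "authentic", "spam", "authentic"])
    print(clf.predict([[7, 4]]))   # -> a category, e.g. ['spam']

    # Regression tree: the response is quantitative (a number).
    reg = DecisionTreeRegressor().fit(
        [[2, 30], [8, 120], [5, 70], [10, 150]],   # [age, mileage]
        [15000, 4000, 9000, 2500])                 # past sale prices
    print(reg.predict([[4, 60]]))  # -> an estimated price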

What is the “random forest” technique, in brief, and what innovations has it brought?

As we have already said, one of the advantages of decision trees is that they produce clear classification rules which are easy to interpret. However, they often have poor predictive performance. To overcome this, model ensemble techniques have been developed, in which the training phase (building the tree from existing data) is carried out on different training sets chosen randomly. One of these techniques is the one called “random forest”.

The “random forest” technique consists in using several decision trees together for a given classification problem: the decision chosen by the majority of the trees is then used as the final decision. This technique provides different points of view (the individual decision trees) on the same problem, which for many problems yields better results. However, the ability to generate the rules that make a single decision tree easy to interpret is lost.
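A minimal sketch of the idea, again assuming scikit-learn and reusing the invented spam data: RandomForestClassifier trains many trees, each on a random resample of the data, and aggregates their individual predictions into one final decision.

    from sklearn.ensemble import RandomForestClassifier

    X = [[1, 0], [0, 1], [7, 5], [9, 2], [2, 0], [8, 6]]  # [links, errors], invented
    y = ["authentic", "authentic", "spam", "spam", "authentic", "spam"]

    # 100 trees, each trained on a random resample of the data.
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(forest.predict([[6, 4]]))  # the forest's aggregated decision

    # Each member tree still exists and can be inspected individually...
    print(len(forest.estimators_))   # ...but reading 100 rule sets is no longer
                                     # easy, hence the lost interpretability.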

Edited by Lucia D’Adamo, in collaboration with Luigi Laura, supervised by Marco Pirrone