Decision Tree Algorithm and Gini Index using Python
The Decision Tree algorithm is used for classification-based machine learning problems. It is one of the most popular algorithms because the final decision tree is easy to interpret and explain. More advanced ensemble methods such as random forest, bagging and gradient boosting have their roots in the decision tree algorithm. Here we will try to understand the basic mathematics behind decision trees and the concept of the Gini Index.
In this tutorial we treat the Decision Tree as a classification algorithm, so we need to know its four-step process, listed below (a short sketch of the whole loop follows the list):
- Consider all predictors and all possible cut points
- Calculate Gini Index for each possibility
- Select the one with the least Gini Index
- Continue until a stopping criterion is reached.
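As a rough sketch, the four steps can be put together in a few lines of Python. The helper names (gini_index, best_split) and the tiny dataset at the end are only illustrative, not from any library; the full calculation is developed step by step in the rest of this tutorial.

# illustrative sketch of the four-step process (names are made up for this example)
def gini_index(groups, classes):
    # weighted Gini Index of a candidate split (derived in detail below)
    n = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if len(group) == 0:
            continue
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / len(group)
            score += p * p
        gini += (1.0 - score) * (len(group) / n)
    return gini

def best_split(dataset, classes):
    best = (None, None, float('inf'))
    for index in range(len(dataset[0]) - 1):            # step 1: every predictor and cut point
        for r in dataset:
            left = [row for row in dataset if row[index] < r[index]]
            right = [row for row in dataset if row[index] >= r[index]]
            gini = gini_index([left, right], classes)   # step 2: Gini Index of this candidate
            if gini < best[2]:                          # step 3: keep the least Gini Index
                best = (index, r[index], gini)
    return best                                         # step 4: repeat on each group until a stopping criterion is met

print(best_split([[1, 0], [2, 0], [10, 1], [11, 1]], [0, 1]))   # (0, 10, 0.0)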
Suppose we have n samples in a data set representing two categories, 0 and 1. We then divide the data into two subsets in such a way that they form pure subsets, meaning that within any group or subset only one category is present for the defined set of features.
The Gini Index helps us divide the data into pure subsets.
Gini Index
The Gini Index is a loss (impurity) measure for a Decision Tree classifier. It gives a value representing how good a split is with respect to how mixed the classes are in the two groups created by the split. A perfect split is represented by a Gini score of 0, and the worst split by a score of 0.5, i.e. each group holds a 50/50 mix of classes in a two-class problem.
Suppose we are given 4 samples representing 2 categories, 0 and 1.
Feature, Label
Sample 1: [x1, 0]
Sample 2: [x2, 0]
Sample 3: [x3, 1]
Sample 4: [x4, 1]
So the predictors contain 4 samples, and initially we divide the data samples into two groups as given below.
In the process of calculating the Gini Index, the first thing we calculate is the probability of each category in each subset.
probability = count(class_value) / number of samples in the subset
In this example the probabilities are:
subset_1_class_0 = 1/2 = 0.5
subset_1_class_1 = 1/2 = 0.5
subset_2_class_0 = 1/2 = 0.5
subset_2_class_1 = 1/2 = 0.5
The Gini Index is then calculated for each subset as follows:
Gini_index = 1.0 - sum(probability * probability)
Then the Gini Index for each group is weighted by the size of the group relative to all samples in the data set, as shown below.
Gini_index = (1.0 - sum (probability * probability)) * (group_size / total_samples)
Now we will calculate Gini index for the example as mentioned below.
Gini(Subset_1) = (1.0 - (0.5*0.5 + 0.5*0.5)) * 2/4
Gini(Subset_1) = 0.5 * 0.5
Gini(Subset_1) = 0.25

Gini(Subset_2) = (1.0 - (0.5*0.5 + 0.5*0.5)) * 2/4
Gini(Subset_2) = 0.5 * 0.5
Gini(Subset_2) = 0.25
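These numbers are easy to verify directly in Python (the values below are taken straight from the worked example above):

# quick check of the worked example: two mixed subsets of size 2 out of 4 samples
p0, p1 = 0.5, 0.5                                     # class probabilities inside each subset
gini_subset = (1.0 - (p0 * p0 + p1 * p1)) * (2 / 4)   # weighted by subset size
print(gini_subset)                                    # 0.25 for each subset, so the split scores 0.25 + 0.25 = 0.5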
The scores calculated for each subset are added to get the final Gini score for the split, which is then compared with the other candidate split points to find the best possible condition, ideally with a Gini score of 0.
So the final Gini score in this case is 0.25 + 0.25 = 0.5, which is not the best one, so the candidates in the subsets need to be rearranged as follows.
In this case, the Gini Index is calculated as follows.
# probabilities
subset_1_class_0 = 2/2 = 1
subset_1_class_1 = 0/2 = 0
subset_2_class_0 = 0/2 = 0
subset_2_class_1 = 2/2 = 1

# Gini indices
Gini(Subset_1) = (1.0 - (1*1 + 0*0)) * 2/4
Gini(Subset_1) = 0.0 * 0.5
Gini(Subset_1) = 0.0

Gini(Subset_2) = (1.0 - (0*0 + 1*1)) * 2/4
Gini(Subset_2) = 0 * 0.5
Gini(Subset_2) = 0

# Final Gini score
Gini(Subset_1) + Gini(Subset_2) = 0 + 0 = 0
So we got a perfect Gini score, representing a perfect split of the data.
More data beats clever algorithms, but better data beats more data.
Peter Norvig
Let's create a program for it. Define a dataset divided into two groups, along with the categories or classes, as shown below. x1, x2, x3 and x4 are some random features, with labels represented by 0 and 1.
groups = [[['x1', 1], ['x2', 1]], [['x3', 0], ['x4', 0]]]
classes = [0, 1]
print(groups)
[[['x1', 1], ['x2', 1]], [['x3', 0], ['x4', 0]]]
Now write a line of code to count the total number of instances, i.e. the total sample size.
instances = 0
instances += sum([len(group) for group in groups])
print(instances)
4
Now we will write a program to calculate the Gini Index based on the discussion above.
# initialize
gini = 0.0
for group in groups:
    # number of elements in each group
    size = float(len(group))
    # skip any group that has been created without any values
    if size == 0:
        continue
    # calculate the score for the group
    score = 0.0
    for class_val in classes:
        p = [row[-1] for row in group].count(class_val) / size
        score += p * p
    # weight the group score by its relative size and add it to the total
    gini += (1.0 - score) * (size / instances)
print(gini)
0.0
This represents the best possible division of the data, with each group containing only one type of value, i.e. either 0 or 1.
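For contrast, running the same loop on a mixed rearrangement of the same four samples (an illustrative grouping, reusing classes and instances from above) gives the worst possible score:

# the mixed grouping: each subset now holds one sample of each class
mixed_groups = [[['x1', 1], ['x3', 0]], [['x2', 1], ['x4', 0]]]
gini = 0.0
for group in mixed_groups:
    size = float(len(group))
    score = 0.0
    for class_val in classes:
        p = [row[-1] for row in group].count(class_val) / size
        score += p * p
    gini += (1.0 - score) * (size / instances)
print(gini)   # 0.5, the worst case for a two-class problem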
Let's define a dataset to understand the split. In the following dataset the first two columns represent the features and the last column represents the label.
dataset = [[3.999208922, 1.209014212, 0],
           [8.497545867, 2.162953546, 1],
           [10.00220326, 2.339047188, 1],
           [2.728571309, 0.169761413, 0],
           [4.678319846, 1.81281357, 0],
           [11.12493903, 2.234550982, 1],
           [3.771244718, 0.784783929, 0],
           [4.961043357, 1.61995032, 0],
           [7.642287351, 2.319983761, 1],
           [8.444542326, 1.476683375, 1]]
Now we will write code to divide the data into groups. Here we have 10 samples and 2 feature indices (0 and 1), so 20 candidate divisions will be formed, each containing 2 subsets (left and right). The division with a Gini score of 0.0 will be considered the best-case division.
classes = list(set(row[-1] for row in dataset))
f_index, f_value, f_score, f_groups = 999, 999, 999, None
# loop to compare the elements of each column
for index in range(len(dataset[0]) - 1):
    for r in dataset:
        left, right = [], []
        for row in dataset:
            if row[index] < r[index]:
                left.append(row)
            else:
                right.append(row)
        groups = [left, right]
Now we will extend our code to calculate the Gini Index of each possible division of the data.
classes = list(set(row[-1] for row in dataset))
f_index, f_value, f_score, f_groups = 1, 1, 1, None
for index in range(len(dataset[0]) - 1):
    for r in dataset:
        left, right = [], []
        for row in dataset:
            if row[index] < r[index]:
                left.append(row)
            else:
                right.append(row)
        groups = [left, right]
        ni = float(sum([len(group) for group in groups]))
        gini = 0.0
        for group in groups:
            size = float(len(group))
            if size == 0:
                continue
            score = 0.0
            for class_val in classes:
                p = [row[-1] for row in group].count(class_val) / size
                score += p * p
            gini += (1.0 - score) * (size / ni)
        print('X%d < %.3f Gini=%.3f' % ((index + 1), r[index], gini))
        if gini < f_score:
            f_index, f_value, f_score, f_groups = index, r[index], gini, groups
print('index', f_index, 'value', f_value, 'groups', f_groups)
# Output
X1 < 3.999 Gini=0.375
X1 < 8.498 Gini=0.286
X1 < 10.002 Gini=0.375
X1 < 2.729 Gini=0.500
X1 < 4.678 Gini=0.286
X1 < 11.125 Gini=0.444
X1 < 3.771 Gini=0.444
X1 < 4.961 Gini=0.167
X1 < 7.642 Gini=0.000
X1 < 8.445 Gini=0.167
X2 < 1.209 Gini=0.375
X2 < 2.163 Gini=0.167
X2 < 2.339 Gini=0.444
X2 < 0.170 Gini=0.500
X2 < 1.813 Gini=0.320
X2 < 2.235 Gini=0.286
X2 < 0.785 Gini=0.444
X2 < 1.620 Gini=0.417
X2 < 2.320 Gini=0.375
X2 < 1.477 Gini=0.286
index 0 value 7.642287351 groups [[[3.999208922, 1.209014212, 0], [2.728571309, 0.169761413, 0], [4.678319846, 1.81281357, 0], [3.771244718, 0.784783929, 0], [4.961043357, 1.61995032, 0]], [[8.497545867, 2.162953546, 1], [10.00220326, 2.339047188, 1], [11.12493903, 2.234550982, 1], [7.642287351, 2.319983761, 1], [8.444542326, 1.476683375, 1]]]
Here we can conclude that splitting the data on x1 < 7.642 is the best-case scenario, because its Gini Index is 0. Along the way the code keeps track of the minimum Gini Index seen so far, replacing f_index with the feature index that gives it, f_value with the corresponding cut value and f_groups with the groups formed by that split.
Now we can make predictions by comparing the features of an unknown input value with the cut point responsible for the 0.0 Gini Index and the pure subsets.
Suppose k is a value with two features for which we have to make a prediction. We can predict as follows, using the f_index responsible for the best-case split of the data points.
k = [7.8319846, 2.81281357]
if k[f_index] < f_value:
    g = 0
    pred = [row[-1] for row in f_groups[g]]
    print(max(pred, key=pred.count))
else:
    g = 1
    pred = [row[-1] for row in f_groups[g]]
    print(max(pred, key=pred.count))
# Output
1
This shows that the value represented by k belongs to category 1.
Decision Tree Classifier Using Scikit-Learn
Scikit-Learn provides a built-in class, DecisionTreeClassifier, which can be used to train a model and make predictions in just 4 lines of code. It needs separate variables for the features and the labels, so first we will separate our data into features and labels as follows:
features = []
labels = []
for row in dataset:
    features.append(row[0:2])
    labels.append(row[-1])
Import the library and assign an object as the model, then train the model using the .fit() function.
from sklearn.tree import DecisionTreeClassifier

dmodel = DecisionTreeClassifier()
dmodel.fit(features, labels)
Plot the tree using Scikit-Learn's built-in export_graphviz together with pydotplus.
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = export_graphviz(dmodel)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
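If Graphviz and pydotplus are not installed, Scikit-Learn (version 0.21 and later) also ships a matplotlib-based plot_tree function that can be used instead:

# alternative: matplotlib-based tree plotting shipped with Scikit-Learn 0.21+
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plot_tree(dmodel, filled=True)
plt.show()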
Make a prediction for the same value of k defined in the first case and compare the predicted output. Is it the same?
print(dmodel.predict([k]))

# Output
[1]
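As a small sanity check, the hand-rolled prediction and the Scikit-Learn prediction for the same point k can be compared directly (this reuses f_index, f_value and f_groups from the earlier section):

# compare the hand-rolled split with Scikit-Learn's prediction for the same point k
g = 0 if k[f_index] < f_value else 1
labels_in_group = [row[-1] for row in f_groups[g]]
manual_pred = max(labels_in_group, key=labels_in_group.count)
print(manual_pred == dmodel.predict([k])[0])   # True: both predict class 1

Both approaches place k in category 1, confirming the result above.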
Thanks for reading. Please also read the Decision Tree Classifier tutorial explaining the stopping criteria for the growth of trees.
We welcome your suggestions. Keep Learning….