Solution of Chapter 6: Decision Tree

Anjan Debnath
4 min read · Dec 18, 2021

The decision tree algorithm builds a model from the training set by recursively splitting it into subsets (nodes) until it reaches a predefined maximum depth, or until it cannot find a split that reduces impurity.

[Figure: a Decision Tree trained on the iris dataset, showing each node’s samples, value, and gini attributes]

A node’s samples attribute counts how many training instances it applies to. For example, 100 training instances have a petal length greater than 2.45 cm (depth 1, right), among which 54 have a petal width smaller than 1.75 cm (depth 2, left).

A node’s value attribute tells you how many training instances of each class this node applies to: for example, the bottom-right node applies to 0 Iris-Setosa, 1 Iris-Versicolor, and 45 Iris-Virginica.

A node’s Gini attribute measures its impurity: a node is “pure” (gini=0) if all training instances it applies to belong to the same class. For example, since the depth-1 left node applies only to Iris-Setosa training instances, it is pure and its gini score is 0.
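To make this concrete, here is a minimal sketch of the Gini computation (the gini helper is mine; only the formula itself, one minus the sum of squared class proportions, comes from the book), applied to the class counts quoted above:

def gini(class_counts):
    # Gini impurity: 1 - sum over classes of (count / total)^2
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

# Bottom-right node: 0 Iris-Setosa, 1 Iris-Versicolor, 45 Iris-Virginica
print(gini([0, 1, 45]))   # ~0.0425

# Depth-1 left node: all instances are Iris-Setosa, so the node is pure
print(gini([50, 0, 0]))   # 0.0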

1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?

The depth of a well-balanced binary tree containing m leaves is log2(m), rounded up. A binary Decision Tree (one that makes only binary decisions, as Scikit-Learn’s CART algorithm does), if trained without restrictions, ends up more or less well balanced, with one leaf per training instance. So training it on a set with 1 million instances gives a depth of roughly log2(10^6) ≈ 20 (in practice a bit more, since the tree will generally not be perfectly balanced).
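A quick sanity check of that estimate in Python:

import math

# Approximate depth of an unrestricted tree with one leaf per instance
m = 1_000_000
print(math.ceil(math.log2(m)))  # 20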

2. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?

A node’s Gini impurity is generally lower than its parent’s. This is due to the CART training algorithm’s cost function, which splits each node in a way that minimizes the weighted sum of its children’s Gini impurities. However, it is not always lower: a child can be more impure than its parent, as long as the increase is more than compensated by a decrease in the other child’s impurity, as the small example below shows.
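Here is a small worked example (adapted from the book’s appendix answer): a parent node with four instances of class A and one of class B is split so that one child gets {A, B} and the other gets {A, A, A}. The first child is more impure than its parent, yet the weighted sum of the children’s impurities still drops, which is all CART’s cost function requires:

def gini(class_counts):
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

parent = gini([4, 1])                      # 0.32
left, right = gini([1, 1]), gini([3, 0])   # 0.5 and 0.0

print(left > parent)                       # True: this child is more impure
weighted = (2 / 5) * left + (3 / 5) * right
print(weighted < parent)                   # True: 0.2 < 0.32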

3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom during training. The regularization hyperparameters depend on the algorithm used, but generally, you can at least restrict the maximum depth of the Decision Tree.

So yes, reducing max_depth will regularize the model and thus reduce the risk of overfitting, as the sketch below illustrates.
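As an illustration, here is a minimal sketch on the iris dataset (the variable names are mine):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: keeps splitting until every leaf is pure, so it can overfit
deep_tree = DecisionTreeClassifier().fit(X, y)

# Regularized tree: max_depth caps how far the tree can grow
shallow_tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

print(deep_tree.get_depth(), shallow_tree.get_depth())  # e.g. 5 vs 3

Other regularization hyperparameters include min_samples_split, min_samples_leaf, and max_leaf_nodes.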

4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don’t require feature scaling or centering at all. So no: if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.

5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees (also called “growing” trees).

The idea is quite simple: the algorithm first splits the training set into two subsets using a single feature k and a threshold t_k. Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively (a toy sketch of this split search follows the list below). It stops recursing once

  • it reaches the maximum depth (defined by the max_depth hyperparameter), or
  • it cannot find a split that will reduce impurity.
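To make the split search concrete, here is a toy, unoptimized sketch of what happens at a single node (this is not Scikit-Learn’s actual implementation; the function names are mine). It exhaustively tries every feature k and threshold t_k and keeps the pair minimizing the cost J = (m_left/m) × G_left + (m_right/m) × G_right:

import numpy as np

def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)

def best_split(X, y):
    # Try every (feature k, threshold t_k) pair and keep the one that
    # minimizes J = (m_left/m) * G_left + (m_right/m) * G_right
    m, n = X.shape
    best_k, best_t, best_cost = None, None, float("inf")
    for k in range(n):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if cost < best_cost:
                best_k, best_t, best_cost = k, t, cost
    return best_k, best_t, best_cost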

The CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level, without checking whether or not that split will lead to the lowest possible impurity several levels down. (Finding the truly optimal tree, by contrast, is known to be an NP-Complete problem: it requires O(exp(m)) time, making it intractable even for fairly small training sets, which is why we settle for a “reasonably good” greedy solution.) The training complexity of CART is O(n × m log2(m)), where m is the number of training instances and n the number of features.

So going from 1 million to 10 million instances multiplies the training time by K = (n × 10m × log2(10m)) / (n × m × log2(m)) = 10 × log2(10m) / log2(m). With m = 10^6, K ≈ 11.7. So if training on 1 million instances takes 1 hour, training on 10 million instances will take roughly 11.7 hours.
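A quick check of that arithmetic (the K formula above, in code):

import math

m1, m2 = 10**6, 10**7
ratio = (m2 * math.log2(m2)) / (m1 * math.log2(m1))
print(round(ratio, 2))  # ~11.67, so roughly 11.7 hours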

6. If your training set contains 100,000 instances, will setting presort=True speed up training?

For small training sets (fewer than a few thousand instances), Scikit-Learn can speed up training by presorting the data (set presort=True), but this slows down training considerably for larger training sets. So no: with 100,000 instances, setting presort=True will considerably slow down training.

(As an aside, making predictions is fast in either case: it only requires traversing the Decision Tree from the root to a leaf, and since each node checks the value of just one feature, the overall prediction complexity is O(log2(m)), independent of the number of features. So predictions are very fast, even on large training sets.)
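In code, on a Scikit-Learn version that still supports the parameter (presort was deprecated in 0.22 and removed in 0.24, so this only applies to older versions):

from sklearn.tree import DecisionTreeClassifier

# Only worthwhile for small training sets (a few thousand instances or fewer)
small_set_clf = DecisionTreeClassifier(presort=True)

# For 100,000 instances, leave the default (presort=False)
large_set_clf = DecisionTreeClassifier(presort=False)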

Reference: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed., O’Reilly Media, 2019. https://www.knowledgeisle.com/wp-content/uploads/2019/12/2-Aur%C3%A9lien-G%C3%A9ron-Hands-On-Machine-Learning-with-Scikit-Learn-Keras-and-Tensorflow_-Concepts-Tools-and-Techniques-to-Build-Intelligent-Systems-O%E2%80%99Reilly-Media-2019.pdf
