Solutions to the Exercises of Chapter 7: Ensemble Learning and Random Forests

Anjan Debnath
5 min read · Dec 21, 2021
Ensemble prediction

Ensemble learning

A group of predictors (e.g., models such as classifiers) is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

You can train a group of Decision Tree classifiers, each on a different random subset of the training set.

To make predictions, you just obtain the predictions of all individual trees, then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.

Random Forests are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection.
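As a minimal sketch of this (the iris dataset and hyperparameter values here are only illustrative), scikit-learn exposes feature importances on a trained Random Forest:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest and read off how much each feature contributes.
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)

for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```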

1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

Suppose you have a slightly biased coin that has a 51% chance of coming up heads and a 49% chance of coming up tails. If you toss it 1,000 times, you will generally get more or less 510 heads and 490 tails, and hence a majority of heads.

If you do the math, you will find that the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%).

This is due to the law of large numbers: as you keep tossing the coin, the ratio of heads gets closer and closer to the probability of heads (51%).

However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case since they are trained on the same data.

So, YES: combining them (for example, in a voting ensemble) will often give better results, thanks to the law of large numbers, though the improvement is limited here because the models are trained on the same data and will make correlated errors.
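A minimal sketch of the coin-toss math above, assuming SciPy is available:

```python
from scipy.stats import binom

# Probability of getting a strict majority of heads (more heads than tails)
# with a coin that lands heads 51% of the time.
p_1000 = 1 - binom.cdf(500, 1000, 0.51)     # P(X >= 501) out of 1,000 tosses
p_10000 = 1 - binom.cdf(5000, 10000, 0.51)  # P(X >= 5001) out of 10,000 tosses
print(f"1,000 tosses:  {p_1000:.3f}")
print(f"10,000 tosses: {p_10000:.3f}")
```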

2. What is the difference between hard and soft voting classifiers?

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. Somewhat surprisingly, this voting classifier often achieves higher accuracy than the best classifier in the ensemble. (Based on majority voting over the predicted classes.)

If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. (Based on averaging the predicted class probabilities of all the individual classifiers.)

So, soft voting often achieves higher performance than hard voting because it gives more weight to highly confident votes.
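A minimal sketch with scikit-learn's VotingClassifier (the dataset and choice of estimators here are only illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# voting="hard" counts class votes; voting="soft" averages predict_proba() outputs.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # probability=True is needed for soft voting
    ],
    voting="soft",
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```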

3. Is it possible to speed up the training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?

In bagging and pasting, the predictors are all independent, so they can be trained in parallel via different CPU cores or even different servers; predictions can be made in parallel too. This is one of the reasons why bagging and pasting are such popular methods: they scale very well. The same applies to Random Forests (which are essentially bagged Decision Trees) and to the predictors within a single layer of a stacking ensemble, although each stacking layer can only be trained after the previous one. Boosting, on the other hand, builds each predictor based on the previous one, so training is inherently sequential and cannot be sped up by distributing it across servers.

So, YES for bagging, pasting, and Random Forests; partially for stacking (within each layer); and NO for boosting.
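A minimal sketch of parallel bagging in scikit-learn (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# n_jobs=-1 trains the 500 trees on all available CPU cores in parallel.
# Distributing across multiple servers would additionally require a parallel
# backend (e.g. joblib with dask), which is outside this sketch.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,   # bagging; bootstrap=False would give pasting
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X, y)
```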

4. What is the benefit of out-of-bag evaluation?

One way to get a diverse set of classifiers is to use very different training algorithms. Another approach is to use the same training algorithm but to train them on different random subsets of the training set.

When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).

When sampling is performed without replacement, it is called pasting.

training set sampling

With bagging, some instances may be sampled several times for any given predictor (e.g., a classifier), while others may not be sampled at all. By default, a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set.

This means that only about 63% of the training instances are sampled on average for each predictor (as m grows, this ratio approaches 1 − exp(−1) ≈ 63.212%).

The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors.

So, during bagging, since a predictor never sees the oob instances during training, it can be evaluated on these instances without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.
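A minimal sketch of out-of-bag evaluation with scikit-learn (dataset and hyperparameters illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# oob_score=True requests an automatic out-of-bag evaluation after training,
# so no separate validation set is needed.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # estimated accuracy on unseen data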

5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

Sampling both the training instances and the features is called the Random Patches method.

Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but only sampling the features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method.
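A minimal sketch of the Random Subspaces configuration with a BaggingClassifier (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Random Subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=False, max_samples=1.0,            # all training instances
    bootstrap_features=True, max_features=0.5,   # random subsets of features
    n_jobs=-1,
    random_state=42,
)
subspaces_clf.fit(iris.data, iris.target)
```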

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. The Random Forest algorithm introduces extra randomness when growing trees.

Extra-Trees: when you are growing a tree in a Random Forest (trained via the bagging method), only a random subset of the features is considered for splitting at each node of the Decision Tree. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds.

So, by using random thresholds for each feature rather than searching for the best possible thresholds, Extra-Trees become more random. This extra randomness acts like a form of regularization: it trades a bit more bias for lower variance, which often improves generalization. It also makes Extra-Trees much faster to train than regular Random Forests, since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
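A minimal sketch comparing the two (same API, illustrative dataset and settings):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# ExtraTreesClassifier has the same API as RandomForestClassifier, but splits
# use random thresholds instead of searching for the best one, so training is faster.
extra_clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)

print(cross_val_score(extra_clf, X, y, cv=3).mean())
print(cross_val_score(rnd_clf, X, y, cv=3).mean())
```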

6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?

AdaBoost works by tweaking the instance weights at every iteration, training each new predictor to pay more attention to the instances its predecessor got wrong.

If your AdaBoost ensemble underfits the training set, you can try increasing the number of estimators, reducing the regularization hyperparameters of the base estimator (for example, allowing deeper trees), or slightly increasing the learning rate. Conversely, if it is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator.
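A minimal sketch of these tweaks (the dataset and the specific values are only illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# If the ensemble underfits: more estimators, a less constrained base
# estimator (deeper trees), and/or a slightly higher learning rate.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),  # less regularized than a decision stump
    n_estimators=500,                     # more estimators
    learning_rate=0.5,                    # illustrative value
    random_state=42,
)
ada_clf.fit(X, y)
```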
