Solutions to the exercises [Chapter 4: Training Models]: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Anjan Debnath
7 min readNov 24, 2021

Chapter-4: Training Models


1. What Linear Regression training algorithm can you use if you have a training set with millions of features?

Both the Normal Equation and the Singular Value Decomposition (SVD) approach get very slow when the number of features grows large (e.g., 100,000). Gradient Descent, on the other hand, scales well with the number of features: training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent (Batch, Mini-batch, or Stochastic) than using the Normal Equation or the SVD approach.
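To illustrate, here is a minimal sketch of the Gradient Descent route using Scikit-Learn's SGDRegressor on a synthetic wide dataset (the dataset shape and the hyperparameters are illustrative assumptions, not part of the exercise):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic regression problem with many features, standing in for the
# "millions of features" scenario on a smaller scale.
X, y = make_regression(n_samples=1_000, n_features=10_000,
                       noise=1.0, random_state=42)

# Stochastic Gradient Descent scales with the number of features,
# whereas the Normal Equation / SVD approach would be far slower here.
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None,
                       eta0=0.01, random_state=42)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_[:5])
```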

2. Suppose the features in your training set have very different scales. What algorithms might suffer from this, and how? What can you do about it?

If an algorithm does not use feature scaling, it can treat the value 3,000 (meters) as larger than 5 (km), even though 3,000 meters is actually less than 5 km; in that case the algorithm will give wrong predictions. So we use feature scaling to bring all features to similar magnitudes and avoid this issue.

The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function. Concretely, you start by filling θ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE) until the algorithm converges to a minimum.

When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
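In practice, this usually means putting the scaler and the Gradient Descent model into a single pipeline. A minimal sketch (the hyperparameters and the X_train/y_train names are placeholders):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Scale every feature to zero mean and unit variance before Gradient Descent,
# so all features have a similar scale and training converges faster.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
)
# model.fit(X_train, y_train)   # X_train / y_train are placeholder names
```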

3. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?

If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time. On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

Two main challenges with Gradient Descent: depending on where the random initialization starts the algorithm, it may converge to a local minimum, which is not as good as the global minimum.

It may also start on a long plateau, which takes a very long time to cross; if you stop too early you will never reach the global minimum.

However, the Logistic Regression cost function (log loss) is convex, so it has no local minima, only a single global minimum. Gradient Descent is guaranteed to approach the global minimum arbitrarily closely (if you wait long enough and the learning rate is not too high), so it cannot get stuck in a local minimum.

4. Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?

When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution, but you may have to wait a while. Stochastic GD and Mini-batch GD, however, keep bouncing around the minimum unless you gradually reduce the learning rate, so even if you let them run for a very long time they will produce slightly different models. Only when the problem is convex and the learning rate is reduced gradually will all Gradient Descent algorithms end up very close to the same model.

5. Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

If the validation error consistently goes up after every epoch, one likely possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, that is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then the model is overfitting the training set and you should stop training.

One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.

If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.

6. Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

No. Due to its random nature, Mini-batch Gradient Descent is not guaranteed to make progress at every iteration, so the validation error may rise temporarily even while the model is still improving overall. If you stop immediately when the validation error goes up, you may stop far too early, before the optimum is reached. A better option is to save the model at regular intervals and, when it has not improved for a long time, revert to the best saved model.
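A minimal sketch of this "keep the best model" strategy on toy data (the dataset, the number of epochs, and the hyperparameters are illustrative assumptions):

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy linear data standing in for a real (scaled) training set.
np.random.seed(42)
X = 2 * np.random.rand(200, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

# max_iter=1 + warm_start=True: each call to fit() runs one more epoch
# and continues from where the previous call left off.
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_val_error = float("inf")
best_model = None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:      # keep a copy of the best model seen so far
        best_val_error = val_error      # instead of stopping at the first uptick
        best_model = deepcopy(sgd_reg)  # in the (noisy) validation error
```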

7. Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?

Stochastic Gradient Descent will reach the vicinity of the optimal solution the fastest, since it uses only one training instance per iteration (Mini-batch GD is close behind). However, with Stochastic and Mini-batch Gradient Descent the learning curves are not smooth, and it may be hard to know whether you have actually reached the minimum: they keep bouncing around it instead of settling.

Only Batch Gradient Descent actually converges: when the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution, but you may have to wait a while. When the cost function is very irregular, the randomness of Stochastic (and Mini-batch) Gradient Descent can actually help the algorithm jump out of local minima, so it has a better chance of finding the global minimum than Batch Gradient Descent does. The way to make these algorithms converge as well is to gradually reduce the learning rate: the steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum.
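A sketch of Stochastic Gradient Descent with such a learning schedule, on toy data (the data and the t0/t1 schedule values are illustrative assumptions):

```python
import numpy as np

# Toy data: y ≈ 4 + 3x plus noise (illustrative only).
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]        # add the bias input x0 = 1

n_epochs = 50
t0, t1 = 5, 50                          # learning schedule hyperparameters

def learning_schedule(t):
    # Gradually shrink the learning rate as training progresses.
    return t0 / (t + t1)

theta = np.random.randn(2, 1)           # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # MSE gradient, one instance
        eta = learning_schedule(epoch * m + i)        # learning rate decays over time
        theta = theta - eta * gradients

print(theta)   # should end up close to [[4.], [3.]]
```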

8. Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model. However, if you used a much larger training set, the two curves would continue to get closer.
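To see this gap yourself, you can plot the learning curves (error as a function of training set size). A minimal sketch on toy data, with a deliberately high-degree polynomial model so that it overfits (the data and the degree are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def plot_learning_curves(model, X, y):
    # Plot training and validation RMSE as the training set grows.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="training set")
    plt.plot(np.sqrt(val_errors), "b-", label="validation set")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()

# Toy quadratic data; a degree-10 polynomial regression overfits it,
# producing a visible gap between the training and validation curves.
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)
plot_learning_curves(make_pipeline(PolynomialFeatures(degree=10), LinearRegression()), X, y)
```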

Three ways to solve this: regularize the model (e.g., with Ridge Regression, Lasso Regression, or Elastic Net, which implement three different ways to constrain the weights), simplify it (e.g., reduce the polynomial degree), or feed it more training data.

9. Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

Both curves have reached a plateau; they are close and fairly high. The model suffers from high bias.

Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance.

The hyperparameter α controls how much you want to regularize the model. If α = 0 then Ridge Regression is just Linear Regression.

Increasing α leads to flatter (i.e., less extreme, more reasonable) predictions; this reduces the model’s variance but increases its bias. So, reducing hyperparameter α will reduce the bias.
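A minimal sketch of how α behaves with Scikit-Learn's Ridge on toy data (the α values and the data are illustrative assumptions): notice how a large α shrinks the learned weight toward zero (flatter predictions, higher bias), while a small α leaves the model more flexible.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy linear data (illustrative only).
np.random.seed(42)
X = 3 * np.random.rand(20, 1)
y = 1 + 0.5 * X[:, 0] + np.random.randn(20) / 2

# Smaller alpha = less regularization (lower bias, higher variance);
# larger alpha = heavier regularization (flatter model, higher bias).
for alpha in (0.01, 1.0, 100.0):
    ridge_reg = Ridge(alpha=alpha)
    ridge_reg.fit(X, y)
    print(f"alpha={alpha}: intercept={ridge_reg.intercept_:.2f}, coef={ridge_reg.coef_}")
```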

10. Why would you want to use:

• Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?

• Lasso instead of Ridge Regression?

• Elastic Net instead of Lasso?

  • Regularization (as in Ridge Regression) forces the learning algorithm to not only fit the data but also keep the model weights as small as possible, so a model with a little regularization typically generalizes better than plain Linear Regression (all three regularized models are sketched in code after this list).
  • An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero).
  • Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
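For reference, a sketch of how the three regularized models are used in Scikit-Learn, on toy data where one feature carries no signal (the α and l1_ratio values are illustrative assumptions); note how Lasso typically drives the useless feature's weight to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Toy data: feature 0 is informative, feature 1 is pure noise (illustrative only).
np.random.seed(42)
X = np.random.randn(50, 2)
y = 3 * X[:, 0] + np.random.randn(50) * 0.1

for model in (Ridge(alpha=1.0),                       # L2 penalty: shrinks all weights
              Lasso(alpha=0.1),                       # L1 penalty: can zero out weights
              ElasticNet(alpha=0.1, l1_ratio=0.5)):   # mix of L1 and L2 penalties
    model.fit(X, y)
    print(type(model).__name__, model.coef_)
```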

11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?).

If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.

The logistic — noted σ(·) — is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1. Once the Logistic Regression model has estimated the probability p = hθ(x) that an instance x belongs to the positive class, it can make its prediction ŷ easily:

Logistic Regression model prediction: ŷ = 0 if p < 0.5, and ŷ = 1 if p ≥ 0.5.
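A tiny illustrative sketch of the sigmoid and this decision rule (not from the book's code):

```python
import numpy as np

def sigmoid(t):
    # Logistic (sigmoid) function: maps any real number into (0, 1).
    return 1 / (1 + np.exp(-t))

def predict(p, threshold=0.5):
    # Binary decision rule: class 1 when the estimated probability reaches the threshold.
    return 1 if p >= threshold else 0

print(sigmoid(0.0))            # 0.5: right on the decision boundary
print(predict(sigmoid(2.3)))   # high score t -> probability > 0.5 -> class 1
print(predict(sigmoid(-2.3)))  # low score t  -> probability < 0.5 -> class 0
```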

SoftMax Regression: The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput) so it should be used only with mutually exclusive classes such as different types of plants. You cannot use it to recognize multiple people in one picture. Let’s use Softmax Regression to classify the iris flowers into all three classes.
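Along the lines of the chapter's iris example, a minimal Softmax Regression sketch (the C value is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]   # petal length and petal width
y = iris.target        # three mutually exclusive classes

# With the "lbfgs" solver, LogisticRegression handles more than two classes
# using Softmax (multinomial) Regression; C controls the regularization strength.
softmax_reg = LogisticRegression(solver="lbfgs", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))        # most likely class for a new flower
print(softmax_reg.predict_proba([[5, 2]]))  # estimated probability for each class
```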

Notice that the model can predict a class that has an estimated probability below 50%. For example, at the point where all decision boundaries meet, all classes have an equal estimated probability of 33%.

Outdoor/indoor and daytime/nighttime are not mutually exclusive classes (all four combinations are possible), so Softmax Regression is not the right choice here. Each question is its own binary decision, so you should train two Logistic Regression classifiers: one for outdoor vs. indoor and one for daytime vs. nighttime.
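A hedged sketch of that two-classifier setup with stand-in data (the feature matrix and labels below are random placeholders, not real images):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: 100 "pictures" described by 5 features each, plus two
# independent binary labels (placeholders only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y_outdoor = rng.integers(0, 2, size=100)   # 1 = outdoor, 0 = indoor
y_daytime = rng.integers(0, 2, size=100)   # 1 = daytime, 0 = nighttime

# One independent binary Logistic Regression classifier per question.
outdoor_clf = LogisticRegression().fit(X, y_outdoor)
daytime_clf = LogisticRegression().fit(X, y_daytime)

new_picture = rng.normal(size=(1, 5))
print(outdoor_clf.predict(new_picture), daytime_clf.predict(new_picture))
```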

Reference:

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd edition, O'Reilly Media, 2019. Available at: https://www.knowledgeisle.com/wp-content/uploads/2019/12/2-Aur%C3%A9lien-G%C3%A9ron-Hands-On-Machine-Learning-with-Scikit-Learn-Keras-and-Tensorflow_-Concepts-Tools-and-Techniques-to-Build-Intelligent-Systems-O%E2%80%99Reilly-Media-2019.pdf
