Solutions to the Exercises [Chapter 5: Support Vector Machines]

Anjan Debnath
4 min read · Dec 11, 2021

Chapter 5: SVM


1. What is the fundamental idea behind Support Vector Machines?

A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection.

The fundamental idea of an SVM classifier is fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification. Notice that adding more training instances “off the street” will not affect the decision boundary at all: it is fully determined (or “supported”) by the instances located on the edge of the street. These instances are called the support vectors (they are circled).

Figure: SVM's widest possible decision boundary
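Here is a minimal sketch of this idea (my own illustration, not from the book) using Scikit-Learn's SVC on two iris petal features. With a linear kernel and a very large C (approximating a hard margin), the street's full width can be read off the learned weights as 2/‖w‖:

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length, petal width
y = iris.target

# Keep only setosa and versicolor, which are linearly separable
mask = (y == 0) | (y == 1)
X, y = X[mask], y[mask]

svm_clf = SVC(kernel="linear", C=1e9)  # very large C approximates a hard margin
svm_clf.fit(X, y)

# The street's full width is 2 / ||w||, where w are the learned weights
w = svm_clf.coef_[0]
print("Street width:", 2 / np.linalg.norm(w))
print("Support vectors:\n", svm_clf.support_vectors_)
```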

By contrast, a probabilistic binary classifier such as Logistic Regression has a decision boundary where both class probabilities are equal to 50% (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class, labeled "1"); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled "0").

Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature. The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm. Notice that there is a bit of overlap.

Figure: Decision boundary

Above about 2 cm the classifier is highly confident that the flower is an Iris-Virginica (it outputs a high probability to that class), while below 1 cm it is highly confident that it is not an Iris-Virginica (high probability for the “Not Iris-Virginica” class).

In between these extremes, the classifier is unsure. However, if you ask it to predict the class, it will return whichever class is the most likely. Therefore, there is a decision boundary at around 1.6 cm where both probabilities are equal to 50%: if the petal width is higher than 1.6 cm, the classifier will predict that the flower is an Iris-Virginica, or else it will predict that it is not.
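This boundary is easy to reproduce. The sketch below (my own, following the book's Scikit-Learn style) trains a Logistic Regression classifier on the petal-width feature alone and locates the point where the estimated probability crosses 50%:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris.data[:, 3:]                # petal width (cm)
y = (iris.target == 2).astype(int)  # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Scan petal widths from 0 to 3 cm and find where P(virginica) reaches 50%
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]
print(decision_boundary)  # roughly 1.6 cm
```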

The problem here is that the decision boundary comes so close to the training instances that the model will probably not perform as well on new instances. The SVM's wide street is designed to avoid exactly this.

2. What is a support vector?

The SVM classifier model best fits the widest possible street between the classes. It is fully determined (or “supported”) by the instances located on the edge of the street. These edge instances are called the support vectors.

Figure: Support vectors
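In Scikit-Learn (the book's library of choice; the setup below is my own sketch), the fitted estimator exposes these instances directly:

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]            # petal length, petal width
y = (iris.target == 2).astype(int)  # Iris-Virginica vs. the rest

svm_clf = SVC(kernel="linear", C=1.0)
svm_clf.fit(X, y)

print(svm_clf.support_vectors_)  # the instances on the edge of the street
print(svm_clf.n_support_)        # number of support vectors per class
```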

3. Why is it important to scale the inputs when using SVMs?

SVMs try to fit the widest possible street between the classes, so they are sensitive to the feature scales: if one feature has a much larger scale than the others, the SVM will largely neglect the smaller features. After feature scaling (e.g., with Scikit-Learn's StandardScaler), the decision boundary looks much better.

Figure: Sensitivity to feature scaling
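A minimal sketch of the recommended setup, assuming Scikit-Learn: wrap the scaler and the SVM in a single Pipeline so that scaling is always applied before both training and prediction:

```python
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]            # petal length, petal width
y = (iris.target == 2).astype(int)  # Iris-Virginica vs. the rest

svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),  # rescale each feature to zero mean, unit variance
    ("linear_svc", LinearSVC(C=1)),
])
svm_pipeline.fit(X, y)
```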

4. Can an SVM classifier output a confidence score when it classifies an instance? What about probability?

The objective of the SVM classifier is to try to fit the largest possible street between two classes while limiting margin violations.

SVM Regression, by contrast, tries to fit as many instances as possible on the street while limiting margin violations.

An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score.

Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class by default. A Logistic Regression classifier uses the 50% probability mark as its decision boundary; Softmax Regression generalizes this to multiple classes, so the winning class may have an estimated probability below 50%, since the probability mass is spread across more than two classes.
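A short sketch, assuming Scikit-Learn's SVC: decision_function() returns the signed distance to the boundary (usable as a confidence score), and constructing the model with probability=True makes it calibrate class probabilities via cross-validation, enabling predict_proba():

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]            # petal length, petal width
y = (iris.target == 2).astype(int)  # Iris-Virginica vs. the rest

svm_clf = SVC(kernel="rbf", probability=True)  # probability=True enables predict_proba()
svm_clf.fit(X, y)

x_test = [[5.0, 1.8]]
print(svm_clf.decision_function(x_test))  # signed distance to the boundary (confidence score)
print(svm_clf.predict_proba(x_test))      # calibrated class probabilities
```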

5. Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?

This question applies only to linear SVMs, since kernelized SVMs (the usual tool for nonlinear datasets) can only use the dual form. The computational complexity of the primal form of the SVM problem is roughly proportional to the number of training instances m, while the complexity of the dual form grows somewhere between m² and m³. So with millions of instances and only hundreds of features, you should definitely use the primal form, because the dual form would be far too slow to train.
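A minimal sketch, assuming Scikit-Learn (the dataset below is a synthetic stand-in for a large training set): both LinearSVC with dual=False and SGDClassifier with the hinge loss train a linear SVM in a form that scales roughly linearly with the number of instances:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# A stand-in dataset; in practice this would be your millions of instances
X, y = make_classification(n_samples=100_000, n_features=100, random_state=42)

primal_svm = LinearSVC(dual=False)     # solve the primal problem directly
sgd_svm = SGDClassifier(loss="hinge")  # a linear SVM trained by stochastic gradient descent

primal_svm.fit(X, y)
sgd_svm.fit(X, y)
```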

6. Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease γ (gamma)? What about C?

Increasing gamma makes the bell-shaped curve narrower (see the left plot), and as a result, each instance’s range of influence is smaller: the decision boundary ends up being more irregular, wiggling around individual instances. Conversely, a small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence, and the decision boundary ends up smoother.

Figure: The RBF kernel's bell-shaped curve

So, γ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and if it is underfitting (as in this question), you should increase it. The C hyperparameter behaves similarly: a smaller C means stronger regularization, so an underfitting model may also need a larger C.
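To see both knobs in action, here is a small sketch of my own (assuming Scikit-Learn's SVC with the RBF kernel): an underfitting configuration with low gamma and C, and a more flexible one with both increased:

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]            # petal length, petal width
y = (iris.target == 2).astype(int)  # Iris-Virginica vs. the rest

underfit_clf = SVC(kernel="rbf", gamma=0.1, C=0.01)  # wide bells, strong regularization
better_clf = SVC(kernel="rbf", gamma=5.0, C=100.0)   # narrower bells, weaker regularization

for clf in (underfit_clf, better_clf):
    clf.fit(X, y)
    print(clf.score(X, y))  # training accuracy rises as gamma and C increase
```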

Reference:

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly Media, 2019. https://www.knowledgeisle.com/wp-content/uploads/2019/12/2-Aur%C3%A9lien-G%C3%A9ron-Hands-On-Machine-Learning-with-Scikit-Learn-Keras-and-Tensorflow_-Concepts-Tools-and-Techniques-to-Build-Intelligent-Systems-O%E2%80%99Reilly-Media-2019.pdf
