## Description

## 1 Written Problems (50 points)

1. Exploring Asymmetric Regularization in 2D Logistic Regression (Inspired by Exercise 8.7 from

Murphy’s book) (15 points)

(1) Consider the data in Figure 1, where we fit the model

p(y = 1 | x, w) = σ (2w0 + w1x1 + w2x2).

Suppose we fit the model by maximum likelihood, i.e.,

J(w) = −ℓ (w, Dtrain ),

1

Figure 1: Data for logistic regression question

where ℓ (w, Dtrain ) is the log likelihood on the training set. Sketch a possible decision boundary

corresponding to wˆ . (Copy the figure first (a rough sketch is enough), and then superimpose your

answer on your copy, since you will need multiple versions of this figure). Is your answer (decision

boundary) unique? How many classification errors does your method make on the training set?

(2) Now suppose we regularize only the w0 parameter, i.e., we minimize

J0(w) = −ℓ (w, Dtrain ) + λw2

0

.

Suppose λ is a very large number, so we regularize w0 all the way to 0 , but all other parameters

are unregularized. Sketch a possible decision boundary. How many classification errors does your

method make on the training set? Hint: consider the behavior of simple linear regression, 2w0 +

w1x1 + w2x2 when x1 = x2 = 0.

(3) Now suppose we heavily regularize only the w1 parameter, i.e., we minimize

J1(w) = −ℓ (w, Dtrain ) + λ|w1|

Sketch a possible decision boundary. How many classification errors does your method make on the

training set?

(4) Now suppose we heavily regularize only the w2 parameter. Sketch a possible decision boundary. How many classification errors does your method make on the training set?

##
2. Manual Construction of a Non-Linear SVM Classifier (Inspired by Exercise 14.1 in Murphy’s

book) (15 points)

Consider a dataset with two points in 1D: (x1 = −1, y1 = −1) and (x2 = 2, y2 = 1). We will

transform each point into 3D using the mapping ϕ(x) =

1, x, x2

⊤

. This is analogous to employing

a quadratic kernel. The optimal margin classifier is described by:

min ∥w∥

2

subject to

y1

wT ϕ (x1) + w0

≥ 1

y2

wT ϕ (x2) + w0

≥ 1.

(1) Propose a vector that could be perpendicular to the optimal orientation of w in the transformed space.

(2) Calculate the margin accomplished with your suggested w. Remember, the margin reflects

the distance of each support vector to the decision boundary. For guidance, consider the spatial

relationship between two points and a plane dividing them.

(3) Deduce the exact w by leveraging the principle that the margin equates to 1/∥w∥.

## 3. Exploring Advanced Concepts in SVM Classification (10 points)

Given a data set:

Class – 1 :

(1 0)

(0 1)

(−1 0)

(0 −1)

Class + 1 :

(2 0)

(0 2)

(−2 0)

(0 −2)

2

(1) Can you find a svm linear classifier for this data set? If not, how to find a nonlinear classifier?

(2) Explain the concept of the RBF kernel and how it allows for non-linear separation in the

feature space. Discuss its advantages over linear SVM, especially in relation to this specific dataset.

## 4. Analyzing Margin Width in Relation to Dual Variables in SVM (10 points)

In SVM, the margin width γ plays a crucial role in understanding the geometry of the decision

boundary. Given the dual formulation of the SVM optimization problem, we seek to establish a

direct relationship between γ and the Lagrange multipliers {αn}.

Consider a different perspective on the SVM’s optimization, focusing on an alternate representation of the decision function involving a transformation ϕ applied to the input vectors x, such

that the dual problem now reads:

max

α

X

N

n=1

αn −

1

2

X

N

n=1

X

N

m=1

αnαmynymϕ (xn)

⊤

ϕ (xm)

s.t. X

N

n=1

αnyn = 0

αn ≥ 0 ∀n = 1, 2, . . . , N

Your task involves two primary objectives:

(1) Derive an expression for γ in terms of {αn}, assuming that the transformation ϕ influences

3

the inner product between feature vectors. Substantiate your derivation by exploring how ϕ affects

the geometry of the decision boundary.

(2) Prove that the original relationship, 1

γ

2 =

PN

n=1 αn, holds true even under the transformation

ϕ. In your proof, emphasize any new mathematical insights or properties that emerge due to the

involvement of ϕ.

Hint: You might need to revisit the concept of kernels and how they implicitly apply transformations on input vectors, influencing the SVM’s decision function.

## 2 Programming (50 points + 12 points)

2.1 Logistic Regression Implementation: Gradient Descent (8 points)

Task Description In this task, your goal is to complete the gradient descent part of the logistic

regression implementation. The provided code and instructions will be included in a Jupyter notebook. Follow the steps outlined in the notebook and the knowledge covered in the class to complete

the task. Ensure that your code runs correctly, and can visualize the decision boundary using the

provided plotting code.

Requirements Submit the modified Jupyter notebook containing your implemented code. Verify

that the code runs without errors, and plots the decision boundary with the provided code.

## 2.2 Support Vector Machines Implementations (42 points)+ 12 points)

Task Description In this task, your goal is to conduct experiments on SVMs with different kernel

functions and slack variables.

Datasets: You are provided with the training and testing datasets, comprising 120 training data

and 30 testing data. These datasets are derived from the Iris dataset (https://archive.ics.

uci.edu/ml/datasets/iris), which contains three classes (setosa, versicolor, and virginica) of 50

instances each, with each class representing a type of iris plant. Your task is to classify each iris

plant into one of the three possible types.

What you should do: Utilize the SVM function from packages like the scikit-learn (sklearn)

package, that offer various forms of SVM functions. It is recommended to use the sklearn.svm.SVC()1

function.

1

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

4

Instructions This task involves training a SVM classifier for each iris type, that is, 3 SVMs for

the 3 distinct iris types. This note is to ensure clarity and prevents confusion with the concept

of multi-class classification. One method for handling the data is provided to facilitate your

entry into the SVM training pipeline (Note that this is not a mandatory format to follow; it is just

an approach provided for students who are not familiar with data processing. You are free to use

any form to write your own pipeline to complete the task):

from sklearn.svm import SVC

y_train=pd.read_csv(’y_train.csv’)[’target’]

y_test=pd.read_csv(’y_test.csv’)[’target’]

# Assume you want to train a SVM for the “setosa” class, that is, label ’0’

class_label = 0

y_train = y_train.apply(lambda x: 1 if x == class_label else -1)

y_test = y_train.apply(lambda x: 1 if x == class_label else -1)

# SVM training code ..

…

(12 Points Bonus: Implementing SVM from Scratch) This bonus challenge encourages

students to delve into the details of SVM implementation, fostering a deeper understanding of the

underlying algorithms and optimization processes. For those students who choose to implement the

SVM functionality without relying on external packages such as scikit-learn, an additional bonus

of 12 points will be awarded. Note that the total score, including the bonus, should not exceed

the maximum score of this assignment only. One recommended approach for solving the SVM dual

solution is the Sequential Minimal Optimization (SMO) method2

. However, you are also free to

explore other generic quadratic programming solvers to achieve this.

2.2.1 (10 points + 3 points)

Calculation using Standard SVM Model (Linear Kernel): Employ the standard SVM model

with a linear kernel. Train your SVM on the provided training dataset and validate it on the testing

dataset. Calculate the classification error for both the training and testing datasets, output the

weight vector w, the bias b, and the indices of support vectors (start with 0).

Note that the scikit-learn package does not offer a function with hard margin, so we will simulate

this using C = 1e5. Print out the results for each different class separately. The output format is

as follows:

Q2.2.1 Calculation using Standard SVM Model:

2

https://pages.cs.wisc.edu/~dpage/cs760/SMOlecture.pdf

5

DDA3020 Machine Learning Autumn 2023, CUHKSZ

setosa training error: xx, testing error: xx,

w_of_setosa: xx, b_of_setosa: xx,

support_vector_indices_of_setosa: xx,

versicolor training error: xx, testing error: xx,

w_of_versicolor: xx, b_of_versicolor: xx,

support_vector_indices_of_versicolor: xx,

virginica training error: xx, testing error: xx,

w_of_virginica: xx, b_of_virginica: xx,

support_vector_indices_of_virginica: xx,

2.2.2 (12 points + 3 points)

Calculate using SVM with Slack Variables (Linear Kernel) For each C = 0.2 × t, where

t = 1, 2, . . . , 5, train your SVM on the provided training dataset, and subsequently validate it on

the testing dataset. Calculate the classification error for both the training and testing datasets,

the weight vector w, the bias b, the indices of support vectors, and the slack variable ξ of support

vectors (you may compute it as max (0, 1 − y · f (X)) . The output format is as follows:

Q2.2.2 Calculate using SVM with Slack Variables (C = 0.2 × t, where t = 1, 2, …, 5):

C: 0.2,

setosa training error: xx, testing error: xx,

w_of_setosa: xx, b_of_setosa: xx,

support_vector_indices_of_setosa: xx,

slack_variable_of_setosa: xx,

versicolor training error: xx, testing error: xx,

w_of_versicolor: xx, b_of_versicolor: xx,

support_vector_indices_of_versicolor: xx,

slack_variable_of_versicolor: xx,

virginica training error: xx, testing error: xx,

w_of_virginica: xx, b_of_virginica: xx,

support_vector_indices_of_virginica: xx,

slack_variable_of_virginica: xx,

—————————————–

C: 0.4 ..

Replace xx with the actual value in your output.

6

2.2.3 (20 points + 6 points)

Calculate using SVM with Kernel Functions and Slack Variables: Conduct experiments

with different kernel functions for SVM, and set C = 1 for all cases. Calculate the classification

error for both the training and testing datasets, the indices of support vectors, and the slack variable

ξ of support vectors for each kernel type:

(a) 2nd-order Polynomial Kernel

(b) 3rd-order Polynomial Kernel

(c) Radial Basis Function Kernel with σ = 1

(d) Sigmoidal Kernel with σ = 1

The output format for each file is as follows:

Q2.2.3 Calculate using SVM with Kernel Functions and Slack Variables:

(a)2nd-order Polynomial Kernel:

setosa training error: xx, testing error: xx,

support_vector_indices_of_setosa: xx,

slack_variable_of_setosa: xx,

versicolor training error: xx, testing error: xx,

support_vector_indices_of_versicolor: xx,

slack_variable_of_versicolor: xx,

virginica training error: xx, testing error: xx,

support_vector_indices_of_virginica: xx,

slack_variable_of_virginica: xx,

—————————————–

(b) 3rd-order Polynomial Kernel:

<…results for (b)…>

—————————————–

(c) Radial Basis Function Kernel with \(\sigma = 1\):

<…results for (c)…>

—————————————–

(d) Sigmoidal Kernel with \(\sigma = 1\):

<…results for (d)…>

Replace xx with the actual value in your output.

Submission Instructions:

For Q2.1:

7

1. Submit your executable A2 yourID Q2.1.ipynb Jupyter notebook.

For Q2.2, you can submit your code in two forms:

1. Place your executable code in a A2 yourID Q2.2.ipynb Jupyter notebook and submit it.

Indicate the corresponding question number in the comment for each cell, and ensure that

your code can logically produce the required results for each question.

2. Submit your executable .py file. In this submission format, you can choose to place all three

sub-questions in one Python file named A2 yourID Q2.py; or you can choose to separate them

into A2 yourID Q2.2.1.py, A2 yourID Q2.2.2.py, and A2 yourID Q2.2.3.py.

Please note that regardless of the method chosen, you need to write clear comments and use

appropriate function/variable names to indicate their corresponding question numbers (Especially

if you decide to submit all 3 questions of Q2.2 in one file). Please note that excessively unreadable

code may result in point deductions.

8