This series of practicals focuses on the development of machine learning algorithms introduced in the Applied Machine Learning course. In this practical, you will implement the K-Nearest Neighbors (KNN) algorithm in Matlab and explore its applications.
Deadline: November 12, 2024 @ 8pm
Assignments must be turned in by the deadline. 1/6 of a point will be deducted for each day late (a day late starts one hour after the deadline).
Procedure: From the course Moodle webpage, you must download and extract the .zip file named TP3-KNN-Assignment.zip. The folder contains the following elements:
Only the elements in blue need to be filled out. The rest are helper functions or data to be imported. The folder evaluation_functions contains encrypted functions that evaluate your code and provide feedback on the results. In this assignment, only the evaluation functions for the knn function in Part 1, i.e. the main algorithm, are provided. For the other functions, you can compare your results with the graphs.
You must submit an archive as a .zip file with the following naming convention: TP3-KNN-SCIPER.zip, where SCIPER needs to be replaced by your own SCIPER number. You must submit only the files/folders in blue. DO NOT INCLUDE the folders plot_functions, data, utils and evaluation_functions.
We take plagiarism very seriously. Submitting work that is not yours will result in a report to the competent EPFL authorities.
KNN (K-Nearest Neighbors) is a nonparametric "lazy" learning algorithm that can be used for classification or regression. In this assignment we will cover its use for classification purposes only. It is one of the simplest classification algorithms in the machine learning literature, generally used as a baseline for more complex algorithms.
KNN is considered nonparametric because it makes no theoretical assumptions about the distribution of the underlying data (e.g. linearly separable, normally distributed, etc.). This makes it applicable to a wide variety of datasets. It is considered a lazy algorithm because it does not learn any parameters or model to generalize from the observed data; rather, it keeps the complete training dataset during testing and uses a distance-based majority-vote mechanism to predict a label for a new sample (i.e. query point).
KNN relies on a mechanism similar to that of DBSCAN, presented in the Applied Machine Learning course, but for supervised learning tasks (i.e. classification). Like DBSCAN, it groups points according to how distant they are from each other, using a minimum number (K) of points. DBSCAN adds a condition on the minimal distance in order to detect outliers and merge clusters; KNN lacks these two additional mechanisms and is hence very sensitive to outliers.
For any type of classification algorithm, one must first split the dataset into training and testing sets. This train/test split is used to reduce overfitting of your classifier: by using the full dataset for training, the model or classifier will tend to overfit to the noise of the observed data rather than to the real, underlying model. The data points in your training set are used to tune hyper-parameters (i.e. choose the best $k$ for KNN) or learn the models of your classifier. The data points in your testing set are considered a separate dataset used solely to evaluate the performance of your learned classifier.
To partition a dataset for classification, one generally selects a validation ratio valid_ratio, where valid_ratio=0.3 corresponds to $30\%$ of your points being used for testing (or validation) and the rest being used for training. One can vary this value to obtain a better estimate of the classifier's performance. In general, validation ratios are tested in the range of $\{30\%, 40\%, 50\%, 60\%, 70\%\}$, depending on how much data you have. Special considerations of train/test sets:
The splitting function has already been implemented during the exercise sessions and, therefore, is given in the utils folder.
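For reference, a typical call might look like the following (the exact signature shown here is an assumption based on how split_data is used later in this handout; check the function header in utils):
valid_ratio = 0.3;  % reserve 30% of the points for testing
[X_train, y_train, X_test, y_test] = split_data(X, y, valid_ratio);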
In classification problems we are given a training dataset $D = \{(\mathbf{x}^1, y^1),(\mathbf{x}^2, y^2),\dots, (\mathbf{x}^M, y^M)\}$ where $\mathbf{x}^i \in \mathbb{R}^{N}$ is the $i$-th $N$-dimensional data point (or feature vector) for $i=1,\dots,M$ and $y^i \in \mathbb{N}$ is the categorical outcome (or class label) corresponding to each data point. Generally, the labels are a sequence of integers ranging from $1$ to $C$, where $C$ is the number of classes (you might also find some datasets labelled from $0$ to $C-1$). The goal is then, given a new sample (or query point) $\mathbf{x}' \in \mathbb{R}^{N}$, to predict its label $\hat{y}' \in \mathbb{N}$\footnote{We use the $\hat{.}$ in $\hat{y}$ to indicate that it is an estimated value and not the true label $y$.}. The K-NN algorithm addresses this problem by applying a very simple rule (for further reading, refer to Chapter 4 of the Pattern Classification book by Duda et al.):
$\mathbf{x}'$ is assigned the label $\hat{y}'$ most frequently represented among its $k$ nearest samples.
Hence, the classification decision for $\mathbf{x}'$ involves the following steps:
In other words, the K-NN algorithm begins with a query point $\mathbf{x}'$ and grows a spherical region until it encloses $k$ training points, regardless of their labels. The query point is then labeled by a majority vote from the enclosed training points.
The KNN classifier relies on a metric or "distance" function $d(\mathbf{x}^i, \mathbf{x}^j)$ between the data points in order to accomplish the first step of the algorithm.
Such a distance function has already been implemented in the compute_distance.m function for the TP2-KMeans-Assignment. Although we have three types of distance implemented in this function, for simplicity, during this assignment we will mostly use the Euclidean distance:
\begin{equation} d(\mathbf{x}^i, \mathbf{x}^j) = \sqrt{\sum_{n=1}^{N} \left(x_n^i - x_n^j\right)^2}, \end{equation}
which should be called as compute_distance(x1, x2, 'L2'). In Part 2 of the assignment you will also implement another metric for a dataset that combines continuous and categorical data. For ease of use, we provide the correct implementation of this function in the utils folder.
Now that we have a distance function, we can start implementing the KNN algorithm. The first step is to compute the pairwise distances between $\mathbf{x}'$ and $D_x$,
\begin{equation} \mathbf{d} = \{d^i = d(\mathbf{x}',\mathbf{x}^i) \quad | \quad \forall i=1,\dots,M \} \end{equation}where $\mathbf{d} \in \mathbb{R}^M$ collects the results of the already-implemented distance function applied to each point of the training dataset.
To extract the k-nearest neighbors, one then chooses the $k$ elements of $\mathbf{d}$ with the smallest distances. This can be done by first sorting $\mathbf{d}$ into $\mathbf{d}_{(s)}$ in ascending order
\begin{equation} \mathbf{d}_{(s)} = \{d_{(s)}^i \quad | \quad d_{(s)}^1 \leq d_{(s)}^2 \leq \dots \leq d_{(s)}^{M} \} \end{equation}and then selecting the subset of $k$ points and labels that are closest to $\mathbf{x}'$:
\begin{equation} y_{knn} = \{y^{I(d_{(s)}^1)}, y^{I(d_{(s)}^2)}, \dots , y^{I(d_{(s)}^k)}\}, \end{equation}where $I(d_{(s)}^i)$ outputs the index in the original dataset $D$ of the selected point.
Once $y_{knn} = \{ y_{knn}^1, \dots, y_{knn}^k \}$ has been retrieved from the previous step, we can estimate the label of $\mathbf{x}'$ by a majority vote from $y_{knn}$:
\begin{equation} \hat{y}' = \underset{c \in \{c_1, \dots, c_C\}}{\arg\max} \; \sum_{i=1}^{k} \mathbb{1}\left(y_{knn}^i = c\right), \end{equation}where $\{ c_1, c_2, \dots, c_C \}$ are the labels associated with each of the $C$ classes and $\mathbb{1}(\cdot)$ is the indicator function (equal to $1$ when its argument is true and $0$ otherwise). Now we will implement these three steps in the knn.m function.
knn.m function (6pts)
This function has the following signature:
function [ y_est ] = knn(X_train, y_train, X_test, params)
where X_train and y_train are, respectively, the features and labels on which to train your model, and X_test is a testing dataset for which you estimate the labels and return them in y_est. The variable params is a structure array that contains the hyper-parameters of the algorithm, such as the type of the distance function d_type or the number of neighbors k. Accessing a field of this structure array is possible via params.the_field, e.g. params.d_type to get the distance type or params.k to get the number of neighbors.
Implementation Hint: Useful functions: compute_distance, sort and unique.
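A minimal sketch of one possible implementation is given below (not the reference solution). It assumes samples are stored column-wise, i.e. X_train of size $N \times M$; adapt the indexing if the assignment uses a different layout.
function [ y_est ] = knn(X_train, y_train, X_test, params)
% Minimal sketch of KNN, assuming column-wise samples:
% X_train is N x M, X_test is N x M_test.
M       = size(X_train, 2);
M_test  = size(X_test, 2);
classes = unique(y_train);
y_est   = zeros(1, M_test);
for i = 1:M_test
    % Step 1: distance from the query point to every training point
    d = zeros(1, M);
    for j = 1:M
        d(j) = compute_distance(X_test(:, i), X_train(:, j), params.d_type);
    end
    % Step 2: sort distances and keep the labels of the k closest points
    [~, idx] = sort(d, 'ascend');
    y_knn = y_train(idx(1:params.k));
    % Step 3: majority vote among the k nearest labels
    votes = zeros(1, numel(classes));
    for c = 1:numel(classes)
        votes(c) = sum(y_knn == classes(c));
    end
    [~, best] = max(votes);
    y_est(i)  = classes(best);
end
end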
To evaluate the performance of our classifier, we can estimate the classification accuracy. Given the true test labels $\mathbf{y}'$ and the estimated test labels $\mathbf{\hat{y}}'$ for a test set $D_{test}$ of $M_{test}$ points the accuracy can be computed as follows,
\begin{equation} acc = \frac{\sum_{i=1}^{M_{test}} \delta_{(y_i',\hat{y}_i')} }{M_{test}} \end{equation}where $\delta_{(y_i',\hat{y}_i')} = \begin{cases}\! 1 & \quad \text{if} \quad y_i'=\hat{y}_i' \\ 0 & \quad \text{otherwise} \\ \end{cases}$. In other words, the accuracy is the fraction of correctly classified data points among all points in $D_{test}$.
accuracy.m function (1pt)
This function has the following signature:
function [acc] = accuracy(y_test, y_est)
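A minimal sketch, directly implementing the formula above:
function [acc] = accuracy(y_test, y_est)
% Fraction of points whose estimated label matches the true label.
acc = sum(y_test(:) == y_est(:)) / numel(y_test);
end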
By running the tp3_knn_part1.m code block, you will be able to visualize the results of the knn and accuracy functions, e.g. for $k=1$:
plot_dataset(X, y, X_test, y_est, valid_ratio, params);
You can modify $k$ to take the values $k = \{1, 5, 15, 35, 61, 75\}$ to see how KNN behaves on the same dataset. Keep in mind that, due to the random data split, you can obtain different accuracy values for the same $k$ and plots that look different. By running the decision-boundary code block, one can visualize the decision boundaries for each output:
params.k = 1;
plot_boundaries(X, y, X_train, y_train, valid_ratio, params)
NOTE: Given the random nature of the split_data function, the following plots do not necessarily have to be the same. In fact, for a high valid_ratio, i.e. valid_ratio $> 0.7$, and a high $k$, i.e. $k > 30$, this will undoubtedly be the case.
params.k = 5;
plot_boundaries(X, y, X_train, y_train, valid_ratio, params)
params.k = 20;
plot_boundaries(X, y, X_train, y_train, valid_ratio, params)
params.k = 60;
plot_boundaries(X, y, X_train, y_train, valid_ratio, params)
Until now, we have not addressed how to select an appropriate $k$ value for our dataset, nor how it influences the classification result. This can be done by analyzing $acc$ over a range of $k$ values, in the same way we analyzed k-means.
knn_eval.m function (2pts)
This function has the following signature:
function [acc_curve] = knn_eval( X_train, y_train, X_test, y_test, k_range, params)
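A minimal sketch, assuming knn and accuracy behave as described above:
function [acc_curve] = knn_eval( X_train, y_train, X_test, y_test, k_range, params)
% Classification accuracy of KNN for each value of k in k_range.
acc_curve = zeros(1, numel(k_range));
for i = 1:numel(k_range)
    params.k     = k_range(i);   % overwrite k for this evaluation
    y_est        = knn(X_train, y_train, X_test, params);
    acc_curve(i) = accuracy(y_test, y_est);
end
end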
The script tp3_knn_part1.m includes a visualization of the accuracy curve for the different values of $k$ in k_range:
plot_accuracy(params.k_range,acc_curve)
As you can see in the plot, increasing the number of neighbors might simply degrade the performance.
In this second part of the assignment we will apply the KNN algorithm to two different datasets. The first contains high-dimensional continuous data: the Breast-Cancer-Wisconsin (Diagnostic) Dataset, which can be found in the UCI Machine Learning Repository; the second, introduced later in this part, combines continuous and categorical data.
The Breast-Cancer-Wisconsin (Diagnostic) Dataset is composed of $M=698$ data points of $N=9$ dimensions, each corresponding to cell nucleus features in the range of $[1,10]$, with data points belonging to two classes $y \in \{\text{benign}, \text{malignant}\}$. In this part, we will apply KNN classification to this dataset and develop evaluation tools to assess the performance of the classifier.
Load the Breast-Cancer-Wisconsin (Diagnostic) Dataset by running the first block of code in tp3_knn_part2.m. By running the second and third blocks you will split the datasets and evaluate the classification accuracy with the function written previously.
plot_accuracy(params.k_range,acc_curve)
For real high-dimensional datasets, such as the Breast-Cancer-Wisconsin (Diagnostic) Dataset, the classification accuracy alone is not sufficient to evaluate a classifier's performance. In these cases, one commonly computes a confusion matrix (or error matrix), a table used to visualize the classification performance of an algorithm. In this table, the rows represent the real classes of the data and the columns the estimated classes. The diagonal contains the correctly classified examples, while the off-diagonal entries indicate confusions. In the case of a binary classifier (i.e. two classes, as in the Breast-Cancer-Wisconsin Dataset), 4 values need to be computed to fill the confusion matrix: one of the labels is considered positive and the other negative.
confusion_matrix.m function (2pts)
This function has the following signature:
function [C] = confusion_matrix(y_test, y_est)
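A minimal sketch, assuming integer labels $1,\dots,C$ as used in this assignment:
function [C] = confusion_matrix(y_test, y_est)
% Rows: true classes; columns: estimated classes.
classes = unique(y_test);
n = numel(classes);
C = zeros(n, n);
for i = 1:n
    for j = 1:n
        C(i, j) = sum(y_test(:) == classes(i) & y_est(:) == classes(j));
    end
end
end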
Plotting this confusion matrix provides the following results, where the target class is the true class, and the output class is the estimated class:
plotConfMat(C);
Based on the values of the confusion matrix, one can build a graphical representation of the classifier's performance: the Receiver Operating Characteristic, or ROC curve. The ROC curve plots the fraction of true positives against the fraction of false positives, each taken over the total number of samples of the corresponding class (positive, negative) in the dataset. Each point on the curve corresponds to a different value of a classifier parameter (e.g. $k$).
To compute such a curve, one needs the TPR (True Positive Rate), otherwise known as sensitivity or recall, and the FPR (False Positive Rate). These are computed as follows:
\begin{equation} TPR = \frac{TP}{P} = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{N} = \frac{FP}{FP + TN}, \end{equation}where $P$ are all positive samples ($\mathbf{y} = 1$) and $N$ are all negative samples ($\mathbf{y} = 2$).
knn_ROC.m function (2pts)
This function has the following signature:
function [ TP_rate, FP_rate ] = knn_ROC( X_train, y_train, X_test, y_test, params )
It shall compute the values of the ROC curve of K-NN for a range of $k$ values. It must return two $(1 \times K)$ vectors: the first is the TPR of the classifier for each value of $k$, the second the FPR. The columns correspond to the $k$ values.
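A minimal sketch, assuming class 1 is the positive class and class 2 the negative one (as stated above), and that the $k$ values to test are stored in params.k_range:
function [ TP_rate, FP_rate ] = knn_ROC( X_train, y_train, X_test, y_test, params )
% TPR and FPR of the binary KNN classifier, one entry per k in params.k_range.
K       = numel(params.k_range);
TP_rate = zeros(1, K);
FP_rate = zeros(1, K);
for i = 1:K
    params.k = params.k_range(i);
    y_est    = knn(X_train, y_train, X_test, params);
    C        = confusion_matrix(y_test, y_est);
    TP_rate(i) = C(1, 1) / (C(1, 1) + C(1, 2));  % TP / (TP + FN) = TP / P
    FP_rate(i) = C(2, 1) / (C(2, 1) + C(2, 2));  % FP / (FP + TN) = FP / N
end
end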
plot_roc(FP_rate, TP_rate, params)
One can observe that the obtained ROC curve is very sensitive to the training set extracted by the split_data function. This observation is particularly true for KNN, because each example of the training set is directly used for classification. To assess the performance of the algorithm in a more robust way, one solution is to use cross-validation.
"Cross-validation refers to the practice of confirming an experimental finding by repeating the experiment using an independent assay technique" (Wikipedia). In Machine Learning, this consists of splitting the dataset several times at random into training and validation sets. The number of repetitions is referred to as the number of folds $F$. When one proceeds to 10 such random draws of training and testing sets, one says that one performs 10-fold cross-validation.
Here, we will use the function split_data to split the data $F$ times at random between train and test while keeping the same valid_ratio. By averaging the resulting ROC curve across $F$, we will obtain a more robust assessment of the performance of our algorithm.
We will also compute the standard deviation $\sigma$ across the cross-validation runs. The standard deviation shows how much the computed averages can vary (note that for a normal distribution, $68.2\%$ of the probability mass lies within $\pm \sigma$ of the mean).
cross_validation.m function (4pts)
This function has the following signature:
function [avgTP, avgFP, stdTP, stdFP] = cross_validation(X, y, F_fold, valid_ratio, params)
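A minimal sketch, assuming the split_data signature shown earlier and knn_ROC as sketched above:
function [avgTP, avgFP, stdTP, stdFP] = cross_validation(X, y, F_fold, valid_ratio, params)
% Average and standard deviation of the ROC values over F_fold random splits.
K  = numel(params.k_range);
TP = zeros(F_fold, K);
FP = zeros(F_fold, K);
for f = 1:F_fold
    % Draw a new random train/test split at every fold (split_data signature assumed)
    [X_train, y_train, X_test, y_test] = split_data(X, y, valid_ratio);
    [TP(f, :), FP(f, :)] = knn_ROC(X_train, y_train, X_test, y_test, params);
end
avgTP = mean(TP, 1);
avgFP = mean(FP, 1);
stdTP = std(TP, 0, 1);
stdFP = std(FP, 0, 1);
end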
Plotting the average and errorbars gives the following representation:
plot_cross_validation(avgTP, avgFP, stdTP, stdFP, params)