This series of practicals focuses on the implementation of the machine learning algorithms introduced in the Applied Machine Learning course. In this practical, you will code the K-Means algorithm in Matlab. You will then evaluate the algorithm's performance on a high-dimensional dataset to find optimal parameters, and build an application that uses it as a recommendation system.
Deadline: October 22, 2024 @ 8pm
Assignments must be turned in by the deadline. 1/6 point will be removed for each day late (a day late starts one hour after the deadline).
Procedure: From the course Moodle webpage, you must download and extract the .zip file named TP2-KMEANS-Assignment.zip. The folder contains the following elements:
Only the elements in blue need to be filled out. The rest are helper functions or data to be imported. The folder evaluation_functions contains encrypted functions that evaluate the code you write and provide feedback on the results for Part 1 and Part 2 of the assignment. The evaluation functions are ONLY provided for functions that implement the theory behind the algorithms. There are no evaluation functions for the kmeans_eval and f1measure_eval functions in Part 2, and none for Part 3.
You must submit an archive as a .zip file with the following naming convention: TP2-KMEANS-SCIPER.zip, where SCIPER must be replaced by your own SCIPER number. You must submit only the files/folders in blue. It is unnecessary to include the folders plot_functions, data and evaluation_functions.
We take plagiarism very seriously. Submitting work that is not yours will result in a report to the competent EPFL authorities.
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. The k-means algorithm is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. In other words, given a dataset $\mathbf{X} = \{\mathbf{x}^1,\dots,\mathbf{x}^M\}$ where $\mathbf{x}^i \in \mathbb{R}^N$, with $M$ samples of $N$ dimensions, we can describe it with $K$ clusters, each represented by a centroid $\mu^k \in \mathbb{R}^N$, by minimizing the total squared distance between each point and its closest centroid with the following cost function,
\begin{equation} \label{eq:J} J(\mu^1, \dots, \mu^K) = \sum\limits_{k=1}^{K} \sum\limits_{\mathbf{x}^i \in c_k}\|\mathbf{x}^i - \mathbf{\mu}^k\|^2 \end{equation}where $c_k$ denotes the set of points assigned to the $k$-th cluster. This is an NP-hard problem; however, the k-means algorithm provides a local search solution through an iterative procedure proposed by Stuart P. Lloyd in the 1980s.
For a deeper description of K-Means and its applications, refer to the Clustering slides from the Applied Machine Learning Course. In this assignment, we will follow the steps of the k-Means algorithm shown on slide 25 of the class. Given a fixed number of $K$ clusters we will go through the following steps:
1. Initialize the $K$ centroids $\mu^k$.
2. Compute the distance from each data point to each centroid.
3. Assign each data point to its closest centroid.
4. Update each centroid as the mean of the data points assigned to it.
5. Repeat steps 2-4 until the centroids are stable within a tolerance thres=1e-4 or the maximum number of iterations (MaxIter) is reached.

In the k-means algorithm, the initialization of the centroids $\mu^k$ is one of the most important steps: if the centroids are initialized far away from the expected cluster centers, the algorithm might not reach the expected solution. Hence, several methods have been proposed to initialize the centroids. In this assignment, you will implement two initialization methods: sampling $K$ points from the dataset ("sample") and drawing $K$ points uniformly at random within the range of the data ("range").
kmeans_init.m function (4pts)

The function has the following signature:
function [Mu] = kmeans_init(X, k, init)
The input variable init tells which initialization to perform. Its value is a string that can be set to either "sample" or "range". The function should return the initialized centroids in Mu according to the value provided in init.
Useful functions randsample, datasample, min, max, range, rand
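For illustration, a minimal sketch of these two initializations (not necessarily the expected solution) is given below; it assumes X is an N x M matrix with one sample per column and Mu is returned as an N x k matrix with one centroid per column:

function [Mu] = kmeans_init(X, k, init)
% Sketch of the two initialization schemes (assumes X is N x M, Mu is N x k).
[N, M] = size(X);
switch init
    case 'sample'
        % Pick k distinct data points at random as the initial centroids.
        Mu = X(:, randsample(M, k));
    case 'range'
        % Draw k points uniformly at random within the range of the data.
        Xmin = min(X, [], 2);
        Xmax = max(X, [], 2);
        Mu = Xmin + rand(N, k) .* (Xmax - Xmin);   % implicit expansion over columns
end
end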
Many clustering algorithms (specifically k-means) rely on a metric or "distance" function between the data points. While one typically uses the Euclidean distance as a metric in $\mathbb{R}^N$, it actually comes from a more general class of metrics, i.e. the Minkowski metric:
\begin{equation*} L_p(\mathbf{x},\mathbf{x}') = \left( \sum_{i=1}^{N} |x_i - x_i'|^p \right)^{1/p}, \end{equation*}otherwise known as the $L_p$ norm. From this norm, we can recover three important distances used in machine learning applications. For example, when $p=1$ we recover the $L_1$ norm, also known as the Manhattan distance:
\begin{equation} \label{eq:l1} d_{L_1}(\mathbf{x},\mathbf{x}') = ||\mathbf{x} - \mathbf{x}'||_1 = \sum_{i=1}^{N} |x_i - x_i'|, \end{equation}this distance gives the shortest path between $\mathbf{x}$ and $\mathbf{x}'$ on a grid aligned to the data's coordinate system, also called the grid or rectilinear distance. When $p=2$ we get the $L_2$ norm otherwise known as the Euclidean distance, the most common metric in literature for $\mathbb{R}^N$:
\begin{equation} \label{eq:l2} d_{L_2}(\mathbf{x},\mathbf{x}') = ||\mathbf{x} - \mathbf{x}'||_2 = \sqrt{\sum_{i=1}^{N} |x_i - x_i'|^2}. \end{equation}The Euclidean distance measures the distance between two points as the length of a straight line between them. Finally, $p=\infty$ yields the $L_{\infty}$ norm:
\begin{equation} \label{eq:linf} d_{L_\infty}(\mathbf{x},\mathbf{x}') = ||\mathbf{x} - \mathbf{x}'||_{\infty} = \underset{i}{\text{max}}|x_i - x_i'|, \end{equation}which measures the maximum absolute difference across dimensions.
compute_distances.m function (3pts)

The function has the following signature:
function [d] = compute_distance(x_1, x_2, type)
It should implement the $L_1$, $L_2$ and $L_{\infty}$ distance computations described previously. The function should compute one of these distances depending on the type variable provided, which is a string that can take the values type = "L1", "L2", or "LInf".
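A minimal sketch of this function, assuming x_1 and x_2 are N x 1 column vectors:

function [d] = compute_distance(x_1, x_2, type)
% Sketch of the three Minkowski-based distances.
diff = x_1 - x_2;
switch type
    case 'L1'
        d = sum(abs(diff));       % Manhattan distance
    case 'L2'
        d = sqrt(sum(diff.^2));   % Euclidean distance
    case 'LInf'
        d = max(abs(diff));       % maximum (Chebyshev) distance
end
end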
distance_to_centroids.m function (2pts)

The function has the following signature:
function [d] = distance_to_centroids(X, Mu, type)
It should compute a matrix $\mathbf{d} \in \mathbb{R}^{K \times M}$ of distances from all $M$ points in $\mathbf{X}$ to the $K$ centroids $\mu = \{\mu^1, \ldots, \mu^K\}$ using the compute_distance function.
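A possible sketch, assuming X is N x M (one sample per column) and Mu is N x K (one centroid per column), so that d(k, i) is the distance from $\mathbf{x}^i$ to $\mu^k$:

function [d] = distance_to_centroids(X, Mu, type)
% Sketch: distances from the M samples to the K centroids.
M = size(X, 2);
K = size(Mu, 2);
d = zeros(K, M);
for k = 1:K
    for i = 1:M
        d(k, i) = compute_distance(X(:, i), Mu(:, k), type);
    end
end
end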
After calculating the distances from each point of $\mathbf{X}=\{\mathbf{x}^1, \dots, \mathbf{x}^M\}$ to each centroid of $\mathbf{\mu}=\{\mu^1,\dots,\mu^K\}$, we should assign points to their closest cluster centroids. Algorithmically, we do this by using the concept of responsibility; i.e. the cluster centroid $\mu^k$ closest to a data point $\mathbf{x}^i$ is responsible for that data point. The closest centroid can be found as follows,
\begin{equation} k_i = \underset{k}{\mathrm{argmin}} \; \{d(\mathbf{x}^i, \mathbf{\mu}^k) \} \end{equation}where $k_i \in \{1,\dots,K\}$ is the id or label of the closest cluster. The responsibility of the $k$-th cluster for data point $\mathbf{x}^i$ can then be computed as follows,
\begin{equation} r_i^k = \begin{cases} 1 & \quad \text{if} \quad k_i = k\\ 0 & \quad \text{otherwise} \end{cases} \end{equation}If a tie happens (i.e. two centroids are equidistant from a data point), the data point is assigned to the tied centroid whose cluster currently has the fewest points.
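As a rough sketch (the tie-breaking rule above is not handled here), the assignment step could be written as follows, using the K x M distance matrix returned by distance_to_centroids:

% Assignment (expectation) step: each point is given to its closest centroid.
d = distance_to_centroids(X, Mu, type);            % K x M distances
[~, labels] = min(d, [], 1);                       % labels(i) = k_i, index of the closest centroid
% Responsibilities as a K x M indicator matrix: r(k, i) = 1 if labels(i) == k.
r = zeros(size(d));
r(sub2ind(size(r), labels, 1:numel(labels))) = 1;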
We must then adjust the centroid of each cluster $c_k$ to be the mean of all data points assigned to it. This is called the maximization step and is computed as follows:
\begin{equation} \mathbf{\mu}^k = \frac{\sum\limits_{i} r_i^k \mathbf{x}^i}{\sum\limits_{i} r_i^k} \end{equation}We then go back to step 2 and repeat the process until the clusters are stable (i.e. there is no change between $\mu^{(t)}$ and $\mu^{(t-1)}$, where $t$ is the iteration, for a number of iterations given by the variable MaxTolIter) or the maximum number of iterations MaxIter is reached.
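For illustration, a minimal sketch of this update, assuming X is N x M and r is the K x M responsibility matrix from the assignment step:

% Maximization step: each centroid becomes the mean of its assigned points.
for k = 1:K
    nk = sum(r(k, :));                  % number of points assigned to cluster k
    if nk > 0
        Mu(:, k) = (X * r(k, :)') / nk;
    end
    % An empty cluster (nk == 0) is handled separately in the kmeans function.
end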
check_convergence.m function (1pts)

The function has the following signature:
function [has_converged, tol_iter] = check_convergence(Mu,Mu_previous,iter,tol_iter,MaxIter,MaxTolIter,tolerance)
It should return true in has_converged if either one of the two following conditions is met:
- iter has passed the maximum number of iterations MaxIter
- tol_iter, which counts the iterations for which $\mu^{(t)} \approx \mu^{(t-1)}$, has passed the maximum number of iterations MaxTolIter

If there is no convergence but $\mu^{(t)} \approx \mu^{(t-1)}$ is verified, then tol_iter should be incremented. tolerance is used to check the approximate equality between $\mu^{(t)}$ and $\mu^{(t-1)}$, as strict equality should never be used due to numerical approximations.
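A hedged sketch of such a check is given below; whether tol_iter should be reset when the centroids move again is a design choice left open here.

function [has_converged, tol_iter] = check_convergence(Mu, Mu_previous, iter, tol_iter, MaxIter, MaxTolIter, tolerance)
% Sketch: stop when the iteration budget is exhausted or when the centroids
% have stayed approximately unchanged for MaxTolIter iterations.
if max(abs(Mu(:) - Mu_previous(:))) < tolerance
    tol_iter = tol_iter + 1;          % centroids approximately unchanged this iteration
end
has_converged = (iter >= MaxIter) || (tol_iter >= MaxTolIter);
end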
kmeans.m function (4pts)

The function has the following signature:
function [labels, Mu, Mu_init, iter] = kmeans(X, K, init, type, MaxIter, plot_iter)
plot_iter is a plotting boolean which, if set to true, plots the initial centroids and the first and final iterations of K-Means for a 2D dataset. This is given to you in the function template. You should only fill in the required k-means steps. The function should return, in labels, the index of the cluster each point of the dataset is associated with.
In the main loop of the algorithm you should consider the possible case where a cluster is empty, i.e. no points have been assigned to it. To handle this, you should reassign clusters randomly using the kmeans_init function and reset the number of iterations iter until the situation is resolved.
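One possible way to guard against empty clusters inside the main loop is sketched below; the exact restart policy is up to you:

% After the assignment step, check whether some cluster received no points.
for k = 1:K
    if ~any(labels == k)                 % cluster k is empty
        Mu = kmeans_init(X, K, init);    % re-initialize the centroids at random
        iter = 0;                        % reset the iteration counter
        break;
    end
end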
Due to the random nature of the algorithm, the evaluation function evaluate_kmeans compares your implementation with the solution multiple times, with different random initializations. Therefore, the printed score might fluctuate and should not be considered as absolute truth.
When you run the script tp2_kmeans_part1.m, it loads a 2D dataset generated by sampling Gaussians:
plot_2d_data(X, gmm)
Running the kmeans function with the $L_1$ distance and the range initialization yields the following results:
K = 4; init='range'; type='L1'; MaxIter = 100; plot_iter = 1;
[labels, Mu, ~] = kmeans(X, K, init, type, MaxIter, plot_iter);
Note that due to the random nature of the initialization you might obtain different plots. After that we can plot the boundaries resulting from the clustering:
plot_boundaries(X, Mu, labels, K, type)
Now that you have correctly implemented the K-Means algorithm, you can apply it to a larger dataset to find the optimal number of clusters based on clustering metrics. You will also create an unsupervised recommendation system in Part 3 of the assignment.
In this part of the assignment, you will define some metrics that will help you choose the optimal number of clusters. You will then apply them to a high-dimensional dataset.
How do we determine the best clustering? For this, we can use a set of metrics to evaluate clustering performance. Below we provide the equations of the three clustering metrics that you will implement for k-means. For more explanations of the metrics, refer to the slides on clustering evaluation metrics from the Applied Machine Learning course.
where $B=(K*N)$ for $K$ clusters and $N$ dimensions.
where $B$ is as before and $M$ is the total number of datapoints.
compute_metrics.m function (3pts)

The function has the following signature:
function [RSS, AIC, BIC] = compute_metrics(X, labels, Mu)
To choose the optimal $K$ for your dataset, one can run K-Means with a monotonically increasing number of clusters and plot the results. Since the output of K-Means is not deterministic, we should run K-Means repeats times (generally 10) and take the mean or the best of those runs as the result; in this implementation we will take the mean.
For the $RSS$ metric, the elbow or plateau method can be reliable for certain datasets. This method finds the optimal $K$ at the elbow of the curve, i.e. when the values stop decreasing rapidly. For the $AIC$ and $BIC$ metrics, the optimal $K$ is found as the minimum value in the curve.
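Once the metric curves are available (e.g. as returned by the kmeans_eval function described next), reading off the optimal K could look like the following sketch; the variable names are only illustrative:

[~, idx] = min(AIC_curve);
K_opt_AIC = K_range(idx);    % K minimizing the AIC
[~, idx] = min(BIC_curve);
K_opt_BIC = K_range(idx);    % K minimizing the BIC
% For the RSS curve, the elbow is usually picked by visual inspection.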
kmeans_eval.m function (3pts)

The function has the following signature:
function [RSS_curve, AIC_curve, BIC_curve] = kmeans_eval(X, K_range, repeats, init, type, MaxIter)
Remember to run the K-Means call repeats times, store the values of the metrics for each run and return the mean value of each metric over the number of repetitions.
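A minimal sketch of this evaluation loop, under the assumption that kmeans and compute_metrics behave as described above:

function [RSS_curve, AIC_curve, BIC_curve] = kmeans_eval(X, K_range, repeats, init, type, MaxIter)
% Sketch: average each metric over `repeats` runs for every K in K_range.
RSS_curve = zeros(1, length(K_range));
AIC_curve = zeros(1, length(K_range));
BIC_curve = zeros(1, length(K_range));
for i = 1:length(K_range)
    rss = zeros(1, repeats); aic = zeros(1, repeats); bic = zeros(1, repeats);
    for r = 1:repeats
        [labels, Mu, ~, ~] = kmeans(X, K_range(i), init, type, MaxIter, false);
        [rss(r), aic(r), bic(r)] = compute_metrics(X, labels, Mu);
    end
    RSS_curve(i) = mean(rss);
    AIC_curve(i) = mean(aic);
    BIC_curve(i) = mean(bic);
end
end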
In this part we will use K-Means to cluster a real, high dimensional dataset. We will use the Digits dataset which has $K=10$ clusters and is in a 64-dimensional space. Further, we will find the F1-measure on the clustered data to evaluate how well it is clustered.
The digits dataset is loaded in the first block of the script, tp2_kmeans_part2.m. We can change the number of classes to be loaded from the data by varying true_K $\in [1,10]$. A subset of the individual samples of the handwritten data can be visualized by running the second block of code in the script.
plot_images(X);
It can also be visualized as a scatter plot where the first eight dimensions are displayed.
plot_dimensions(X, labels, [1,2,3,4,5,6,7,8]);
By using the kmeans_eval function created previously we can estimate what would be the optimal number of clusters for this dataset:
plot_eval_curves(RSS_curve, AIC_curve, BIC_curve);
As we can see, it is hard to identify the optimal $K$ for this dataset. Moreover, due to its high dimensionality, the result of a single K-Means run cannot be visualized. Fortunately, one can use a dimensionality reduction technique such as PCA and then cluster in a lower-dimensional space.
An advantage of PCA is that it not only projects the dataset to a lower-dimensional space, but, since it projects the dataset onto a new basis (guided by maximum variance), it can also make the dataset more linearly separable and hence easier to cluster! By running the third code block of tp2_kmeans_part2.m, we apply PCA to the digits dataset and project it to a 4-D sub-space. We then evaluate K-Means on this lower-dimensional space.
plot_projection(Yproj, labels)
As can be seen, the clustering metrics on the lower-dimensional space seem to yield a more decisive optimal $K$.
plot_eval_curves(RSS_curve, AIC_curve, BIC_curve);
When a subset of labels is available for the dataset we are clustering, we can apply an external metric to evaluate the results of our clustering algorithm. The external metric we will use in this practical is the F1-measure. For more details, refer to the slides on evaluation of clustering methods from the Applied Machine Learning course.
The F1-measure can be estimated by using the following equations:
\begin{equation} F_1(C,K) = \sum_{c_i \in C} \frac{|c_i|}{M} \underset{k}{\text{max}} \{F_1(c_i,k)\} \end{equation}\begin{equation} F_1(c_i,k) = \frac{2R(c_i,k) P(c_i,k)}{R(c_i,k) + P(c_i,k)} \end{equation}\begin{equation} R(c_i,k) = \frac{n_{ik}}{|c_i|} \end{equation}\begin{equation} P(c_i,k) = \frac{n_{ik}}{|k|} \end{equation}where:
- $M$ = number of labeled datapoints
- $C$ = set of classes
- $K$ = number of clusters
- $n_{ik}$ = number of members of class $c_i$ that belong to cluster $k$
f1measure.m function (4pts)

The function has the following signature:
function [F1_overall, P, R, F1] = f1measure(cluster_labels, class_labels)
From the clustered labels and the ground truth labels, given in cluster_labels and class_labels respectively, it should return the four components of the F1-measure as calculated with the above equations.
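A possible sketch of this computation, assuming cluster_labels and class_labels are vectors of the same length and that P, R and F1 are returned as classes-by-clusters matrices:

function [F1_overall, P, R, F1] = f1measure(cluster_labels, class_labels)
% Sketch of the F1-measure for clustering.
class_labels = class_labels(:); cluster_labels = cluster_labels(:);
classes = unique(class_labels); clusters = unique(cluster_labels);
M = length(class_labels);
P = zeros(length(classes), length(clusters));
R = zeros(size(P)); F1 = zeros(size(P));
for i = 1:length(classes)
    for k = 1:length(clusters)
        n_ik = sum(class_labels == classes(i) & cluster_labels == clusters(k));
        R(i, k) = n_ik / sum(class_labels == classes(i));      % recall
        P(i, k) = n_ik / sum(cluster_labels == clusters(k));   % precision
        if R(i, k) + P(i, k) > 0
            F1(i, k) = 2 * R(i, k) * P(i, k) / (R(i, k) + P(i, k));
        end
    end
end
F1_overall = 0;
for i = 1:length(classes)
    F1_overall = F1_overall + sum(class_labels == classes(i)) / M * max(F1(i, :));
end
end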
f1measure_eval.m function (1pts)

The function has the following signature:
function [F1_curve] = f1measure_eval(X, K_range, repeats, init, type, MaxIter, true_labels)
It works similarly to the kmeans_eval function but this time for the F1-measure.
Finally, we can calculate the F1-measure on the dataset, starting with the original data before projection:
plot_f1measure_curve(F1_curve);
And also on the data after projection:
plot_f1measure_curve(F1_curve);
This shows that, for this dataset, the optimal number of clusters is $5$ according to the F1-measure. Interestingly, with this measure, the optimal value can be obtained directly from clustering on the original data.
Recommendation systems are everywhere now, from the video streaming platform that suggests a new show to watch, to the posts you see on social media. They all have something in common: they are supposed to be personalized and adapted to your taste.
In this last part we will build such a recommendation system to help you build decks in the Magic: The Gathering (MTG) trading card game. The interest of this game lies in the huge number of different cards to choose from, and in the similarly huge base of players to take inspiration from. Do not worry if you have never heard of the game before: since k-means is unsupervised, we can do this without having a clue about the structure of the game or even what the cards look like. The only thing we need to know is that each card has a name and that two different cards do not have the same name. We will simply look at the cards that top players choose, hoping that they know which cards go well together.
First we will load a dataset of decks from the mtgProject Github repository. The dataset contains $777$ decks from top MTG players. A deck is a list of $60$ cards, where a given card can appear at most $4$ times in the same deck. If you look at the loaded data table you will notice that it is a cell array containing strings and, as you know, strings are difficult to handle. Therefore the first step will be to rearrange the data.
In order to apply K-Means to the dataset, we first need to prepare the data by transforming the cell array into a numerical matrix. We will go even further by applying a special transformation.
Our interest is to be able to recommend cards to complete a partially filled deck. We will first transform the dataset so that each card is represented by its number of appearances in the different decks. For example, if we have $10$ decks and a card appears $3$ times in the second deck and $1$ time in the seventh, the corresponding row for that card will be:
[0,3,0,0,0,0,1,0,0,0]
At the end we should obtain an $N \times M$ matrix where $N$ is the number of unique cards in the dataset and $M$ the number of decks, i.e. $777$ here.
prepare_data.m function (2pts)

The function has the following signature:
function [X, unique_cards] = prepare_data(data)
The function should return both the transformed data in X and the array of unique card names in unique_cards.
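A hedged sketch of this transformation, assuming data is a cell array of card-name strings with one column per deck (e.g. 60 x 777):

function [X, unique_cards] = prepare_data(data)
% Sketch: X(n, m) = number of copies of card n (as indexed in unique_cards) in deck m.
unique_cards = unique(data(:));
N = length(unique_cards);
M = size(data, 2);
X = zeros(N, M);
for m = 1:M
    % Map the card names of deck m to their indices in unique_cards and count them.
    [~, idx] = ismember(data(:, m), unique_cards);
    for n = idx'
        X(n, m) = X(n, m) + 1;
    end
end
end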
The next step is to apply K-Means to the previously transformed dataset. We will arbitrarily choose $K = 8$. We obtain $8$ clusters corresponding to $8$ different archetypes of decks. Applying K-Means to the dataset gives the set of centroids $\mu^k$, the core of the recommendation system.
[labels, Mu, ~] = kmeans(X, K, init, type, MaxIter, plot_iter);
The next step is to extract a recommended card from the centroids of the clusters. Let us assume that we have a deck containing, for now, only one card: the card with index 367. To represent this, we create the deck as a vector of size $N$, full of zeros except for the card of interest.
deck = zeros(length(unique_cards), 1);
deck(367) = 1;
If we own multiple copies of the same card, the corresponding entry contains the number of copies we own.
deck = zeros(length(unique_cards), 1);
deck(367) = 2;
deck(48) = 1;
To recommend a card when the deck contains a single card, we should look at the row corresponding to this card in the centroids $\mu^k$ and select the cluster for which the value is closest to $1$.
When the deck contains several cards, the distance should be the sum of the distances for each card. The easiest way to achieve this would be to use the distance_to_centroids function implemented earlier. However, it cannot be applied directly to the deck vector, as this vector represents a partially filled deck. In distance_to_centroids, we assume that the sample comes from the same distribution as the cluster, i.e. that we have a full deck. This is not the case in our situation. Therefore, we need to create a new distance function.
deck_distance.m function (2pts)

The function has the following signature:
function [dist] = deck_distance(deck, Mu, type)
You can see this new distance function as the distance to the centroids restricted to the subset of rows corresponding to the non-zero cards of the deck. You should reuse the distance_to_centroids function, but be careful to correctly apply it on a subset of deck and Mu.
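A minimal sketch of this idea, assuming deck is an N x 1 vector and Mu an N x K matrix of centroids:

function [dist] = deck_distance(deck, Mu, type)
% Sketch: compare the partial deck only on the cards it actually contains.
owned = find(deck ~= 0);                                 % indices of cards present in the deck
dist = distance_to_centroids(deck(owned), Mu(owned, :), type);
end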
Thanks to this new distance function we can now estimate the archetype of our ongoing deck, i.e. its closest cluster. From this cluster, we can recommend other cards. There is not much we can do without adding prior knowledge, so we will simply sort the cards in descending order based on their values in the chosen $\mu^k$.
recommend_cards.m function (2pts)

The function has the following signature:
function [cards] = recommend_cards(deck, Mu, type)
In this function you should call the deck_distance function implemented previously to get the closest cluster $k$. From this cluster, you can extract the ids of all the cards where $\mu^k \neq 0$ and sort them by descending values of $\mu^k$.
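A possible sketch of this function, reusing deck_distance and returning the card indices sorted by decreasing centroid value:

function [cards] = recommend_cards(deck, Mu, type)
% Sketch: pick the closest archetype, then rank its non-zero cards.
dist = deck_distance(deck, Mu, type);        % distance from the deck to each centroid
[~, k] = min(dist);                          % closest cluster / archetype
candidates = find(Mu(:, k) ~= 0);            % card ids with non-zero value in this centroid
[~, order] = sort(Mu(candidates, k), 'descend');
cards = candidates(order);
end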
Finally, in the rest of the script tp2_kmeans_part3.m, we simply add the top recommended card to the deck until we reach the limit of $60$ cards, knowing that we can add at most $4$ copies of the same card. For a deck initially containing one copy of card 612 and two copies of card 48, this gives the following result:
display_deck(deck, unique_cards)
'Arcbound Ravager'
'Beast Within'
'Blackcleave Cliffs'
'Bloodbraid Elf'
'Bloodstained Mire'
'Dark Confidant'
'Forest'
'Inquisition of Kozilek'
'Lightning Bolt'
'Liliana of the Veil'
'Mountain'
'Mox Opal'
'Noble Hierarch'
'Razorverge Thicket'
'Serum Visions'
'Stomping Ground'
'Swamp'
'Tarmogoyf'
'Verdant Catacombs'
'Voice of Resurgence'
'Wooded Foothills'