This series of practicals focuses on the development of the machine learning algorithms introduced in the Applied Machine Learning course. In this first practical, you will code the Principal Component Analysis (PCA) algorithm in Matlab. You will then use the code to reduce the dimensionality of the Fisher-Iris dataset, to compress an image, and to apply it to real data.
Deadline: October 8, 2024 @ 8pm
Assignments must be turned in by the deadline. 1 point will be removed for each day late (a day late starts one hour after the deadline).
Procedure: From the course Moodle webpage, you must download and extract the .zip file named TP1-PCA-Assignment.zip. The folder contains the following elements:
Only the elements in blue need to be filled out. The rest are helper functions or data to be imported. The folder evaluation_functions contains encrypted functions that evaluate the code written and provide feedback on the results for part 1 of the assignment.
You must submit an archive as a .zip file with the following naming convention: TP1-PCA-SCIPER.zip, where SCIPER needs to be replaced by your own SCIPER number. You must submit only the files/folders in blue. It is unnecessary to include the folders plot_functions, data and evaluation_functions.
We take plagiarism very seriously. Submitting work that is not yours will result in a report being submitted to the competent EPFL authorities.
Principal Component Analysis (PCA) is a well-known dimensionality reduction technique which projects the data onto a new basis of lower dimension. It is used as:
Given a training dataset $\mathbf{X} \in \mathbb{R}^{N \times M}$ composed of $M$ datapoints with $N$ dimensions each, i.e. $\mathbf{X} = \{\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^M\}$ where $\mathbf{x}^i \in \mathbb{R}^N$, we would like to find a lower-dimensional projection $\mathbf{Y} \in \mathbb{R}^{p \times M}$ (i.e. $\mathbf{Y} = \{\mathbf{y}^1, \mathbf{y}^2, \dots, \mathbf{y}^M\}$ where $\mathbf{y}^i \in \mathbb{R}^p$) of $\mathbf{X}$ through a linear map $A: \mathbf{x} \in \mathbb{R}^{N} \rightarrow \mathbf{y} \in \mathbb{R}^{p}$, with $p \leq N$, given by
\begin{equation*} \label{eq:linearmap} \mathbf{y} = A\mathbf{x} \end{equation*}where the projection matrix $A \in \mathbb{R}^{p \times N}$ is found through Principal Component Analysis.
For a thorough description of PCA, its application and derivation, refer to the PCA slides from the Applied Machine Learning Course.
By computing the covariance matrix of the training data $C = \mathbb{E}\{ \mathbf{x}\mathbf{x^T} \}$ and extracting its eigenvalue decomposition $C = V \Lambda V^T$, the projection matrix $A$ is constructed as $A=V^T$ where $V = [e^1,\dots,e^N]$. To reduce dimensionality, one then chooses a subset of $p$ eigenvectors $e^i$ from $V$.
We will follow the steps of the PCA algorithm shown on slide 62 of the class, starting with the centering step,
\begin{equation} \mathbf{x} \rightarrow \mathbf{x} - \mathbb{E}\{\mathbf{x}\} \end{equation}where the expectation of a random variable $\mathbb{E}\{\mathbf{x}\}$ is in fact its mean.
The subtraction of the mean is necessary for the eigenvalue decomposition of the covariance matrix to yield the expected result. The expectation $\mathbb{E}\{\mathbf{x}\}$ of a multi-dimensional vector is thus the vector of the means of all dimensions $\mathbf{\mu_X} = [\mu_1;\mu_2;\dots;\mu_N]$. For each $j$-th dimension, its mean $\mu_j$ can be computed as follows:
\begin{equation*} \mu_j = \frac{1}{M} \sum^M_{i=1} x_j^i. \end{equation*}Once the data is demeaned, the next step in PCA is to compute the Covariance matrix $C = \mathbb{E}\{ \mathbf{x}\mathbf{x^T} \}$ of our centered (zero-mean) data, as follows:
\begin{equation} C = \frac{1}{M-1} \sum_{i=1}^{M} (\mathbf{x}^i) (\mathbf{x}^i)^T \label{eq:cov} \end{equation}The covariance matrix $C$ can be written in matrix formulation as follows:
\begin{equation*} C = \frac{1}{M-1} XX^{T} \end{equation*}The reason $C$ is normalized by $M-1$ instead of $M$ is that the true $\mathbb{E}\{\mathbf{x}\}$ is not known. According to Bessel's correction, one must use $M-1$ in order to obtain an unbiased estimate, because we use the sample mean $\mathbf{\mu_X}$ instead of the true mean. Refer to the Applied Multivariate Statistical Analysis book for further reading on this subject.
The principal components of our dataset can then be computed by extracting the eigenvalues and eigenvectors of the covariance matrix $C$. This is generally done by solving the following equations:
\begin{equation*} \qquad Ce^i = \lambda_i e^i \implies \det(C - \lambda I)=0 \end{equation*}which leads to the solution, \begin{equation} \label{eq:evd} C = V \Lambda V^T \end{equation}
where $V \in \mathbb{R}^{N \times N}$ is the matrix composed of the eigenvectors $e^i$ on each column and $\Lambda \in \mathbb{R}^{N \times N}$ is a diagonal matrix of eigenvalues $\lambda_i$.
After performing the decomposition, you also need to sort the eigenvectors and eigenvalues in descending order of the eigenvalues. As a reminder, the eigenvector with the highest eigenvalue is the one that captures the most variance of your data.
Implement the PCA algorithm described previously in the file compute_pca.m. The file contains a function with the following signature:
function [Mu, C, EigenVectors, EigenValues] = compute_pca(X)
It takes as input some data stored in the variable X and returns the elements detailed in steps 1, 2 and 3 of the PCA algorithm. The output EigenValues corresponds to the diagonal matrix of eigenvalues; to save storage space, it should be stored as a vector rather than as a diagonal matrix.
Implementation hint: You can look at the following functions to ease the implementation: mean, eig or svd, sort.
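For reference, a minimal sketch of these three steps (not necessarily the expected solution, and assuming X is passed as an $N \times M$ matrix with one datapoint per column, as in the notation above) could look like this:

function [Mu, C, EigenVectors, EigenValues] = compute_pca(X)
    [~, M] = size(X);                               % M datapoints, one per column
    Mu = mean(X, 2);                                % N x 1 mean vector (step 1)
    Xc = X - Mu;                                    % center the data (implicit expansion, R2016b+)
    C = (Xc * Xc') / (M - 1);                       % N x N covariance with Bessel's correction (step 2)
    [V, D] = eig(C);                                % eigendecomposition (step 3)
    [EigenValues, idx] = sort(diag(D), 'descend');  % eigenvalues as a vector, sorted in descending order
    EigenVectors = V(:, idx);                       % reorder the eigenvectors accordingly
end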
Evaluate your implementation: At this point, you can test that you have implemented the compute_pca.m function correctly by running the corresponding code blocks in the MATLAB script tp1_pca_part1.m. This will load the Iris dataset directly from Matlab. As a reminder, the Iris dataset contains four measurements of 150 Iris flowers. Features are stored in the meas array of dimensions $150 \times 4$. You should pay close attention to the dimensions of the meas array and the dimensions expected by your implementation.
One then chooses the number $p$ of components/eigenvectors to keep and selects the first $p$ columns of $V$ to generate the projection matrix $A_p$
\begin{equation} A_p = [e^1,e^2,\dots,e^p]^T \qquad \text{for} \qquad \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \end{equation}One can then project the zero-mean dataset to the lower-dimensional space as follows:
\begin{equation} Y = A_pX \end{equation}Recall that the projection matrix $A_p$ was computed with zero-mean data, hence $X$ must be the zero-mean dataset.
You must implement the projection in the file project_pca.m. The function to implement has the following signature:
function [Y, Ap] = project_pca(X, Mu, V, p)
Inputs Mu and V correspond to the mean and eigenvectors returned by the compute_pca function previously implemented. p is a parameter set in the tp1_pca_part1.m script. You can modify it there to observe its effects.
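As a possible sketch (assuming the same $N \times M$ data layout as in Part 1):

function [Y, Ap] = project_pca(X, Mu, V, p)
    Ap = V(:, 1:p)';        % first p eigenvectors, transposed: p x N projection matrix
    Y  = Ap * (X - Mu);     % project the centered data: p x M
end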
Evaluate your implementation: You should run the corresponding block in tp1_pca_part1.m to evaluate your implementation.
From the lower-dimensional output $Y$, one can reconstruct a lossy version $\mathbf{\hat{X}}$ of the original dataset $\mathbf{X}$ through mean-square projection as follows:
\begin{equation} \mathbf{\hat{X}} = A_p^{-1}\mathbf{Y} + \mathbb{E}\{\mathbf{X}\}. \end{equation}The file reconstruct_pca.m contains a function to be implemented to perform the reconstruction. This function has the following signature:
function [Xhat] = reconstruct_pca(Y, Ap, Mu)
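Since the rows of $A_p$ are orthonormal eigenvectors, $A_p^T$ acts as its (pseudo-)inverse, so a minimal sketch could be:

function [Xhat] = reconstruct_pca(Y, Ap, Mu)
    Xhat = Ap' * Y + Mu;    % back-project to the original space and add the mean back
end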
Evaluate your implementation: You should run the corresponding block in tp1_pca_part1.m to evaluate your implementation.
As the reconstruction $\mathbf{\hat{X}}$ is calculated from a lower dimension than the original data, some information has been lost in the process. One can evaluate the loss, or reconstruction error, from the norm:
\begin{equation} e_{rec} = ||\mathbf{X} - \mathbf{\hat{X}}||_2 \end{equation}The file reconstruction_error.m contains a function to be implemented to perform the error calculation. This function has the following signature:
function [err] = reconstruction_error(X, Xhat)
Implementation hint: You can look at the following functions to ease the implementation: norm.
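A one-line sketch using the matrix 2-norm; depending on the evaluation function, the Frobenius norm (norm(..., 'fro')) may be expected instead:

function [err] = reconstruction_error(X, Xhat)
    err = norm(X - Xhat, 2);    % norm of the reconstruction residual
end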
Evaluate your implementation: You should run the corresponding block in tp1_pca_part1.m to evaluate your implementation.
In the script file tp1_pca_part1.m, we apply PCA to the Iris flower dataset seen during the exercise session. The original dataset is displayed as a scatter matrix as follows:
plot_scatter_matrix(X, {'sepal length','sepal width','petal length','petal width'}, "Original dataset");
When projected onto the lower-dimensional space, the data can still be visualized using the same representation. However, the axes now correspond to eigenvectors and no longer have a physical meaning that is easy to interpret. Indeed, each eigenvector is a linear combination of all the features.
plot_scatter_matrix(Yproj, [], "Projection on eigenvectors with p=" + string(p));
Finally, we can also display the reconstructed dataset from the lower space.
plot_scatter_matrix(Xhat, {'sepal length','sepal width','petal length','petal width'}, "Reconstructed dataset");
As we can see, this results in some discrepancies with respect to the original dataset.
The number of principal components $p$ can be estimated so as to satisfy some constraints. One criterion is the reconstruction error calculated previously, e.g. accepting only $20\%$ of loss. Another criterion lies in the analysis of the eigenvalues.
When the values corresponding to each principal component (the eigenvalues) become constant and close to $0$, it could mean that the remaining components are possibly contaminated with noise and can therefore be considered irrelevant. This can be analyzed by extracting the diagonal values $\lambda_i$ of $\Lambda$ and plotting them against their corresponding eigenvector index.
If we plot the eigenvalues of the PCA applied to the Iris dataset, we obtain the following representation.
plot_eigenvalues(EigenValues);
In this case, eigenvalues $2$, $3$ and $4$ are almost constant and close to $0$; therefore, the first eigenvector is sufficient to represent the Iris dataset. For datasets with much higher dimensions, this approach can, however, be difficult to interpret.
Instead of searching for the minimum number of dimensions, another approach is to search for the optimal number of dimensions to keep in order to explain a certain amount of the variance of the dataset. The eigenvalues $\Lambda = [\lambda_1,\dots, \lambda_N]$ give a measure of the variance of the distribution of $X$ along each projection. If one wants to reduce the dimensionality of the dataset while keeping a reasonable amount of information about the data, one can find the appropriate $p$ that yields the desired percentage of explained variance $\sigma$ by normalizing the eigenvalues as follows:
\begin{equation} \lambda_i^{var} = \frac{\lambda_i}{\sum_{j=1}^{N}\lambda_j}. \end{equation}This corresponds to the percentage of the variance of the dataset explained by the $i$-th projection. One then computes a vector of the cumulative sum of the set of normalized eigenvalues
\begin{equation} \mathbf{\lambda_{cum}} = [\lambda_1^{var}, \lambda_1^{var} + \lambda_2^{var}, \lambda_1^{var} + \lambda_2^{var} + \lambda_3^{var}, \dots, \lambda_1^{var} + \dots + \lambda_N^{var}]. \end{equation}One can then use $\mathbf{\lambda_{cum}}$ to estimate the optimal $p$ for a desired percentage of explained variance $\sigma$ as
\begin{equation} p = \underset{k}{\text{argmin}}(\mathbf{\lambda_{cum,k}}) \quad \text{with} \quad \mathbf{\lambda_{cum,k}} > \sigma. \end{equation}This approach, applied to the Iris dataset with $\sigma = 0.95$, yields the following figure.
plot_explained_variance(ExpVar, CumVar, p_opt);
In this case, if we want to keep $95\%$ of the variance of the dataset, we will need to take at least two eigenvectors to represent the data.
The file explained_variance.m contains a function to be implemented to perform the explained variance calculation. This function has the following signature:
function [ExpVar, CumVar, p_opt] = explained_variance(EigenValues, var_threshold)
The var_threshold ($\sigma$ in the equations) is a parameter used to determine how many principal components should be kept. For example, setting it to 0.8 means that enough components to explain $80\%$ of the variance of the dataset will be kept.
Implementation hint: You can look at the following functions to ease the implementation: cumsum.
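Assuming EigenValues is already sorted in descending order, a minimal sketch could be:

function [ExpVar, CumVar, p_opt] = explained_variance(EigenValues, var_threshold)
    ExpVar = EigenValues / sum(EigenValues);     % normalized eigenvalues
    CumVar = cumsum(ExpVar);                     % cumulative explained variance
    p_opt  = find(CumVar > var_threshold, 1);    % smallest p whose cumulative variance exceeds the threshold
end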
Evaluate your implementation: You should run the corresponding block in tp1_pca_part1.m to evaluate your implementation.
This concludes the implementation of the PCA algorithm and part 1 of the assignment. In parts 2 and 3, you will apply the implemented functions to an image compression problem and to real data.
We can use the ideas behind PCA to perform image compression. Images are made of side-by-side pixels holding color values. As an image is not made of random colors and shapes, it is reasonable to assume that there is some underlying structure. In other words, given a pure white pixel in an image, the chances that the surrounding pixels are white or some variant of white are high. Compressing an image amounts to removing this superfluous information. In order not to lose quality, we need to find a subspace that requires fewer values to be stored while still allowing a reconstruction of the same quality level as the original data. Hence the use of PCA.
An image is a $height \times width \times 3$ matrix of decimal values, where the last dimension corresponds to the Red, Green and Blue color channels respectively. To simplify the calculation, one can perform PCA on each channel individually, considering them independent, and over the $width$, i.e. without the need to reshape or transpose the image. More precisely, for each channel, the image's columns are treated as the sample vectors (datapoints) in the sense of Part 1 of this assignment. Compression can be achieved by performing the following steps for each channel:
The file compress_image.m contains a function to be implemented to perform the compression of the image given in parameters. This function has the following signature:
function [cimg, ApList, muList] = compress_image(img, p)
The outputs cimg, ApList and muList are placeholders to store $Y$, $A_p$ and $\mu$, respectively, of the PCA run over each independent channel. They are all arrays of matrices and their last dimension should therefore be $3$, i.e. the number of channels, as for the input image.
Implementation hint: Be careful with the dimensions of the input and output variables. The expected output dimensions are given in the documentation of the function.
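As a rough sketch, each channel can simply be passed through the Part 1 functions; the dimensions below (cimg: $p \times width \times 3$, ApList: $p \times height \times 3$, muList: $height \times 1 \times 3$) are an assumption, so check the function documentation for the exact expected layout:

function [cimg, ApList, muList] = compress_image(img, p)
    [h, w, nChannels] = size(img);
    cimg   = zeros(p, w, nChannels);
    ApList = zeros(p, h, nChannels);
    muList = zeros(h, 1, nChannels);
    for c = 1:nChannels
        X = img(:, :, c);                        % columns of the channel are the datapoints
        [Mu, ~, V, ~] = compute_pca(X);          % reuse the Part 1 implementation
        [Y, Ap] = project_pca(X, Mu, V, p);
        cimg(:, :, c)   = Y;
        ApList(:, :, c) = Ap;
        muList(:, :, c) = Mu;
    end
end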
One can estimate the memory necessary to store the compressed image by looking at the dimensions of the stored outputs $Y$, $A_p$ and $\mu$ that are needed when reconstructing the image. The compressed size corresponds to the total number of elements in each of the matrices multiplied by the size of a single element. As each element is a decimal number, we can assume that it is stored with 64 bits (double precision). The compression rate is then calculated as:
\begin{equation} c_r = 1 - \frac{s_Y + s_{A_p} + s_{\mu}}{s_{img}} \end{equation}where $s$ is a function that calculates the number of bits needed to store all the elements in the different matrices.
The file compression_rate.m contains a function to be implemented to perform the calculation of the compression rate from the given parameters. This function has the following signature:
function [cr, compressedSize] = compression_rate(img,cimg,ApList,muList)
The output compressedSize corresponds to the size needed to store all the elements required for reconstruction, expressed in megabits. The factor to convert bits to megabits is given in the function.
Implementation hint: You can look at the following functions to ease the implementation: numel.
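A sketch under the assumption of 64-bit elements; the bits-to-megabits factor below is a placeholder, use the one provided in the function template:

function [cr, compressedSize] = compression_rate(img, cimg, ApList, muList)
    bitsPerElement = 64;                                                   % double precision
    originalBits   = numel(img) * bitsPerElement;
    compressedBits = (numel(cimg) + numel(ApList) + numel(muList)) * bitsPerElement;
    cr = 1 - compressedBits / originalBits;                                % compression rate
    bitsToMegabits = 1e-6;                       % placeholder: replace with the factor given in the template
    compressedSize = compressedBits * bitsToMegabits;
end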
The elements stored after PCA, i.e. $Y$, $A_p$ and $\mu$, are all we need to store the image in the lightest possible way. However, if someone wants to visualize the corresponding image, a reconstruction step is necessary. Similarly to the image compression, one should now apply the reconstruction step of the PCA to each of the independent channels.
The file reconstruct_image.m contains a function to be implemented to perform the reconstruction of the compressed image. This function has the following signature:
function [rimg] = reconstruct_image(cimg, ApList, muList)
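A possible sketch, applying the Part 1 reconstruction channel by channel (with the same dimension assumptions as in compress_image):

function [rimg] = reconstruct_image(cimg, ApList, muList)
    nChannels = size(cimg, 3);
    rimg = zeros(size(ApList, 2), size(cimg, 2), nChannels);   % height x width x 3
    for c = 1:nChannels
        rimg(:, :, c) = reconstruct_pca(cimg(:, :, c), ApList(:, :, c), muList(:, :, c));
    end
end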
Evaluate your implementation: You should run the corresponding block in tp1_pca_part2.m to evaluate your implementation.
Running the full script tp1_pca_part2.m should now output three reconstructed images with different numbers of principal components.
plot_image_reconstruction(img)
Although quite unusual, this is an interesting application of PCA, which is primarily a technique to reduce the dimensionality of input data. In the final part of this practical, we will see that it can also be used to gain some insights into the data we are analyzing.
Cryptocurrencies are a new market for financial traders. There exist many different coins that are exchanged with one another at fluctuating conversion rates. The most famous one, bitcoin, trades at around $20000\$$ per coin at the time of writing. The website coinmarketcap keeps track of all the exchanges made between cryptocurrencies, calculates an exchange value in dollars, and provides some other interesting metrics.
Since the explosion of the bitcoin price over the last years and the ever-growing number of cryptocurrencies on the market, it is widely believed that the trend of bitcoin, i.e. its fluctuations in terms of dollar price, is a strong indicator of the trends of the other coins. In other terms, when the price of bitcoin goes up, the prices of the other coins follow.
If this claim is true, then running a PCA over the prices of many coins should require only one eigenvector, and the projection onto this eigenvector should closely resemble the bitcoin data.
To perform this analysis, we will first load the cryptocurrencies_prices.csv data file, which contains the prices of 10 cryptocurrencies among the top ones listed on coinmarketcap, over a common period of 200 days between 2017 and 2018.
When loading the dataset included in cryptocurrencies_prices.csv and applying a PCA, we can check how much variance is explained by the eigenvectors:
plot_explained_variance(ExpVar, CumVar, p_opt);
It turns out that all of the variance is explained by a single eigenvector, which seems to corroborate the assumption we made previously. Now, if we plot the projection of the dataset onto this eigenvector and compare it with the bitcoin data, we obtain the following:
plot_bitcoin_comparison(data, Yproj)
This is close to a perfect match! So it means that bitcoin is the king of cryptocurrencies and the variance of all the other coins only depends on the variance of bitcoin, right? Well, maybe you should not rush to that conclusion and invest all your savings just yet. Have you looked at the data in the table variable? You probably should...
If we look at the data given in cryptocurrencies_prices.csv, we observe that the price of bitcoin varies much more, and over much larger ranges, than the price of some other coins. Therefore, it is not a surprise that the previous analysis led to such results. To account for this, one should first perform a normalization step prior to running the PCA analysis.
There exist two common types of normalization. The first one, often referred to as minmax normalization, is performed using the following formula:
\begin{equation} X' = \frac{X - x_{min}}{x_{max} - x_{min}} \end{equation}This normalization ensures that all the values of each feature lie between $0$ and $1$, while maintaining the relative variations of each feature present in the dataset. This means that outliers are not handled by this normalization technique.
On the other hand, the second normalization, named z-score, ensures that each feature has zero mean and unit standard deviation. This is achieved using the following formula:
\begin{equation} X' = \frac{X - \mu}{\sigma} \end{equation}where $\mu$ is the mean of the data in $X$ and $\sigma$ is the standard deviation.
The file normalize.m contains a function to be implemented to perform the normalization step. This function has the following signature:
function [X, param1, param2] = normalize(data, normalization, param1, param2)
The input normalization is a string that can take the following values: minmax, zscore, or none. Depending on this value, your implementation should output the corresponding normalized data in X and the normalization parameters as:
- param1 = Xmin and param2 = Xmax for the minmax normalization
- param1 = mu and param2 = std for the zscore normalization
- param1 = 0 and param2 = 0 for no normalization
If param1 and param2 are given as input parameters, they should not be recalculated but used directly in the calculations. This allows reusing already calculated parameters on new data that were not part of the original dataset.
Implementation hint: You can look at the following functions to ease the implementation: min, max, mean, std. Be careful that the values calculated in param1 and param2 should correspond to the min/mean and max/std values of each feature independently; therefore, they should be vectors of dimension equal to the number of features. To check whether a parameter has been given as input to the function, you can use nargin.
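A minimal sketch, assuming features are stored column-wise so that min, max, mean and std operate per feature by default (adapt it to the orientation used in the script):

function [X, param1, param2] = normalize(data, normalization, param1, param2)
    switch normalization
        case 'minmax'
            if nargin < 4                  % parameters not provided: compute them
                param1 = min(data);        % per-feature minimum
                param2 = max(data);        % per-feature maximum
            end
            X = (data - param1) ./ (param2 - param1);
        case 'zscore'
            if nargin < 4
                param1 = mean(data);       % per-feature mean
                param2 = std(data);        % per-feature standard deviation
            end
            X = (data - param1) ./ param2;
        otherwise                          % 'none'
            X = data;
            param1 = 0;
            param2 = 0;
    end
end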
Evaluate your implementation: You should run the corresponding block in tp1_pca_part3.m to evaluate your implementation.
If one normalizes data, one should also be able to apply the inverse process, i.e. denormalize the data using the parameters calculated during the normalization process.
The file denormalize.m contains a function to be implemented to perform the denormalization step. This function has the following signature:
function [Xinversed] = denormalize(X, param1, param2, normalization)
Here again, the normalization input is a string that can take the values minmax, zscore or none. param1 and param2 correspond to the parameters calculated during the normalization step.
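A matching sketch, inverting each formula with the stored parameters:

function [Xinversed] = denormalize(X, param1, param2, normalization)
    switch normalization
        case 'minmax'
            Xinversed = X .* (param2 - param1) + param1;   % undo the minmax scaling
        case 'zscore'
            Xinversed = X .* param2 + param1;              % undo the zscore scaling
        otherwise                                          % 'none'
            Xinversed = X;
    end
end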
Evaluate your implementation: You should run the corresponding block in tp1_pca_part3.m to evaluate your implementation.
Let us now run the analysis again, but this time after normalization. First, we will see the effects of the minmax normalization:
subplot(1,2,1)
plot_explained_variance(ExpVar, CumVar, p_opt);
subplot(1,2,2)
plot_bitcoin_comparison(X, Yproj)
As it turns out, to keep $95\%$ of the variance, you now need at least $7$ eigenvectors, instead of only $1$ previously. And when projecting on the first eigenvector, the comparison with the bitcoin price is not so evident anymore. What about the zscore normalization then?
subplot(1,2,1)
plot_explained_variance(ExpVar, CumVar, p_opt);
subplot(1,2,2)
plot_bitcoin_comparison(X, Yproj)
It seems to yield similar results. This clearly shows that normalization is an important process and is sometimes necessary. But this does not mean that you should always normalize the data: sometimes, preserving the fact that one feature varies much more than the others can be desirable.