Machine Learning Programming¶

Assignment 5: Gaussian Mixture Model Applications (GMMApps)¶

This series of practicals focuses on the implementation of machine learning algorithms introduced in the Applied Machine Learning course. In this practical, you will apply the previously programmed Gaussian Mixture Model (GMM) algorithm to classification, resampling, and regression problems.

Deadline: December 10, 2024 @ 8pm

Assignments must be turned in by the deadline. 1/6 point will be removed for each day late (a day late starts one hour after the deadline).

Procedure: From the course Moodle webpage, you must download and extract the .zip file named TP5-GMMApps-Assignment.zip. The folder contains the following elements:

  • tp5_gmmapps_part1.m
  • tp5_gmmapps_part2.m
  • tp5_gmmapps_part3.m
  • functions
    • part1
    • part2
    • part3
  • plot_functions
  • data
  • utils

Only the elements in blue need to be filled out. The rest are helper functions or data to be imported. This assignment will be less guided than the previous ones. There are no evaluation functions provided.

You must submit an archive as a .zip file with the following naming convention TP5-GMMApps-SCIPER.zip, where SCIPER needs to be replaced by your own SCIPER number. You must submit only the files/folders in blue. DO NOT INCLUDE the folders plot_functions, data, and utils.

We take plagiarism very seriously. Submitting work that is not yours will result in a report submitted to the competent EPFL authorities.

Part 1: Classification with GMM+Bayes¶

In the previous assignment, we implemented the GMM model to fit a dataset in a completely unsupervised manner. To recall, given a dataset $\mathbf{X} = \{\mathbf{x}^1,\dots,\mathbf{x}^M\}$ where $\mathbf{x}^i \in \mathbb{R}^N$, the probability density function (pdf) of a $K$-component GMM is of the form,

\begin{equation} p(\mathbf{x}|\Theta) = \sum\limits_{k=1}^{K} \alpha_k p(\mathbf{x}| \mathbf{\mu}^k, \mathbf{\Sigma}^k) \end{equation}

where $p(\mathbf{x}| \mathbf{\mu}^k, \mathbf{\Sigma}^k)$ is the multivariate Gaussian pdf with mean $\mathbf{\mu}^k$ and covariance $\mathbf{\Sigma}^k$

\begin{equation} \begin{aligned} p(\mathbf{x}|\mu^k,\Sigma^k) & = \frac{1}{(2\pi)^{N/2}|\Sigma^k|^{1/2}} \exp \left\lbrace -\frac{1}{2}(\mathbf{x} - \mu^k)^{T} ( \Sigma^k )^{-1} (\mathbf{x} - \mu^k) \right\rbrace\\ \end{aligned}. \end{equation}

$\Theta = \{\theta^1, \dots, \theta^K\}$ is the complete set of parameters, with $\theta^k = \{\alpha_k, \mu^k, \Sigma^k\}$, where $\alpha_k$ are the priors (or mixing weights) of each Gaussian component, satisfying the constraint $\sum_{k=1}^{K}\alpha_k = 1$. These parameters are typically learned through the iterative Expectation-Maximization algorithm, which was implemented in the previous assignment.
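As a reminder of how this density can be evaluated in code, below is a minimal MATLAB sketch. It assumes the storage convention of the previous assignment (Priors as a 1xK row vector, Mu as an NxK matrix, Sigma as an NxNxK array, one datapoint per column of X) and the availability of mvnpdf from the Statistics and Machine Learning Toolbox; the function name gmm_pdf is only illustrative.

In [ ]:
% Minimal sketch: evaluate a K-component GMM density at the columns of X.
% Storage convention (Priors 1xK, Mu NxK, Sigma NxNxK) is assumed from the
% previous assignment; mvnpdf comes from the Statistics and Machine Learning Toolbox.
function pdf = gmm_pdf(X, Priors, Mu, Sigma)
    M = size(X, 2);                  % number of datapoints (one per column)
    K = length(Priors);
    pdf = zeros(1, M);
    for k = 1:K
        % mvnpdf expects one datapoint per row, hence the transposes
        pdf = pdf + Priors(k) * mvnpdf(X', Mu(:, k)', Sigma(:, :, k))';
    end
end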

In this part, we want to decide which class each datapoint belongs to, using a model of each class learnt beforehand; in other words, classification. To perform classification, we need a training dataset (i.e. a labeled dataset to build our model) and a testing dataset to evaluate the performance of the algorithm. To perform classification with GMMs, the first step is to learn a GMM for each class from the training set. Given a model for each class, we can use the Gaussian Maximum Likelihood Discriminant Rule to classify the testing datapoints.

Learning a GMM for each class¶

You will have access to a library of functions in the utils/datasets folder that generate datasets of 2 to 4 classes as shown below. Those functions return sets of $(x,y)$ points where the class labels $y$ are stored in the last column of the returned variable. Each function can take optional parameters, but the easiest usage is to rely on the default values by passing no arguments. An example of calling these functions is given in the file datasetsdemo.m and produces the following figure.

In [6]:
datasetsdemo()

Using the previously developed functions for learning the parameters of a Gaussian Mixture model via EM algorithm, one can learn one GMM for each of the classes on the training dataset.

Gaussian Maximum Likelihood Discriminant Rule¶

A maximum likelihood classifier chooses the class label that is the most likely. For a binary classification problem, a new datapoint $\mathbf{x'}$ belongs to class $1$ if the corresponding posterior probability is greater than that of class $2$,

\begin{equation} p(y=1|\mathbf{x'}) > p(y=2|\mathbf{x'}). \end{equation}

By Bayes, we have \begin{equation} p(y=i|\mathbf{x'}) = \dfrac{p(\mathbf{x'}|y = i) p(y = i)}{p(\mathbf{x'})}, \text{with class $i = 1, 2$}. \end{equation}

Replacing this expression in the Maximum Likelihood (ML) Discriminant Rule (the evidence $p(\mathbf{x'})$ cancels on both sides), $\mathbf{x'}$ belongs to class $1$ if

\begin{equation} p(\mathbf{x'}|y = 1) p(y = 1) > p(\mathbf{x'}|y = 2) p(y = 2). \end{equation}

In general, one can assume equal class distributions, i.e. $p(y=1) = p(y=2)$. Substituting into the previous equation, the discriminant rule becomes

\begin{equation} p(\mathbf{x'}|y = 1) > p(\mathbf{x'}|y = 2). \end{equation}

For a GMM composed of $K^i$ multivariate Gaussians, the conditional density of belonging to class $i$ takes the same form as the GMM pdf above,

\begin{equation} p(\mathbf{x'}|y = i) = \sum\limits_{k=1}^{K^i} \alpha_k p(\mathbf{x'}| \mathbf{\mu}^k, \mathbf{\Sigma}^k), \end{equation}

with $p(\mathbf{x'}| \mathbf{\mu}^k, \mathbf{\Sigma}^k)$ defined according to the multivariate Gaussian pdf given earlier.

In the case of a multi-class problem with classes $i = 1, \dots, I$, the ML Discriminant Rule selects the class that minimizes the negative log-likelihood (which is equivalent to maximizing the likelihood). Assuming equal class distributions, the ML Discriminant Rule is:

\begin{equation} c(\mathbf{x'}) = \mathop{\mathrm{argmin}}\limits_{i} \left\lbrace -\log p(\mathbf{x'}|y = i) \right\rbrace. \end{equation}

If one does not assume equal class distributions, the Discriminant Rule becomes:

\begin{equation} c(\mathbf{x'}) = \mathop{\mathrm{argmin}}\limits_{i} \left\lbrace -\log\left(p(\mathbf{x'}|y = i)\,p(y=i)\right) \right\rbrace. \end{equation}

Task 1: Implement initialize_parameters.m function (2pts)¶

The function has the following signature:

In [ ]:
function [data, Xtrain, Ytrain, Xtest, Ytest, params] = initialize_parameters(dataset_type)

Given a string dataset_type which indicates the dataset to select (i.e. the function to call from the utils/datasets folder, among halfkernel, corners, and twospirals), this function returns the generated data and the split training and testing datasets. It also initializes the structure of hyperparameters params needed for the GMM algorithm. You have to provide a set of parameters that allows learning a GMM model on each of the three datasets without overfitting or underfitting. The parameters to fill in are:

  • training/testing ratio: the ratio between the training and testing datasets, given as a single double value. Be careful about its meaning; 0.5 is not a valid option.
  • number of Gaussians k: an integer value representing the number of Gaussian components to use in the GMM.
  • type of covariance matrix: a string (iso, diag or full) representing the type of covariance matrix of the Gaussians.

This choice of parameters, especially the number of Gaussians and the type of covariance, has to take into account the complexity of the chosen model. These parameters vary depending on the selected dataset.

Note: We do not ask you to compute evaluation metrics that would automatically select the most suitable combination of hyperparameters; this will be done in Part 3. Here we only ask for an educated guess, and multiple answers are possible.

Useful functions: all the dataset generation functions from the utils/datasets folder, and the split_data function from the utils folder.
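Purely as an illustration, one possible skeleton of this function is sketched below. The hyperparameter values are placeholders that you must tune per dataset, and the params field names as well as the split_data calling convention are assumptions to be checked against the provided scripts.

In [ ]:
% Illustrative skeleton only: hyperparameter values are placeholders, and the
% params field names and split_data signature are assumptions (check the scripts).
function [data, Xtrain, Ytrain, Xtest, Ytest, params] = initialize_parameters(dataset_type)
    switch dataset_type
        case 'halfkernel'
            data = halfkernel();            % default arguments, labels in last column
            params.k = 4;                   % placeholder: number of Gaussians
            params.cov_type = 'full';       % placeholder: 'iso', 'diag' or 'full'
        case 'corners'
            data = corners();
            params.k = 2;
            params.cov_type = 'diag';
        case 'twospirals'
            data = twospirals();
            params.k = 10;
            params.cov_type = 'full';
    end
    train_ratio = 0.7;                      % placeholder; mind what the ratio means (0.5 is not valid)
    [Xtrain, Ytrain, Xtest, Ytest] = split_data(data, train_ratio);   % signature assumed
end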

Task 2: Implement gmm_models.m function (2pts)¶

The function has the following signature:

In [ ]:
function [models] = gmm_models(Xtrain, Ytrain, params)

Using the previously developed functions for learning the parameters of a Gaussian Mixture Model via the EM algorithm, learn one GMM per class on the training dataset. Here we assume that every GMM uses the same hyperparameters (number of Gaussians and type of covariance matrix).

The function should return the learned parameters $(\alpha, \mu, \Sigma)$ for all the GMMs, stored in the output structure array models.

Useful function: gmmEM
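A minimal sketch of the per-class loop is given below; the exact calling convention of gmmEM and the field names of models are assumptions based on the previous assignment, so adapt them to your own implementation.

In [ ]:
% Sketch of learning one GMM per class. The gmmEM signature and the models
% field names are assumed; one datapoint per column of Xtrain is also assumed.
function [models] = gmm_models(Xtrain, Ytrain, params)
    labels = unique(Ytrain);
    for i = 1:length(labels)
        Xi = Xtrain(:, Ytrain == labels(i));    % datapoints of the i-th class
        [Priors, Mu, Sigma] = gmmEM(Xi, params.k, params.cov_type);   % assumed signature
        models(i).Priors = Priors;
        models(i).Mu     = Mu;
        models(i).Sigma  = Sigma;
    end
end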

Running the script tp5_gmmapps_part1.m produces an image of the contour of the learned GMM for each of the class labels. For example, on the halfkernel dataset:

In [18]:
plot_fitted_dataset(Xtrain, Ytrain, models)

Note: The image you will obtain obviously depends on the hyperparameters you have selected to train the GMM. Therefore, it might differ from the one above.

Task 3: Implement gmm_classifier.m function (3pts)¶

The function has the following signature:

In [ ]:
function [Yest] = gmm_classifier(Xtest, models, labels)

Implement the ML Discriminant Rule to estimate a label Yest for each datapoint in Xtest, assuming equal class distributions. For this assignment you do not have to consider the case of an unbalanced dataset, but bear in mind that this requires extra care should you encounter it in the future. The labels argument is a vector containing the mapping between the index of each GMM and the associated class label. Remember that indexing starts at 1 in MATLAB, but index 1 could correspond to label 0, for example.
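A possible structure for the classifier is sketched below. It relies on a helper that evaluates a GMM density over a set of points (here the illustrative gmm_pdf sketched earlier, or your own likelihood function from the previous assignment) and shows how labels maps the index of the winning GMM back to the actual class label.

In [ ]:
% Sketch of the ML Discriminant Rule with equal class priors. gmm_pdf is the
% illustrative helper sketched in Part 1 (or your own GMM likelihood function).
function [Yest] = gmm_classifier(Xtest, models, labels)
    M = size(Xtest, 2);                       % assuming one datapoint per column
    negLogLik = zeros(length(models), M);
    for i = 1:length(models)
        p = gmm_pdf(Xtest, models(i).Priors, models(i).Mu, models(i).Sigma);
        negLogLik(i, :) = -log(p);            % -log p(x'|y = i)
    end
    [~, idx] = min(negLogLik, [], 1);         % argmin over the classes
    Yest = labels(idx);                       % map GMM index back to class label
end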

We can observe the results by plotting the boundaries of the classification on the testing dataset.

In [21]:
plot_boundaries(Xtrain, Ytrain, Xtest, Ytest, Yest,  models);

Part 2: Resampling from learned GMMs¶

One of the biggest problems in machine learning is the cost of gathering data, coupled with the fact that complex ML algorithms require a lot of it. The beauty of GMMs is that they are able to learn the underlying structure of a given set of points, which is often referred to as density estimation. That is to say, the result of fitting a GMM to some data is technically not a clustering model, but a generative probabilistic model describing the distribution of the data. Therefore, one can draw samples from such a probabilistic model, i.e. generate new data points that are not part of the dataset but share its underlying structure. With this, one can create synthetic data to extend a dataset, or solve the unbalanced dataset problem by generating more samples of the under-represented classes.

Drawing samples from a GMM means using the probabilistic model as a random generator. This is a two-step process. First, one must select which Gaussian $k$ to draw a sample from, based on the prior probabilities $\alpha$ of the GMM; a Gaussian with a higher prior probability is more likely to be selected. Once a Gaussian is selected, a datapoint is drawn from the multivariate distribution given by the parameters $\mu$ and $\Sigma$ of the selected Gaussian.

Task 4: Implement sample_from_gmm.m function (2pts)¶

The function has the following signature:

In [ ]:
function [XNew] = sample_from_gmm(gmm, nbSamples)

It takes as input the gmm structure that contains the parameters of the learned GMM, i.e. $\alpha$, $\mu$ and $\Sigma$. The second argument nbSamples corresponds to the number of new samples to generate from the learned GMM.

Useful functions: randsrc(), mvnrnd()
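A minimal sketch of the two-step sampling procedure follows; it assumes the gmm structure exposes fields Priors (1xK row vector), Mu (NxK) and Sigma (NxNxK), and uses randsrc (Communications Toolbox) and mvnrnd (Statistics and Machine Learning Toolbox).

In [ ]:
% Sketch of two-step sampling: pick a component according to the priors, then
% draw from the corresponding Gaussian. Field names of gmm are assumed.
function [XNew] = sample_from_gmm(gmm, nbSamples)
    K = length(gmm.Priors);
    N = size(gmm.Mu, 1);
    % Step 1: draw component indices with probabilities given by the priors
    comp = randsrc(1, nbSamples, [1:K; gmm.Priors]);
    % Step 2: draw one datapoint from each selected Gaussian
    XNew = zeros(N, nbSamples);
    for j = 1:nbSamples
        k = comp(j);
        XNew(:, j) = mvnrnd(gmm.Mu(:, k)', gmm.Sigma(:, :, k))';
    end
end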

By running the script tp5_gmmapps_part2.m you can test this approach on the datasets from part 1. For example, on the halfkernel dataset, if we use k=4, this produces:

In [31]:
scatter(XNew(1,:), XNew(2,:), dotsize);

As we can see, because the model underfits the dataset, this results in poorly representative samples. Increasing to k=25 yields better results:

In [33]:
scatter(XNew(1,:), XNew(2,:), dotsize);

Sampling high dimensional data¶

We can also apply the same method to a high-dimensional dataset such as the MNIST handwritten digits dataset:

In [37]:
plot_digits(Xtrain)

Because the number of dimensions is too high, one must first apply the PCA algorithm to the data and learn the GMM in the latent space (the result of the PCA). Applying the previous function with k=200 gives:

In [39]:
plot_digits(XHat)
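For reference, the pipeline that produces such samples might look like the sketch below, using MATLAB's built-in pca for the projection. The latent dimensionality, the gmmEM call and the variable layout (one image per column of Xtrain) are illustrative assumptions.

In [ ]:
% Illustrative pipeline: project the images to a latent space with PCA, fit a
% GMM there, sample, and project back to pixel space. Latent dimension, gmmEM
% signature and the one-image-per-column layout of Xtrain are assumptions.
[coeff, score, ~, ~, ~, mu] = pca(Xtrain');           % pca expects one sample per row
latentDim = 50;                                       % illustrative latent dimensionality
Z = score(:, 1:latentDim)';                           % latent data, one sample per column

[Priors, Mu, Sigma] = gmmEM(Z, 200, 'full');          % k = 200 as in the example above
gmm.Priors = Priors;  gmm.Mu = Mu;  gmm.Sigma = Sigma;
ZNew = sample_from_gmm(gmm, 64);                      % draw 64 new latent samples

XHat = coeff(:, 1:latentDim) * ZNew + mu';            % reconstruct in pixel space
plot_digits(XHat)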

Sampling with supervised learning¶

Instead of using an unsupervised approach, we can also rely on the fact that we know the labels associated with the existing samples and draw new samples for a specific class. To do that, we will use the per-class modeling approach from Part 1 and apply it to the MNIST dataset. Instead of sampling from a single GMM learned over the whole dataset, we will now sample from 10 GMMs (one GMM learned per class).

Task 5: Implement sample_from_gmmModels.m function (2pts)¶

The function has the following signature:

In [ ]:
function [XNew] = sample_from_gmmModels(models, nbSamplesPerClass, desiredClass)

The desiredClass argument is optional and, if provided, allows generating samples of a single desired class only. If not provided, the function should generate samples for all the classes. Using this supervised approach yields the following results:

In [42]:
plot_digits(XHat)

When asking specifically for class 4, we obtain:

In [49]:
plot_digits(XHat)
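Coming back to Task 5, one possible structure for sample_from_gmmModels is sketched below. It reuses sample_from_gmm from Task 4 and assumes that desiredClass can be used directly as an index into models; adapt the label-to-index mapping to the way your models array is built.

In [ ]:
% Sketch: sample per class, either for one desired class or for all classes.
% Assumes models(i) holds the GMM of the i-th class and that desiredClass can
% index models directly (adapt the label-to-index mapping to your own code).
function [XNew] = sample_from_gmmModels(models, nbSamplesPerClass, desiredClass)
    if nargin < 3 || isempty(desiredClass)
        classes = 1:length(models);     % no class requested: sample every class
    else
        classes = desiredClass;         % sample only the requested class(es)
    end
    XNew = [];
    for i = classes
        Xi = sample_from_gmm(models(i), nbSamplesPerClass);
        XNew = [XNew, Xi];              %#ok<AGROW> concatenate samples column-wise
    end
end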

Part 3: Gaussian Mixture Regression (GMR)¶

Regression problems are a category of supervised learning problems that seek to approximate the true mapping function $\mathbf{y}=f(\mathbf{x})$ from inputs $\mathbf{x}$ to outputs $\mathbf{y}$ given a set of training data. As opposed to classification, in regression the output variables are continuous, $\mathbf{y}\in\mathbb{R}^P$. Such regression problems are found in a wide spectrum of applications, from finance to robotics.

Any regression problem is posed as follows: given a training dataset of inputs $\mathbf{X} = \{ \mathbf{x}^1,\dots,\mathbf{x}^M \}$ of $M$ samples, where $\mathbf{x}^{i}\in \mathbb{R}^N$ are $N$-dimensional inputs, and a set of corresponding outputs $\mathbf{y} = \{ \mathbf{y}^1,\dots,\mathbf{y}^M \}$, where $\mathbf{y}^i \in \mathbb{R}^P$ are continuous $P$-dimensional outputs, we seek to approximate a continuous mapping function $\hat{f}: \mathbb{R}^N \rightarrow \mathbb{R}^P$, $\hat{\mathbf{y}}=\hat{f}(\mathbf{x})$, that best recovers the true regressive function $\mathbf{y}=f(\mathbf{x})$.

Many linear/non-linear methods exist to obtain this regressive function $f(\mathbf{x})$. In this assignment we will implement one that builds on our previously implemented GMMs: Gaussian Mixture Regression (GMR). For further reading on these regression methods, refer to the Regression Slides or the lecture notes from the Applied Machine Learning course.

Gaussian Mixture Regression is a parametric model-based method, capable of learning a regressive function of non-linear datasets. It stems from estimating a joint density of the inputs and outputs through a $K$-component Gaussian mixture model as follows:

\begin{equation} p(\mathbf{x},\mathbf{y} |\Theta ) = \sum\limits_{k=1}^{K}\alpha_k p(\mathbf{x},\mathbf{y}|\mathbf{\mu}^k, \mathbf{\Sigma}^k), \end{equation}

where $\alpha_k$ are the priors of each Gaussian component parametrized by $\{\mathbf{\mu}^k, \mathbf{\Sigma}^k\}$, with

\begin{equation} \sum\limits_{k=1}^{K}\alpha_k = 1, \qquad \mu^k = \begin{bmatrix} \mathbf{\mu}^k_{\mathbf{x}} \\ \mathbf{\mu}^k_{\mathbf{y}} \end{bmatrix}, \qquad \mathbf{\Sigma}^k = \begin{bmatrix} \mathbf{\Sigma}^k_{\mathbf{xx}} & \mathbf{\Sigma}^k_{\mathbf{xy}}\\ \mathbf{\Sigma}^k_{\mathbf{yx}} & \mathbf{\Sigma}^k_{\mathbf{yy}} \end{bmatrix}. \end{equation}

Given this joint density, one can compute the conditional density $p(\mathbf{y}|\mathbf{x})$ which will provide our regressive function $y = f(x)$,

\begin{equation} p(\mathbf{y}|\mathbf{x}) = \sum\limits_{k=1}^{K}\beta_k p(\mathbf{y}|\mathbf{x};\mathbf{\mu}^k, \mathbf{\Sigma}^k) \end{equation}

where

\begin{equation} \label{eq:beta} \beta_k = \frac{\alpha_k p(\mathbf{x}|\mathbf{\mu_x}^k, \mathbf{\Sigma_{xx}}^k)}{\sum\limits_{j=1}^{K} \alpha_j p(\mathbf{x}|\mathbf{\mu_x}^j, \mathbf{\Sigma_{xx}}^j)}. \end{equation}

The $\beta_k$ parameters are known as the mixing weights (i.e. relative importance) of each of the $K$ regressive models $p(\mathbf{y}|\mathbf{x};\mathbf{\mu}^k, \mathbf{\Sigma}^k)$. The regressive function is then obtained by computing the expectation over the conditional density $\mathbb{E}\{p(\mathbf{y}|\mathbf{x})\}$ as follows:

\begin{equation} \label{eq:GMR} \begin{aligned} y = & f(\mathbf{x}) = \mathbb{E}\{p(\mathbf{y}|\mathbf{x})\} = \sum\limits_{k=1}^{K}\beta_k(\mathbf{x})\tilde{\mu}^k(\mathbf{x}) \end{aligned} \end{equation}

where

\begin{equation} \tilde{\mu}^k(\mathbf{x}) = \mu_{\mathbf{y}} ^k + \mathbf{\Sigma}_{\mathbf{yx}}^k(\mathbf{\Sigma}_{\mathbf{xx}}^k)^{-1}(\mathbf{x}-\mu_{\mathbf{x}}^k) \end{equation}

are the local regressive functions; each is a linear function of $\mathbf{x}$ whose slope is determined by $\mathbf{\Sigma}_{\mathbf{xx}}^k$ (the covariance of $\mathbf{x}$) and $\mathbf{\Sigma}_{\mathbf{yx}}^k$ (the covariance between $\mathbf{y}$ and $\mathbf{x}$). Our regressive function $\mathbf{y}=f(\mathbf{x})$ is thus a linear combination of $K$ regressive models. The advantage of GMR over other non-linear regressive methods is that we can also compute the uncertainty of the prediction (i.e. of the regressed output $\hat{\mathbf{y}}$) by computing the variance of the conditional density $Var\{p(\mathbf{y}|\mathbf{x})\}$ as follows:

\begin{equation} \label{eq:GMR-var} Var\{p(\mathbf{y}|\mathbf{x})\} = \sum\limits_{k=1}^{K} \beta_k(\mathbf{x}) \left( \tilde{\mu}^k(\mathbf{x})^2 + \tilde{\mathbf{\Sigma}}^k \right) - \left( \sum\limits_{k=1}^{K} \beta_k(\mathbf{x})\tilde{\mu}^k(\mathbf{x}) \right)^2 \end{equation}

where $\tilde{\mathbf{\Sigma}}^k$ represents the variance of the conditional density for each regressive function and is computed as follows:

\begin{equation} \tilde{\mathbf{\Sigma}}^k = \mathbf{\Sigma}_{\mathbf{yy}}^k - \mathbf{\Sigma}_{\mathbf{yx}}^k(\mathbf{\Sigma}_{\mathbf{xx}}^k)^{-1}\mathbf{\Sigma}_{\mathbf{xy}}^k \end{equation}

Note that $Var\{p(\mathbf{y}|\mathbf{x})\}$ represents the uncertainty of the prediction, NOT of the model. To represent the uncertainty of the model one should compute the likelihood of the joint density, i.e. $\mathcal{L}(\Theta|\mathbf{X},\mathbf{y}) = \prod\limits_{i=1}^{M}\sum\limits_{k=1}^{K}\alpha_k p(\mathbf{x},\mathbf{y}|\mathbf{\mu}^k, \mathbf{\Sigma}^k)$. Another advantage of GMR over most regression methods is that it can handle multi-dimensional inputs $\mathbf{x} \in \mathbb{R}^N$ as well as multi-dimensional outputs $\mathbf{y} \in \mathbb{R}^P$ with the same formulation presented here. For a thorough description of the derivation of this method, we recommend reading Hsi Guang Sung's PhD thesis and Cohn et al.'s seminal work on learning with statistical models.

Step 1: Estimate a Joint Density $p(\mathbf{x},\mathbf{y}|\Theta)$ of your Dataset {$\mathbf{X}$,$\mathbf{y}$} with a GMM. The first step in implementing GMR is to learn the joint density of the inputs $\mathbf{X} \in \mathbb{R}^{M\times N}$ and outputs $\mathbf{y} \in \mathbb{R}^{M\times P}$, where $N$ and $P$ correspond to the dimensionality of the input and output spaces, respectively. By loading one of the three different 1-D datasets provided in the script tp5_gmmapps_part3.m and selecting the values of $K$ and cov_type, you will fit a GMM to your regression dataset.

Step 2: Compute the Expectation and Variance of the Conditional $p(\mathbf{y}|\mathbf{x})$. Once the GMM parameters are estimated for the joint dataset {$\mathbf{X}$,$\mathbf{y}$}, we can estimate the regressive function by implementing the previously defined equations.

Task 6: Implement gmr.m function (4pts)¶

The function has the following signature:

In [ ]:
function [y_est, var_est] = gmr(Priors, Mu, Sigma, X, in, out)

The inputs of the function are the GMM parameters $\Theta = \{\mathbf{\alpha},\mathbf{\mu},\mathbf{\Sigma}\}$ estimated in the previous step, the input data $\mathbf{X}$ to regress on, and the variables in and out, which correspond to the indices of the input ($N$) and output ($P$) dimensions of the $D$-dimensional ($D = N+P$) joint dataset used to train the GMM. The outputs are y_est and var_est, the results of the previously defined equations. The in and out arguments are index vectors that can be used directly to index Mu and Sigma along the correct dimensions (cf. indexing from the exercise session).
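A minimal sketch of gmr is given below. It assumes the same parameter storage as in the earlier sketches (Priors 1xK, Mu DxK, Sigma DxDxK, one datapoint per column of X) and uses mvnpdf for the marginal densities; for multi-dimensional outputs it keeps only the diagonal of $\tilde{\mathbf{\Sigma}}^k$ in var_est, a simplification made by this sketch.

In [ ]:
% Sketch of Gaussian Mixture Regression. Assumes Priors is 1xK, Mu is DxK and
% Sigma is DxDxK with D = N + P, and one datapoint per column of X. For
% multi-dimensional outputs, var_est only keeps the diagonal of Sigma_tilde.
function [y_est, var_est] = gmr(Priors, Mu, Sigma, X, in, out)
    K = length(Priors);
    M = size(X, 2);
    P = length(out);

    % Mixing weights beta_k(x) for every datapoint (normalized over components)
    beta = zeros(K, M);
    for k = 1:K
        beta(k, :) = Priors(k) * mvnpdf(X', Mu(in, k)', Sigma(in, in, k))';
    end
    beta = beta ./ sum(beta, 1);

    % Expectation: weighted sum of the local linear regressive functions
    y_est    = zeros(P, M);
    mu_tilde = zeros(P, M, K);
    for k = 1:K
        mu_tilde(:, :, k) = Mu(out, k) + Sigma(out, in, k) / Sigma(in, in, k) ...
                            * (X - Mu(in, k));
        y_est = y_est + beta(k, :) .* mu_tilde(:, :, k);
    end

    % Variance of the conditional density (per output dimension)
    var_est = zeros(P, M);
    for k = 1:K
        Sigma_tilde = Sigma(out, out, k) ...
                      - Sigma(out, in, k) / Sigma(in, in, k) * Sigma(in, out, k);
        var_est = var_est + beta(k, :) .* (mu_tilde(:, :, k).^2 + diag(Sigma_tilde));
    end
    var_est = var_est - y_est.^2;
end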

The script tp5_gmmapps_part3.m will load a 1D regression dataset. There are three possible datasets, available by changing the value of dataset_type to either 1d-sine, 1d-linear, or 1d-sinc:

In [5]:
plot_regression(X, y, y_true, y_est, var_est, '1d-sine Dataset');
In [8]:
plot_regression(X, y, y_true, y_est, var_est, '1d-linear Dataset');
In [10]:
plot_regression(X, y, y_true, y_est, var_est, '1d-sinc Dataset');

Additionally, the same function can be applied to 2D datasets. The available 2D datasets are 2d-cossine and 2d-gmm. Change the value of dataset_type to load a different dataset:

In [16]:
plot_2Dregression(X, y, ftruth, f)
In [23]:
plot_2Dregression(X, y, ftruth, f)

From all of the previous results, it is evident that GMR is a powerful algorithm, capable of modeling regressive functions with different complexities and topologies. For all datasets except 2d-gmm (i.e. the highly nonlinear 2D dataset), GMR approximated a regressive function very close to the true one. The dataset from the latter case consists of a highly nonlinear output with substantial noise; for such datasets GMR gives a better estimate if we use more parameters. For example, setting $K=50$ yields:

In [26]:
plot_2Dregression(X, y, ftruth, f)

This model provides a noticeably better estimate than when $K=4$; however, the complexity of the model is extremely high (i.e. 340 parameters for a dataset of 500 datapoints). For this reason, we must apply a model selection procedure using cross-validation, as we do for classification problems.

Note that for this type of dataset, GMR might not be the most suitable algorithm. Local non-parametric methods would be more appropriate, such as LWR (Locally Weighted Regression), LWPR (Locally Weighted Projection Regression) or GPR (Gaussian Process Regression). If you are interested in learning about these methods, they are covered in the Advanced Machine Learning course offered at EPFL during the Spring semester.

Regression metrics¶

The metrics you will use during this practical to compare the performance of each regressor are the $Normalized\ Mean\ Square\ Error\ (NMSE)$ and the $Coefficient\ of\ Determination\ (R^2)$. Given a vector of predictions $\hat{\mathbf{y}}$ and a vector $\mathbf{y}$ of the corresponding observed values, you can compute the $Mean\ Square\ Error\ (MSE)$ of the predictor:

\begin{equation} MSE = \frac{1}{M} \sum_{i = 1}^{M} \left ( \hat{\mathbf{y}_{i}} - \mathbf{y}_{i} \right ) ^2 \end{equation}

The $Normalized\ Mean\ Square\ Error\ (NMSE)$ is simply the $MSE$ normalized by the variance of the observed values:

\begin{equation} NMSE = \frac{MSE}{VAR(\mathbf{y})} = \frac{\frac{1}{M}\sum_{i = 1}^{M} \left ( \hat{\mathbf{y}_{i}} - \mathbf{y}_{i} \right ) ^2}{\frac{1}{M-1}\sum_{i = 1}^{M}\left ( \mathbf{y}_{i}-\mu \right ) ^2}, \end{equation}

where $\mu$ is the mean of the observed values: $\mu=\frac{1}{M}\sum_{i=1}^{M}\mathbf{y}_i$.

The $Coefficient\ of\ Determination\ (R^2)$ is defined as:

\begin{equation} R^2 = \left ( \frac{\sum_{i = 1}^M \left ( \mathbf{y}_{i} - \overline{\mathbf{y}} \right ) \left ( \hat{\mathbf{y}_{i}} - \overline{\hat{\mathbf{y}}} \right )}{\sqrt{\sum_{i = 1}^M \left ( \mathbf{y}_{i} - \overline{\mathbf{y}} \right )^2} \sqrt{\sum_{i = 1}^M \left ( \hat{\mathbf{y}_{i}} - \overline{\hat{\mathbf{y}}} \right )^2}} \right ) ^2 = \frac{\left(\sum_{i = 1}^M \left ( \mathbf{y}_{i} - \overline{\mathbf{y}} \right ) \left ( \hat{\mathbf{y}_{i}} - \overline{\hat{\mathbf{y}}} \right ) \right ) ^2}{\sum_{i = 1}^M \left ( \mathbf{y}_{i} - \overline{\mathbf{y}} \right )^2 \sum_{i = 1}^M \left ( \hat{\mathbf{y}_{i}} - \overline{\hat{\mathbf{y}}} \right )^2} \end{equation}

where $\overline{\mathbf{y}}$ and $\overline{\hat{\mathbf{y}}}$ denote the averages of the observed and predicted values of $\mathbf{y}$, respectively.

Task 7: Implement regression_metrics.m function (3pts)¶

The function has the following signature:

In [ ]:
function [MSE, NMSE, Rsquared] = regression_metrics( yest, y )
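As a sketch, the three formulas above translate almost directly into MATLAB, assuming yest and y are stored with one sample per column (i.e. 1xM row vectors for 1-D outputs):

In [ ]:
% Sketch of the regression metrics, assuming yest and y are 1xM (or PxM)
% arrays with one sample per column; the formulas follow the equations above.
function [MSE, NMSE, Rsquared] = regression_metrics(yest, y)
    M = size(y, 2);

    % Mean Squared Error
    MSE = sum(sum((yest - y).^2)) / M;

    % Normalized MSE: MSE divided by the (unbiased) variance of the observations
    VAR  = sum(sum((y - mean(y, 2)).^2)) / (M - 1);
    NMSE = MSE / VAR;

    % Coefficient of determination: squared correlation between y and yest
    yc  = y    - mean(y, 2);
    yec = yest - mean(yest, 2);
    Rsquared = (sum(sum(yc .* yec)))^2 / (sum(sum(yc.^2)) * sum(sum(yec.^2)));
end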

Task 8: Implement cross_validation_gmr.m function (2pts)¶

The function has the following signature:

In [ ]:
function [metrics] = cross_validation_gmr( X, y, F_fold, valid_ratio, k_range, params )

The cross_validation_gmr.m function computes the five introduced metrics (i.e. AIC, BIC, MSE, NMSE and $R^2$) for a range of $K$ and stores the mean and standard deviation of those metrics in the structure metrics. You will have to implement an F-fold cross-validation, where the data is first split into test and train parts as given by valid_ratio, and the mean and standard deviation of each metric are calculated on the test data over the F folds. Be careful to use the function split_regression_data instead of the usual split_data function: split_data makes some assumptions on the input data which are not valid in this case.
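The overall structure of the function might look like the sketch below. The calling conventions of split_regression_data and gmmEM, the AIC/BIC helper (here the hypothetical gmm_metrics) and the field names of metrics are assumptions to be adapted to your own implementations.

In [ ]:
% Structural sketch of F-fold cross-validation for GMR. The signatures of
% split_regression_data and gmmEM are assumed, gmm_metrics is a hypothetical
% AIC/BIC helper, and the metrics field names are placeholders.
function [metrics] = cross_validation_gmr(X, y, F_fold, valid_ratio, k_range, params)
    nK = length(k_range);
    [mse, nmse, r2, aic, bic] = deal(zeros(F_fold, nK));

    for f = 1:F_fold
        % New random train/validation split for every fold (assumed signature)
        [Xtr, ytr, Xva, yva] = split_regression_data(X, y, valid_ratio);
        Dtr = [Xtr; ytr];                         % joint input-output training data
        in  = 1:size(Xtr, 1);
        out = size(Xtr, 1) + (1:size(ytr, 1));

        for j = 1:nK
            [Priors, Mu, Sigma] = gmmEM(Dtr, k_range(j), params.cov_type);  % assumed signature
            [aic(f, j), bic(f, j)] = gmm_metrics(Dtr, Priors, Mu, Sigma);   % hypothetical helper
            [y_est, ~] = gmr(Priors, Mu, Sigma, Xva, in, out);
            [mse(f, j), nmse(f, j), r2(f, j)] = regression_metrics(y_est, yva);
        end
    end

    % Mean and standard deviation of each metric over the folds (field names assumed)
    metrics.mean_MSE  = mean(mse);    metrics.std_MSE  = std(mse);
    metrics.mean_NMSE = mean(nmse);   metrics.std_NMSE = std(nmse);
    metrics.mean_R2   = mean(r2);     metrics.std_R2   = std(r2);
    metrics.mean_AIC  = mean(aic);    metrics.std_AIC  = std(aic);
    metrics.mean_BIC  = mean(bic);    metrics.std_BIC  = std(bic);
end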

The script tp5_gmmapps_part3.m loads the 1D or 2D datasets previously introduced and runs the cross-validation on the desired one. For the 1d-sine and 1d-sinc datasets it should produce results similar to the following. Note that the axis for the $MSE$ metric is on the left-hand side of the plots, while the axis for the $NMSE$ and $R^2$ metrics is on the right-hand side.

In [30]:
plot_gmr_cross_validation(metrics, k_range)
In [32]:
plot_gmr_cross_validation(metrics, k_range)

According to the AIC/BIC metrics, one could infer that for the 1d-sine dataset K=2 is the optimal model, while for the 1d-sinc dataset one could choose either K=4 or K=6. Concerning the regression metrics, for the 1d-sine dataset K=2 is also the number of components required to effectively model the data for regression, which corresponds to the optimal K from the AIC/BIC analysis. Regarding the 1d-sinc dataset, it can be seen that K=4 is the optimal value for regression, disambiguating the results from AIC/BIC.

Note: The graphs above were generated for a range of k values incremented by 1. Due to time constraints, in case your code is not optimized, the assignment provides a range of k values incremented by 2. This should not change the general look of the output graphs much. You can also change it on line 85 of the script tp5_gmmapps_part3.m.
