Machine Learning Programming¶

Assignment 4: Gaussian Mixture Models (GMM)¶

This series of practicals focuses on the development of machine learning algorithms introduced in the Applied Machine Learning course. In this practical, you will code the Gaussian Mixture Model (GMM) algorithm in Matlab and its applications.

Deadline: November 26, 2024 @ 8pm

Assignments must be turned in by the deadline. 1/6 point will be removed for each day late (a day late starts one hour after the deadline).

Procedure: From the course Moodle webpage, you must download and extract the .zip file named TP4-GMM-Assignment.zip. The folder contains the following elements:

  • tp4_gmm_part1.m
  • tp4_gmm_part2.m
  • functions
    • part1
    • part2
  • plot_functions
  • data
  • evaluation_functions
  • utils

Only the elements in blue need to be filled out. The rest are helper functions or data to be imported. The folder evaluation_functions contains encrypted functions that evaluate the code you write and provide feedback on the results. In this assignment, only the evaluation functions for the GMM functions in Part 1, i.e. the main algorithm, are provided. For the other functions you should compare your results with the provided graphs.

You must submit an archive as a .zip file with the following naming convention TP4-GMM-SCIPER.zip, where SCIPER needs to be replaced by your own sciper number. You must submit only the files/folders in blue. DO NOT INCLUDE the folders plot_functions, data, utils and evaluation_functions.

We take plagiarism very seriously. Submitting work that is not yours will result in a report submitted to the competent EPFL authorities.

Part 1: Gaussian Mixture Models (GMM)¶

A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of $K$ Gaussian densities. GMMs are commonly used as a parametric model of the probability distribution of a dataset $\mathbf{X} = \{\mathbf{x}^1,\dots,\mathbf{x}^M\}$ where $\mathbf{x}^i \in \mathbb{R}^N$. They are popular due to their capability of representing multi-modal sample distributions. The probability density function (pdf) of a $K$-component GMM is of the form,

\begin{equation} \label{eq:gmm} p(\mathbf{x}|\Theta) = \sum\limits_{k=1}^{K} \alpha_k p(\mathbf{x}| \mathbf{\mu}^k, \mathbf{\Sigma}^k) \end{equation}

where $p(\mathbf{x}| \mathbf{\mu}^k, \mathbf{\Sigma}^k)$ is the multivariate Gaussian pdf with mean $\mathbf{\mu}^k$ and covariance $\mathbf{\Sigma}^k$

\begin{equation} \begin{aligned} p(\mathbf{x}|\mu^k,\Sigma^k) & = \frac{1}{(2\pi)^{N/2}|\Sigma^k|^{1/2}} \exp \left\lbrace -\frac{1}{2}(\mathbf{x} - \mu^k)^{T} ( \Sigma^k )^{-1} (\mathbf{x} - \mu^k) \right\rbrace\\ \end{aligned}. \end{equation}

$\Theta= \{\theta^1, \dots, \theta^K\}$ is the complete set of parameters $\theta^k= \{\alpha_k, \mathbf{\mu}^k, \mathbf{\Sigma}^k\}$, where $\alpha_k$ are the priors (or mixing weights) of each Gaussian component, which satisfy the constraint $\sum_{k=1}^{K}\alpha_k = 1$.

GMMs are widely used in many areas of engineering; due to their modeling structure and flexibility they can be used for clustering, classification and regression purposes (we will cover these applications in TP5-GMM-Applications). The parameters $\mathbf{\Theta}$ can be estimated from training data either with a Maximum Likelihood parameter estimation approach through the iterative Expectation-Maximization (EM) algorithm or with Maximum A Posteriori (MAP) estimation.

In this assignment, we will implement the EM-algorithm for GMM parameter learning, following the steps covered in the Continuous Distributions slides 48-49 from the Applied Machine Learning course. For the derivation of the EM steps, refer to annexes provided in the course EM Annexes.

Maximum Likelihood (ML) Parameter Estimation for GMMs¶

The aim of ML estimation is to find the model parameters $\mathbf{\Theta}$ which maximize the likelihood of the GMM given the training dataset $\mathbf{X}$. For a dataset of $M$ training data points, assuming the data points are i.i.d. (independent and identically distributed), the GMM likelihood $\mathcal{L} (\Theta|\mathbf{X}) = p(\mathbf{X}|\Theta)$ is,

\begin{equation} p(\mathbf{X}|\Theta) = \prod_{i=1}^{M} p(\mathbf{x}^i|\Theta) = \prod_{i=1}^{M} \sum\limits_{k=1}^{K} \alpha_k p(\mathbf{x}^i | \mathbf{\mu}^k, \mathbf{\Sigma}^k). \end{equation}

Unfortunately, this equation is a non-linear function of the parameters $\Theta$ and direct maximization is not possible in closed form. However, an ML estimate can be obtained iteratively using a special case of the expectation-maximization (EM) algorithm, which seeks an optimum of the likelihood, or equivalently of the $\log$ likelihood

\begin{equation} \underset{\Theta}{\text{max}} \log\mathcal{L}(\Theta|\mathbf{X}) = \underset{\Theta}{\text{max}} \log p(\mathbf{X}|\Theta) \end{equation}

through the following steps:

  • Initialization step: Initialize priors $\mathbf{\alpha} = \{\alpha_1, \dots, \alpha_K \}$, means $\mathbf{\mu} = \{\mathbf{\mu}^1, \dots, \mathbf{\mu}^K \}$ and Covariance matrices $\mathbf{\Sigma} = \{\mathbf{\Sigma}^1, \dots, \mathbf{\Sigma}^K \}$.
  • Expectation Step: For each Gaussian $k \in \{1,\dots,K\}$, compute the probability that it is responsible for each point $\mathbf{x}^i$ in the dataset.
  • Maximization Step: Re-estimate the priors $\mathbf{\alpha} = \{\alpha_1, \dots, \alpha_K \}$, means $\mathbf{\mu} = \{\mathbf{\mu}^1, \dots, \mathbf{\mu}^K \}$ and Covariance matrices $\mathbf{\Sigma} = \{\mathbf{\Sigma}^1, \dots, \mathbf{\Sigma}^K \}$ using the membership probabilities computed in the Expectation Step.
  • Go back to step 2 and repeat until the $\log \mathcal{L}(\Theta|\mathbf{X})$ stabilizes.

Gaussian PDF and GMM Likelihood¶

We will begin by implementing two functions that are used throughout the EM and model-fitting algorithms:

  • gaussPDF.m: The pdf of a multivariate Gaussian given a set of points $\{\mathbf{x}^1,\dots, \mathbf{x}^M\}$, a mean $\mathbf{\mu}$ and a Covariance matrix $\mathbf{\Sigma}$
  • gmmLogLik.m: The log likelihood of the parameters $\Theta$ of a learnt GMM for the given dataset $\mathbf{X}$

TASK 1: Implement gaussPDF.m function (2pts)¶

The function has the following signature:

In [ ]:
function [ prob ] = gaussPDF(X, Mu, Sigma)

To evaluate this function you should run the tp4_gmm_part1.m script. This will load a 2D Gaussian dataset. The script allows you to select different datasets with the lines:

In [ ]:
% choose a dataset
dataset_list = {'1a', '1b', '1c'};
dataset = dataset_list{1}; % CHANGE HERE TO SWITCH DATASET

where you can change the index to 2 if you want to see dataset 1b. Be aware that the gaussPDF function is only evaluated with dataset 1a.
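As a reference, here is a minimal sketch of what gaussPDF could look like, assuming $\mathbf{X}$ is stored as an $N\times M$ matrix with one datapoint per column (as implied by the covariance formulas later in this assignment); treat it as a starting point, not the required implementation:

function [ prob ] = gaussPDF(X, Mu, Sigma)
% X:     N x M matrix, one datapoint per column (assumed layout)
% Mu:    N x 1 mean vector
% Sigma: N x N covariance matrix
% prob:  1 x M vector of pdf values p(x | Mu, Sigma)
[N, M] = size(X);
Xc   = X - repmat(Mu, 1, M);              % center the data
expo = sum((Sigma \ Xc) .* Xc, 1);        % Mahalanobis terms (x-mu)' Sigma^-1 (x-mu)
prob = exp(-0.5 * expo) ./ sqrt((2*pi)^N * (abs(det(Sigma)) + realmin));
end

If the Statistics and Machine Learning Toolbox is available, you can sanity-check the output against mvnpdf(X', Mu', Sigma)'.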

TASK 2: Implement gmmLogLik.m function (2pts)¶

The function has the following signature:

In [ ]:
function [ logl ] = gmmLogLik(X, Priors, Mu, Sigma)

To compute $\log\mathcal{L}(\Theta|\mathbf{X})$ we can reinterpret the log likelihood as:

\begin{equation} \begin{aligned} \log p(\mathbf{X}|\Theta) = & \log \Big( \prod_{i=1}^{M} p(\mathbf{x}^i|\Theta) \Big) \\ = & \sum_{i=1}^{M} \log \Big( p(\mathbf{x}^i|\Theta) \Big) \\ = &\sum_{i=1}^{M} \log \Big( \sum\limits_{k=1}^{K} \alpha^k p(\mathbf{x}^i | \mathbf{\mu}^k, \mathbf{\Sigma}^k)\Big). \end{aligned} \end{equation}

You can begin by computing the likelihood of each point $\mathbf{x}^i$ for each set of parameters $\{\mathbf{\mu}^k, \mathbf{\Sigma}^k\}$. Then compute the inner sum weighted by the priors $\alpha_k$, and finally sum the $\log$-ed probabilities over the datapoints.

This function can be evaluated with either dataset 1a or 1b.
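Following the hint above, a minimal sketch of gmmLogLik could look as follows, assuming Priors is a $1\times K$ vector, Mu an $N\times K$ matrix and Sigma an $N\times N\times K$ array (these layouts are assumptions; adapt them to the handout conventions):

function [ logl ] = gmmLogLik(X, Priors, Mu, Sigma)
% X:      N x M datapoints (columns)
% Priors: 1 x K mixing weights alpha_k
% Mu:     N x K means, one column per component
% Sigma:  N x N x K covariance matrices
% logl:   scalar log-likelihood of the GMM on X
[~, M] = size(X);
K = length(Priors);
L = zeros(K, M);                          % weighted component likelihoods
for k = 1:K
    L(k, :) = Priors(k) * gaussPDF(X, Mu(:, k), Sigma(:, :, k));
end
logl = sum(log(sum(L, 1) + realmin));     % sum_i log( sum_k alpha_k p(x^i | k) )
end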

Type of Covariance Matrices¶

Apart from the number of clusters $K$, another flexible design decision in GMMs is the type of Covariance matrix $\mathbf{\Sigma}$ to use for each Gaussian component. Three types of $\mathbf{\Sigma}$ can be used:

Full Covariance matrix:¶

The full Covariance matrix is computed as follows:

\begin{equation} \mathbf{\Sigma}_{full} = \frac{1}{M-1} \mathbf{X}\mathbf{X}^{T}, \end{equation}

where $X$ is the zero-mean data ($X \rightarrow X - \bar{X}$), where $\bar{X}$ is the mean of the data. For a 2D dataset the form of $\mathbf{\Sigma}$ is:

\begin{equation} \mathbf{\Sigma}_{full} = \begin{bmatrix} \sigma_{x}^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_{y}^2 \end{bmatrix}, \quad \mathbf{\Sigma} \in \mathbb{R}^{2\times 2}, \end{equation}

where $\sigma_{xy}$ is the covariance between the two dimensions.

Diagonal Covariance matrix:¶

A diagonal Covariance matrix only considers the variance along each dimension; i.e. the off-diagonal elements are set to zero. For a 2D dataset it has the following form:

\begin{equation} \mathbf{\Sigma}_{diag} = \begin{bmatrix} \sigma_{x}^2 & 0 \\ 0 & \sigma_{y}^2 \end{bmatrix}, \quad \mathbf{\Sigma} \in \mathbb{R}^{2\times 2} \end{equation}

This can be computed by taking the full Covariance matrix and keeping only its diagonal values.

Isotropic Covariance matrix:¶

Otherwise known as the circular Covariance matrix, it depends solely on the squared distance $||\mathbf{x}^i- \mathbf{\bar{x}}||^2$ between the original (non-zero-mean) points and their mean. Hence, one must compute the isotropic variance as follows:

\begin{equation} \label{eq:iso} \mathbf{\sigma}_{iso}^2 = \frac{1}{NM} \sum\limits_{i=1}^{M} ||\mathbf{x}^i- \mathbf{\bar{x}}||^2. \end{equation}

Then replicate it in a diagonal matrix, to have the following form: \begin{equation} \mathbf{\Sigma}_{iso} = \begin{bmatrix} \mathbf{\sigma}_{iso}^2 & 0 \\ 0 & \mathbf{\sigma}_{iso}^2 \end{bmatrix}, \quad \mathbf{\Sigma} \in \mathbb{R}^{2\times 2}. \end{equation}

TASK 3: Implement compute_covariance.m function (3pts)¶

The function has the following signature:

In [ ]:
function [ Sigma ] = compute_covariance( X, X_bar, type )

with input $\mathbf{X}$, $\mathbf{\bar{X}}$ and type that can take the values full, diag, or iso. The output is $\mathbf{\Sigma} \in \mathbb{R}^{N\times N}$.

Within this code block of tp4_gmm_part1.m we plot 3 figures corresponding to the contours of the first 3 standard deviations of the computed Covariance matrices, for dataset 1a:

In [7]:
visualize_covariances(X);
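A minimal sketch of compute_covariance, under the same assumed $N\times M$ data layout, could be structured as a switch over the three covariance types:

function [ Sigma ] = compute_covariance( X, X_bar, type )
% X:     N x M datapoints (columns); X_bar: N x 1 mean of the data
% type:  'full' | 'diag' | 'iso'
[N, M] = size(X);
Xc = X - repmat(X_bar, 1, M);                  % zero-mean data
switch type
    case 'full'
        Sigma = (Xc * Xc') / (M - 1);
    case 'diag'
        Sigma = diag(diag((Xc * Xc') / (M - 1)));
    case 'iso'
        var_iso = sum(sum(Xc.^2)) / (N * M);   % (1/NM) sum_i ||x^i - x_bar||^2
        Sigma = var_iso * eye(N);
    otherwise
        error('Unknown covariance type: %s', type);
end
end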

Expectation-Maximization for Estimating GMM Parameters¶

Step 1. Initialize Priors $\alpha^{(0)}$, Centroids $\mathbf{\mu}^{(0)}$ and Covariance Matrices $\mathbf{\Sigma}^{(0)}$¶

The EM algorithm for GMM begins with an initialization step. Here we shall initialize for iteration $t=0$ the following:

  • The set of priors $\alpha^{(0)} = \{\alpha_1^{(0)}, \dots, \alpha_K^{(0)}\}$ to uniform probabilities; i.e. $\alpha_1^{(0)}=\dots=\alpha_K^{(0)}=\frac{1}{K}$
  • The means $\mathbf{\mu}^{(0)} = \{\mathbf{\mu}^{1(0)}, \dots, \mathbf{\mu}^{K(0)} \}$ shall be initialized with the $K$-Means algorithm.
  • Given the means $\mu^{(0)}$ and labels computed from the previous step, one can initialize the set of corresponding Covariance matrices $\mathbf{\Sigma}^{(0)} = \{\mathbf{\Sigma}^{1(0)}, \dots, \mathbf{\Sigma}^{K(0)} \}$ choosing the datapoints assigned to each $\mathbf{\mu}^{k(0)}$ and using your compute_covariance function.

TASK 4: Implement gmmInit.m function (2pts)¶

The function has the following signature:

In [ ]:
function [ Priors0, Mu0, Sigma0, labels0 ] = gmmInit(X, params)

with input $\mathbf{X}$ and $params$, a structure that contains all the hyperparameters for both the k-means initialization and the GMM algorithm, i.e. $k$ the number of clusters, $cov\_type$ the type of covariance matrix among full, iso and diag, $d\_type$ the distance metric and $max\_iter\_init$ the maximum number of iterations for the initialization. The output should be $\alpha^{(0)} = \{\alpha_1^{(0)}, \dots, \alpha_K^{(0)}\}$, $\mathbf{\mu}^{(0)} = \{\mathbf{\mu}^{1(0)}, \dots, \mathbf{\mu}^{K(0)} \}$ and $\mathbf{\Sigma}^{(0)} = \{\mathbf{\Sigma}^{1(0)}, \dots, \mathbf{\Sigma}^{K(0)} \}$. The functions needed for the k-means algorithm are given in the utils folder. A sketch of a possible structure is given below.
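The following sketch illustrates the structure of gmmInit. It uses MATLAB's built-in kmeans (Statistics and Machine Learning Toolbox) as a stand-in for the k-means helpers provided in the utils folder, and it ignores $d\_type$; in your submission you should use the provided utility functions instead:

function [ Priors0, Mu0, Sigma0, labels0 ] = gmmInit(X, params)
% X: N x M datapoints (columns); params: hyperparameter structure
[N, ~] = size(X);
K = params.k;
% k-means initialization (built-in kmeans used here only as a placeholder)
[idx, C] = kmeans(X', K, 'MaxIter', params.max_iter_init);
labels0 = idx';                      % 1 x M cluster assignments
Mu0     = C';                        % N x K initial centroids
Priors0 = ones(1, K) / K;            % uniform mixing weights
Sigma0  = zeros(N, N, K);
for k = 1:K
    Xk = X(:, labels0 == k);         % points assigned to component k
    Sigma0(:, :, k) = compute_covariance(Xk, Mu0(:, k), params.cov_type);
end
end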

In the main script, for dataset 1b, by modifying K and cov_type you can visualize the different initializations. Please note that, due to the randomness of the initialization, your graphs might differ slightly.

In [9]:
params.cov_type = 'full'; params.k = 4;
[ Priors0, Mu0, Sigma0, labels0 ] = gmmInit(X, params);
plot_gmm(X, Priors0, Mu0, Sigma0, params, 'Initial Estimates');
In [10]:
params.cov_type = 'full'; params.k = 5;
[ Priors0, Mu0, Sigma0, labels0 ] = gmmInit(X, params);
plot_gmm(X, Priors0, Mu0, Sigma0, params, 'Initial Estimates');
In [11]:
params.cov_type = 'diag'; params.k = 4;
[ Priors0, Mu0, Sigma0, labels0 ] = gmmInit(X, params);
plot_gmm(X, Priors0, Mu0, Sigma0, params, 'Initial Estimates');
In [12]:
params.cov_type = 'diag'; params.k = 5;
[ Priors0, Mu0, Sigma0, labels0 ] = gmmInit(X, params);
plot_gmm(X, Priors0, Mu0, Sigma0, params, 'Initial Estimates');

Step 2. Expectation Step: Membership probabilities¶

At each iteration $t$, estimate, for each Gaussian $k$, the probability that this Gaussian is responsible for generating each point of the dataset. The a posteriori probability of the $k$-th component is given by

\begin{equation} p(k|\mathbf{x}^i, \Theta^{(t)})= \frac{\alpha_k^{(t)}\, p(\mathbf{x}^i | \mathbf{\mu}^{k(t)},\mathbf{\Sigma}^{k(t)} )}{\sum_{j=1}^K \alpha_j^{(t)} p (\mathbf{x}^i| \mathbf{\mu}^{j(t)}, \mathbf{\Sigma}^{j(t)})}. \end{equation}

These probabilities are the output of the expectation Step. They must be computed for each $k \in \{1,\dots,K\}$ for all data points $i\in \{1,\dots,M\}$.

TASK 5.1: Implement expectation_step.m function (2pts)¶

The function has the following signature:

In [ ]:
function [Pk_x] = expectation_step(X, Priors, Mu, Sigma, params)

It takes as input $\mathbf{X}$, the current priors $\alpha = \{\alpha_1, \dots, \alpha_K\}$, centroids $\mathbf{\mu} = \{\mathbf{\mu}^{1}, \dots, \mathbf{\mu}^{K} \}$, and covariance matrices $\mathbf{\Sigma} = \{\mathbf{\Sigma}^{1}, \dots, \mathbf{\Sigma}^{K} \}$, and the hyper-parameter structure params. It outputs the posterior probability $p(k|\mathbf{x}^i, \Theta^{(t)})$ of each Gaussian $k$.
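A minimal sketch of the E-step, reusing gaussPDF and the same assumed array layouts, could look like this (the component count is taken from params.k, which is an assumed field name):

function [Pk_x] = expectation_step(X, Priors, Mu, Sigma, params)
% Pk_x: K x M matrix of posteriors p(k | x^i, Theta)
[~, M] = size(X);
K = params.k;
Pxk = zeros(K, M);                        % weighted likelihoods alpha_k p(x^i | k)
for k = 1:K
    Pxk(k, :) = Priors(k) * gaussPDF(X, Mu(:, k), Sigma(:, :, k));
end
Pk_x = Pxk ./ repmat(sum(Pxk, 1) + realmin, K, 1);   % normalize over components
end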

Step 3. Maximization Step: Update Priors $\mathbf{\alpha}$, Means $\mathbf{\mu}$ and Covariances $\mathbf{\Sigma}$¶

In order to maximize the log-likelihood of the current estimate we update the priors $\alpha^{(t+1)} = \{\alpha_1^{(t+1)}, \dots, \alpha_K^{(t+1)}\}$ with the following equation:

\begin{equation} \alpha_k^{(t+1)} = \frac{1}{M}\sum\limits_{i=1}^{M}p(k|\mathbf{x}^i, \Theta^{(t)}) \end{equation}

The means are updated by the following equation:

\begin{equation} \mathbf{\mu}^{k(t+1)} = \frac{\sum\limits_{i=1}^{M} p(k|\mathbf{x}^i, \Theta^{(t)}) \mathbf{x}^i}{\sum\limits_{i=1}^{M} p(k|\mathbf{x}^i, \Theta^{(t)})}. \end{equation}

Finally, the Covariance matrices for each $k$-th component are computed with the following equation:

\begin{equation} \mathbf{\Sigma}^{k(t+1)} = \frac{\sum\limits_{i=1}^{M} p(k|\mathbf{x}^i, \Theta^{(t)}) (\mathbf{x}^i - \mathbf{\mu}^{k(t+1)})(\mathbf{x}^i - \mathbf{\mu}^{k(t+1)})^T}{\sum\limits_{i=1}^{M} p(k|\mathbf{x}^i, \Theta^{(t)})}. \end{equation}

For full Covariance matrices the previous equation can be used directly to update the $\mathbf{\Sigma}^{(t+1)}$. For diagonal Covariance matrices, you can compute the full version and then simply extract the diagonal elements. Now, for isotropic matrices, one must use the following update equation:

\begin{equation} \mathbf{\sigma}_{iso}^{2\,k(t+1)} = \frac{\sum\limits_{i=1}^{M} p(k|\mathbf{x}^i, \Theta^{(t)}) ||\mathbf{x}^i - \mathbf{\mu}^{k(t+1)}||^2}{N\sum\limits_{i=1}^{M} p(k|\mathbf{x}^i, \Theta^{(t)})}, \end{equation}

which is then replicated on the diagonal to form $\mathbf{\Sigma}_{iso}^{k(t+1)}$, as in the initialization.

During the EM iterations, certain Covariance matrices can become ill-conditioned, i.e. close to singular. In likelihood terms, this means that the likelihood is diverging towards infinity. This happens for various reasons; namely (i) there might be more parameters than datapoints or (ii) the variables are highly correlated. Hence, after estimating the $\mathbf{\Sigma}$'s we must add a tiny variance to avoid this numerical instability; this can be done by adding a very small number $\epsilon=10^{-5}$ to the diagonal values of each $\mathbf{\Sigma}^k$.

TASK 5.2: Implement maximization_step.m function (2pts)¶

This function has the following signature:

In [ ]:
function [Priors,Mu,Sigma] = maximization_step(X, Pk_x, params)

It takes as input $\mathbf{X}$, previously calculated posterior probability $p(k|\mathbf{x}^i, \Theta^{(t)})$ and the hyper-parameter structure params. It outputs the updated priors $\alpha = \{\alpha_1, \dots, \alpha_K\}$, centroids $\mathbf{\mu} = \{\mathbf{\mu}^{1}, \dots, \mathbf{\mu}^{K} \}$, and covariance matrices $\mathbf{\Sigma} = \{\mathbf{\Sigma}^{1}, \dots, \mathbf{\Sigma}^{K} \}$.
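A minimal sketch of the M-step, implementing the three update equations above plus the $\epsilon$ regularization (same assumed layouts and params fields as before):

function [Priors, Mu, Sigma] = maximization_step(X, Pk_x, params)
% X: N x M datapoints; Pk_x: K x M posteriors from the E-step
[N, M] = size(X);
K  = params.k;
Nk = sum(Pk_x, 2);                            % soft counts per component
Priors = (Nk / M)';                           % 1 x K updated mixing weights
Mu = (X * Pk_x') ./ repmat(Nk', N, 1);        % N x K weighted means
Sigma = zeros(N, N, K);
epsilon = 1e-5;                               % regularization against singularity
for k = 1:K
    Xc = X - repmat(Mu(:, k), 1, M);
    S  = (Xc .* repmat(Pk_x(k, :), N, 1)) * Xc' / Nk(k);   % full-covariance update
    switch params.cov_type
        case 'full'
            Sigma(:, :, k) = S;
        case 'diag'
            Sigma(:, :, k) = diag(diag(S));
        case 'iso'
            Sigma(:, :, k) = (sum(Pk_x(k, :) .* sum(Xc.^2, 1)) / (N * Nk(k))) * eye(N);
    end
    Sigma(:, :, k) = Sigma(:, :, k) + epsilon * eye(N);     % avoid ill-conditioning
end
end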

Putting it all together¶

The final task is to implement a function that iterates over Step 2 (E-Step) and Step 3 (M-Step) until the log-likelihood of the current parameters $\Theta$ stabilizes, given the initial $\alpha^{(0)}$, $\mathbf{\mu}^{(0)}$ and $\mathbf{\Sigma}^{(0)}$ computed by the gmmInit function.

TASK 5.3: Implement gmmEM.m function (2pts)¶

This function has the following signature:

In [ ]:
function [  Priors, Mu, Sigma, iter ] = my_gmmEM(X, params)

It takes as inputs $\mathbf{X}$, and the hyper-parameters structure params. The outputs should be the final priors $\alpha = \{\alpha_1, \dots, \alpha_K\}$, centroids $\mathbf{\mu} = \{\mathbf{\mu}^{1}, \dots, \mathbf{\mu}^{K} \}$, Covariance matrices $\mathbf{\Sigma} = \{\mathbf{\Sigma}^{1}, \dots, \mathbf{\Sigma}^{K} \}$ and iter, which is the number of iterations your implementation took to converge to the given solution.
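Putting the previous functions together, the EM loop could be sketched as follows; the tolerance and maximum number of iterations used here are assumptions, not values prescribed by the handout:

function [ Priors, Mu, Sigma, iter ] = my_gmmEM(X, params)
% Iterate E- and M-steps until the log-likelihood stabilizes.
[Priors, Mu, Sigma] = gmmInit(X, params);
logl_old = gmmLogLik(X, Priors, Mu, Sigma);
tol = 1e-6; max_iter = 500;                   % assumed stopping criteria
for iter = 1:max_iter
    Pk_x = expectation_step(X, Priors, Mu, Sigma, params);      % E-step
    [Priors, Mu, Sigma] = maximization_step(X, Pk_x, params);   % M-step
    logl = gmmLogLik(X, Priors, Mu, Sigma);
    if abs(logl - logl_old) < tol, break; end                   % convergence check
    logl_old = logl;
end
end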

To evaluate this function you should initially load the 1b dataset. By running the script you can visualize the initial $\mathbf{\mu}$'s and $\mathbf{\Sigma}$'s, and the final $\alpha$'s, $\mathbf{\mu}$'s and $\mathbf{\Sigma}$'s.

In [7]:
plot_gmm(X, Priors, Mu, Sigma, params, 'Final Estimates for EM-GMM');
In [9]:
ml_plot_gmm_pdf(X, Priors, Mu, Sigma)

Part 2: Fitting Gaussian Mixture Models¶

From the GMM equations, one can observe that the set of model parameters is $\{K, (\alpha_k, \mathbf{\mu}^k, \mathbf{\Sigma}^k)_{k=1}^K\}$. As in k-means, we can use the AIC and BIC metrics for model selection. To recall:

  • AIC: The AIC metric is a maximum-likelihood measure that penalizes for model complexity as follows:
\begin{equation} AIC = -2\ln \mathcal{L} + 2B \end{equation}

where $\mathcal{L}$ is the likelihood of the model and $B$ is the total number of model parameters.

  • BIC: The BIC metric goes even further and also penalizes for the number of datapoints:
\begin{equation} BIC = -2\ln \mathcal{L} + \ln(M)B \end{equation}

where $M$ is the total number of datapoints.

For GMMs the computation of the total number of model parameters $B$ is more involved than for k-means. For a dataset with $\mathbf{x} \in \mathbb{R}^N$, the total number of parameters to learn for a $K$-component GMM with full Covariance matrices is:

\begin{equation} B_{full} = K \times (1 + 2N + \frac{1}{2}N \times (N - 1))-1 \end{equation}

with the $-1$ corresponding to the priors constraint $\sum_{k=1}^{K}\alpha_k = 1$. For the case of diagonal Covariance matrices,

\begin{equation} B_{diag} = K \times (1 + N + N )-1 \end{equation}

and for isotropic Covariance matrices the number of parameters is computed as follows:

\begin{equation} B_{iso} = K \times (1 + N + 1) - 1 \end{equation}
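For example, for a 2D dataset ($N=2$) and $K=3$ components with full Covariance matrices, this gives $B_{full} = 3 \times (1 + 4 + 1) - 1 = 17$ free parameters.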

TASK 6: Implement gmm_metrics.m function (3pts)¶

The function has the following signature:

In [ ]:
function [AIC, BIC] =  gmm_metrics(X, Priors, Mu, Sigma, cov_type)

It takes as input the results of the gmmEM function, i.e. the fitted GMM. cov_type is a string that stipulates which covariance matrix type was used to compute the GMM. Note that there are no evaluation functions provided for this task.
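A minimal sketch of gmm_metrics, directly transcribing the formulas above and reusing gmmLogLik:

function [AIC, BIC] = gmm_metrics(X, Priors, Mu, Sigma, cov_type)
% Compute AIC/BIC for a fitted GMM on dataset X (N x M, columns are points)
[N, M] = size(X);
K = length(Priors);
logl = gmmLogLik(X, Priors, Mu, Sigma);
switch cov_type
    case 'full', B = K * (1 + 2*N + 0.5*N*(N - 1)) - 1;
    case 'diag', B = K * (1 + N + N) - 1;
    case 'iso',  B = K * (1 + N + 1) - 1;
    otherwise,   error('Unknown covariance type: %s', cov_type);
end
AIC = -2 * logl + 2 * B;
BIC = -2 * logl + log(M) * B;
end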

Choosing the optimal $K$¶

For choosing the optimal $K$ that best describes our dataset with a GMM, we follow the same procedure as in k-means. We estimate the GMM parameters for a range of $K$ values, 10 times each. For each $K$ we select the best run which, in likelihood terms, is the run with the maximum likelihood; in other words, we choose the minimum AIC/BIC value among the 10 runs. We then plot the values and select the $K$ which yields the best tradeoff between likelihood and model complexity.

TASK 7: Implement gmm_eval.m function (2pts)¶

The function has the following signature:

In [ ]:
function [AIC_curve, BIC_curve] =  gmm_eval(X, K_range, repeats, params)

It takes as input the dataset $\mathbf{X}$, the range of $K$ values to consider, the number of repetitions and the hyperparameter structure. Note that there are no evaluation functions provided for this task.
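A minimal sketch of gmm_eval, looping over the candidate $K$ values and keeping the best (minimum) AIC/BIC over the repeated runs; it calls the EM function under the name my_gmmEM, matching the signature given earlier:

function [AIC_curve, BIC_curve] = gmm_eval(X, K_range, repeats, params)
% For each K in K_range, fit the GMM 'repeats' times and keep the best run.
AIC_curve = zeros(1, length(K_range));
BIC_curve = zeros(1, length(K_range));
for i = 1:length(K_range)
    params.k = K_range(i);
    aic = inf(1, repeats); bic = inf(1, repeats);
    for r = 1:repeats
        [Priors, Mu, Sigma, ~] = my_gmmEM(X, params);
        [aic(r), bic(r)] = gmm_metrics(X, Priors, Mu, Sigma, params.cov_type);
    end
    AIC_curve(i) = min(aic);          % best run = minimum AIC
    BIC_curve(i) = min(bic);          % best run = minimum BIC
end
end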

You can visually evaluate your function by modifying cov_type in the script tp4_gmm_part2.m. You should obtain plots similar to the following for the 1a dataset:

In [11]:
plot_curves(AIC_curve, BIC_curve, params)
In [13]:
plot_curves(AIC_curve, BIC_curve, params)
In [15]:
plot_curves(AIC_curve, BIC_curve, params)

By simply looking at the figure for the full covariance, one can quickly infer that $K=3$ with this type of covariance is the best parametrization for the 1a dataset:

In [30]:
plot_gmm(X,Priors,Mu,Sigma,params,strcat(['Final Estimates ',params.cov_type,' covariance, k=',int2str(params.k)]));

The 1a dataset is somewhat ideal for model selection, as it was itself generated from a GMM. However, these metrics tend to yield different results for data that is not clearly described by a mixture of Gaussians or is not multi-modal. Take for example the 1b dataset: there is no clear elbow or "optimal" point over the range $K=1,\dots,10$:

In [18]:
plot_curves(AIC_curve, BIC_curve, params)
In [20]:
plot_curves(AIC_curve, BIC_curve, params)
In [24]:
plot_curves(AIC_curve, BIC_curve, params)

In these cases the AIC metric always tends towards the maximum number of clusters; however, we can see that the BIC does penalize for model complexity. By analyzing the BIC curve for the isotropic type of GMM, we can see that, in fact, the "optimal" $K$ is 1, which would be the simplest representation of this particular dataset without considering the labels:

In [26]:
plot_gmm(X,Priors,Mu,Sigma,params,strcat(['Final Estimates ',params.cov_type,' covariance, k=',int2str(params.k)]));

Had we done model selection solely with a full Covariance matrix, we might have chosen $K=8$.

In [28]:
plot_gmm(X,Priors,Mu,Sigma,params,strcat(['Final Estimates ',params.cov_type,' covariance, k=',int2str(params.k)]));

Another way of evaluating your models when the AIC/BIC curves are not that informative is to monitor the increase in the likelihood of the model and select the "optimal" $K$ once the likelihood stabilizes.