Exercise session¶

This session is a non-graded assignment designed to prepare you for the rest of the course. Throughout the following exercises you will learn how to handle, visualize and clean data for use in your machine learning algorithms. These exercises will also introduce you to some vocabulary that you will see and use during this course and in your career as a machine learning expert. Those words will be highlighted in bold.

Remember to ask questions if you are stuck; the teaching assistants are there for you. Also, most of the answers can be found in the Matlab documentation, so use it extensively.

Part 1: Data, data everywhere¶

If one word could summarize machine learning, it would surely be data. Without data, no learning!

But what is data, then? Well, simply put, everything. From the words you type in your browser to a tap on your smartphone, every digital action can be considered data and used in machine learning. However, some data requires preprocessing, i.e., transforming and cleaning the data beforehand.

Data samples are stored in a dataset, a collection of elements often represented as a table with multiple rows and columns. It is common practice that rows represent samples and columns represent the features of a sample, i.e., its characteristics such as measurements, age, phone number, etc.

We will start with a very simple dataset, the Fisher iris flower dataset.

Loading a dataset is usually straightforward, as most datasets are stored in a standard format such as CSV files or Matlab .mat files. For the iris dataset, Matlab even provides access to it in a single command.

  • Start by creating a new script in Matlab New > Script and write the following commands:
In [ ]:
clc;
clear;
close all;

Those commands will clear all variables and close all open windows each time you run the script. Remember to always run a script that is in your Matlab path. You can run the script by hitting the Run button at the top.

  • Add the following command line to your script to load the iris dataset:
  • Note: this command requires the Statistics and Machine Learning Toolbox to be installed.
In [1]:
load fisheriris

Voila! You have your first data in hand. If you have opened the Wikipedia page on the iris dataset, you know that it contains measurements of some iris flowers.

On the right side, in the tab called Workspace, you can see that the command has created two variables: meas and species. Let us look inside those:

In [ ]:
meas

Writing the name of a variable in your script without ; at the end will display it. You can also use the command:

In [ ]:
disp(meas);

As a piece of advice, always put a ; after each Matlab command. If you want to display a variable, prefer the disp function. You can also use more advanced printing functions to format the information you display.
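
For example, a quick sketch of both options (the fprintf format string here is just an illustration):

In [ ]:
disp(meas(1, :));                               % display the first sample
fprintf('Sepal length: %.1f cm\n', meas(1, 1)); % formatted display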

You can write commands directly in the Command Window at the bottom to make some quick tests that you don't want to add to your scripts.

Now, have a look at the two variables and try to understand what they represent. Specifically, ask yourself:

  • How many samples does the dataset contain?
  • How many features?
  • Is there a relationship between meas and species?

You probably answered yes to the last question, especially if you noticed that both tables contain the same number of rows. We call meas a feature table and species a label table. Each sample of meas is associated with a label in species.

One goal of machine learning, as you will see in the coming weeks, is to find such relationships between the features and the label(s) of samples. For example, you will be able to answer the following questions:

  • Knowing the measurements of a flower, can I predict its species?
  • Can I find the most likely sepal width of an Iris setosa knowing its sepal length?

Those are questions that you can answer by digging into the data and building some models. But today you will see how to manipulate data in more hands-on ways.

Part 2: Data Manipulation¶

Manipulating data is not forging data. Machine learning algorithms expect data in a certain format, ranging from the dimensions expected at the input of a function to the values contained in an array. When handling and modifying data for your purposes, you should be extra careful not to break the relationships between samples and to preserve the coherence of the dataset.

Accessing data¶

Let us first define a function that takes the index of a sample and returns its features and label. To define a function you have to give it a name, some inputs and possibly some outputs. The combination of name, inputs and outputs of a function is its signature. Each function needs to be written in its own file.

  • Start by creating a function file New > Function and give it the following signature:
In [ ]:
function [feature, species] = pick(i, features_table, species_table)

This function will return the i-th element of features_table and species_table, i.e. the i-th sample of the dataset.

  • Write its content by having a look at array indexing. Then you can call it in your script and you should see the following output:
In [4]:
[X, y] = pick(10, meas, species)
X =

    4.9000    3.1000    1.5000    0.1000


y =

  1x1 cell array

    {'setosa'}
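
If you are stuck, a minimal sketch of pick could look like the following; indexing rows keeps the features and the label of a sample aligned:

In [ ]:
function [feature, species] = pick(i, features_table, species_table)
    feature = features_table(i, :);  % i-th row(s) of the feature table
    species = species_table(i, :);   % corresponding label(s)
end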

It is common practice to store features in a table named X and labels in y. The use of a capital or lowercase letter usually reflects whether the variable holds a matrix or a vector/single element. But feel free to give more explicit names.

One very good feature of indexing in Matlab is that you can give an array of indexes and it will still work as expected.

  • Try the following command and observe the results:
In [ ]:
[X, y] = pick([1,2,3,4], meas, species);

As you have observed, this returns the first four elements of your dataset. So remember, you can always use arrays when indexing in Matlab and this will be very useful for the rest of this lecture.

Now we will go a bit further. Try to find the input arrays that will allow you to:

  • Store the whole dataset in X and y
  • Store the elements at odd indexes

One hint: if constructing those arrays requires you to write loops, you are not doing it right; each should be a one-liner. As a matter of fact, remember one very important thing: Matlab HATES loops, as they are slow to execute. Avoid writing loops as much as possible; prefer array indexing and matrix multiplications, which are optimized for performance.
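
If you need a hint, the range operator gives a one-liner for each (a sketch, assuming the 150-sample iris dataset):

In [ ]:
[X, y] = pick(1:length(meas), meas, species);    % the whole dataset
[X, y] = pick(1:2:length(meas), meas, species);  % elements at odd indexes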

Indexing with arrays has another very interesting feature: conditional indexing. You can store the result of a logic expression in an array and use it as the input of your function. In that case, your index array is a logic array comprising only 0 and 1 values, and Matlab will automatically select the data whose index in the logic array has a 1 value.

For example, one solution to the previous question on odd indexes (certainly different from the one you have found, but it makes use of this feature) could be:

In [ ]:
idx = (repmat(1:2,1,75) == 1);
[X, y] = pick(idx, meas, species);

Use this feature to retrieve the following samples:

  • All the flowers with a sepal width (i.e. the second feature of meas) greater than 3
  • All the versicolor flowers
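
If you get stuck, one possible solution for each is sketched below; strcmp returns a logic array when given a cell array of strings:

In [ ]:
idx = (meas(:, 2) > 3);               % sepal width greater than 3
[X, y] = pick(idx, meas, species);
idx = strcmp(species, 'versicolor');  % all the versicolor flowers
[X, y] = pick(idx, meas, species);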

You now have all you need to access data from a Matlab matrix or array.

Sorting and shuffling data¶

If you look at the species table, you can see that it is ordered by species. One common practice is to shuffle the dataset first to break that ordering, in order to avoid introducing bias in the learning process. Matlab does not provide a shuffle function per se, but we can use array indexing as we did previously.

In order to generate a random array of indexes you can use the function randperm:

In [ ]:
idx  = randperm(length(meas));

This creates an array of randomly ordered numbers from 1 to the length of the dataset. Now you can use it to shuffle both the species and meas arrays:

In [ ]:
X = meas(randperm(length(meas)),:);
y = species(randperm(length(species)),:);
  • Can you tell why those previous two lines are absolutely wrong?

If not, please refer to the comment in the introduction of Part 2 about NOT breaking the relationships between samples and labels; it is crucial that you understand what is wrong here. If yes, can you do it properly in your script?
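
A minimal sketch of a correct shuffle: generate one single permutation and apply it to both arrays, so that each sample keeps its label.

In [ ]:
idx = randperm(length(meas));  % one single permutation...
X = meas(idx, :);              % ...applied to the features
y = species(idx, :);           % ...and to the matching labels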

Now that our data are shuffled, why not try to put them back in order? We can use the sort function for that:

In [ ]:
y = sort(y);
X = sort(X);
  • Again, is there something wrong here? Do you see why it is even worse than before in terms of messing up the data? Can you find, in the documentation of the sort function, a way to do this properly?
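
One proper approach, sketched here, uses the second output of sort, which contains the permutation of indexes:

In [ ]:
[y, idx] = sort(y);  % idx is the permutation that sorts the labels
X = X(idx, :);       % reorder the features with the same permutation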

One natural choice is to sort the data per label, as they were when we loaded the dataset. After sorting the data we can check that everything is as before. One way to check this is to run the following:

In [ ]:
abs(X - meas) < 1e-4;

If our new data is sorted as before, this test should return true for every sample of our dataset. The < 1e-4 threshold could also be replaced by == 0; however, it is better to compare the absolute differences against a small threshold, as numerical round-off could cause them to not be exactly equal to 0.

  • What do you observe? Do you get the values expected? Probably not...

Putting the values back exactly as they were is not possible with the simple sort function and the data as we have them. The reason is that sorting by label does not guarantee that the first setosa of the original dataset will still come first after sorting. If we want to do that, we need to add an extra column, either in the features or in the labels, that keeps track of the initial indexes. Let us add this column in the feature set:

In [ ]:
X = [[1:length(meas)].', meas];

The .' operator is the shortcut for transpose. You can check its importance here by trying to remove it. Note that you can also use ' for the same result, but have a look in the documentation to understand the difference between them: ' is the complex conjugate transpose, while .' is the plain transpose. You will see that in our case, with real-valued data, it makes no difference in the results; however, it is recommended to get into the habit of using the proper one.

  • Now try to use the newly added column to shuffle and sort back the values to their initial state.
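
A possible approach, sketched below: shuffle the rows with a single permutation (the index column travels with its row), then sort on that first column to restore the initial state.

In [ ]:
idx = randperm(size(X, 1));
X = X(idx, :);               % shuffle; the index column follows each row
y = y(idx, :);
[~, order] = sort(X(:, 1));  % permutation that re-sorts the index column
X = X(order, :);
y = y(order, :);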

With those functions you have everything you need for the next section, which covers how to slice the dataset.

Data slicing¶

Data slicing, partitioning, and splitting are synonyms and refer to how you can separate a dataset into parts. You will see throughout the lectures that this is particularly important. For now, let us just assume that we want to cut the dataset into two portions of different sizes.

  • Create a function with the following signature:
In [ ]:
function [X1, y1, X2, y2] = split_dataset(features, labels, ratio)

In ratio we will put the desired size of the first cut as a fraction of the whole dataset. For example, ratio = 0.5 means that we want to cut the dataset into two parts of equal size.

  • Write the content of the function using concepts and functions we have seen previously. As an additional constraint, we want each slice to be shuffled.
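
If you need inspiration, here is one possible (non-stratified) sketch; it shuffles one permutation of indexes and cuts it in two:

In [ ]:
function [X1, y1, X2, y2] = split_dataset(features, labels, ratio)
    n = size(features, 1);
    idx = randperm(n);                % shuffle all indexes at once
    n1 = round(ratio * n);            % size of the first slice
    X1 = features(idx(1:n1), :);
    y1 = labels(idx(1:n1), :);
    X2 = features(idx(n1+1:end), :);
    y2 = labels(idx(n1+1:end), :);
end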

It is important to always check that the outputs we get are coherent. Here, for example, we should check that:

In [ ]:
length(X1) == ratio * length(meas)
abs(sort(meas) - sort([X1;X2])) < 1e-4

Note that the order is of no importance for this specific test; we just want to check that we have not lost or duplicated any samples along the way. We can even use a very handy function, assert:

In [ ]:
assert(length(X1) == ratio * length(meas))
assert(all(all(abs(sort(meas) - sort([X1;X2])) < 1e-4)))

Now depending on how you have implemented the splitting, you might want to test the following:

In [ ]:
assert(sum(strcmp(y1, 'setosa')) == ratio * length(meas) / 3);

It is sometimes useful to keep, in each slice of the dataset, the same ratio of labels as in the whole. Here the dataset contains 3 types of flowers with 50 samples each. Therefore, if we cut the dataset in two (ratio = 0.5), we might want to have 25 flowers of each type in each slice of the dataset.

  • Try to modify your split_dataset function to satisfy the previous assert.
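
One way to satisfy it, sketched below, is to split each class separately and concatenate the per-class slices; a short loop over the three classes is acceptable here:

In [ ]:
function [X1, y1, X2, y2] = split_dataset(features, labels, ratio)
    X1 = []; y1 = {}; X2 = []; y2 = {};
    classes = unique(labels);                 % the three flower species
    for k = 1:numel(classes)
        idx = find(strcmp(labels, classes{k}));
        idx = idx(randperm(numel(idx)));      % shuffle within the class
        n1 = round(ratio * numel(idx));       % per-class size of slice 1
        X1 = [X1; features(idx(1:n1), :)];
        y1 = [y1; labels(idx(1:n1))];
        X2 = [X2; features(idx(n1+1:end), :)];
        y2 = [y2; labels(idx(n1+1:end))];
    end
    p1 = randperm(size(X1, 1));               % re-shuffle each slice so it
    X1 = X1(p1, :); y1 = y1(p1);              % is not grouped by class
    p2 = randperm(size(X2, 1));
    X2 = X2(p2, :); y2 = y2(p2);
end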

Conclusion¶

Congratulations, you have finished the first part of the exercise session. In this part, we have seen how to handle datasets and split them into slices of different sizes. The next part will cover more in-depth analysis of the data at hand and how to preprocess it to satisfy specific requirements.