exercise_part2

This session is a non-graded assignment designed to prepare you for the rest of the course. Throughout the following exercises you will learn how to handle, visualize and clean data for use in your machine learning algorithms. The exercises will also introduce you to some vocabulary that you will see and use during this course and in your career as a machine learning expert. Those words will be highlighted in bold.

Remember to ask questions if you are stuck; the teaching assistants are there for you. Also, most of the answers can be found in the MATLAB documentation, so use it extensively.

In the first session we saw how to read, store and handle data properly. In this session we will see how to prepare it for machine learning algorithms.

Part 3 Categorical and numerical data

Most datasets use two types of data, numerical and categorical. Numerical data are numbers, e.g. the age of a person, and do not require special care to be used in your algorithms. Categorical data, on the other hand, are most often represented as strings and cannot be processed as is, especially when combined with numerical data. Let us start again with the iris flower dataset.

Remember the species array? It contains string data representing the species of each iris flower. Converting categorical data is simple: we just need to map each unique string to a number. This number can be arbitrary, but we usually use an index ranging from 1 to the number of unique categories. First we need to find the unique categories and count them. For this we can use the unique function:
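The original code cell is not shown here, so the following is only a minimal sketch. It assumes the iris data come from MATLAB's bundled fisheriris sample (Statistics and Machine Learning Toolbox), which defines meas and species; the names categories and nCategories are chosen for illustration.

```matlab
% Assumption: the iris data come from the bundled fisheriris sample,
% which defines meas (150x4 double) and species (150x1 cell array of strings).
load fisheriris

categories  = unique(species);    % sorted unique strings, as a cell array
nCategories = numel(categories);  % number of distinct species

disp(categories)
disp(nCategories)
```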

Then we can replace each category by a number, setosa being 1 and virginica 3. As the species array is a cell array, it can be tricky to handle. Extracting a category as a string rather than as a cell array is done using:
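As a small sketch of the difference between the two indexing operators, again assuming that species comes from the fisheriris sample (asCell and asString are illustrative names):

```matlab
load fisheriris   % assumption: species comes from the bundled fisheriris sample

asCell   = species(1);   % parentheses return a 1x1 cell array
asString = species{1};   % braces return the content of the cell

disp(class(asCell))      % prints: cell
disp(class(asString))    % prints: char
disp(strcmp(asString, 'setosa'))   % string comparison works on the content
```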

The {} operator extracts the content of a cell, whereas the () operator returns a smaller cell array. Use {} whenever you need the stored value itself. Let us try to apply this.

Write a function that converts any categorical array to a numerical array of corresponding indexes

This function should output a numerical array containing only numbers from 1 to the number of unique elements in the input categorical array. Got it working? Again, you should test your output. First, you should only see numbers ranging from 1 to 3 when it is applied to species.

The function grp2idx, which MATLAB provides for this purpose, is used in the assertion below. It can be replaced with one's own function.
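The original assertion is not reproduced here; one possible way to write such a range check, assuming species comes from the fisheriris sample, is:

```matlab
load fisheriris            % assumption: species from the fisheriris sample
y = grp2idx(species);      % swap in your own function here

% Every index must be an integer between 1 and the number of categories
nCategories = numel(unique(species));
assert(all(y >= 1 & y <= nCategories & y == round(y)))

% Every category must be used at least once
assert(isequal(unique(y)', 1:nCategories))
```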

In this example, the function used is named grp2idx, but yours might be named differently. The second thing to check is the correspondence between the original categorical array and this new numerical array: each index must map back to the string it replaced.
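A sketch of such a correspondence check is given below. It relies on the second output of grp2idx, which lists the string associated with each index; the variable names are illustrative, and the assertion used in the original notebook may have been written differently.

```matlab
load fisheriris                  % assumption: species from the fisheriris sample
[y, names] = grp2idx(species);   % names{k} is the string coded as index k

% Map every index back to its string, then compare with the original array
recovered = reshape(names(y), size(species));
assert(isequal(recovered, species))
```

Breaking it down: names(y) picks, for each sample, the string stored at position y in names, so the result must reproduce species exactly.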

This assert might be a bit confusing. Break it down element by element if you want to see how it works. By the way, did you write the function yourself? Did you first look online to check whether such a function already existed? You might want to look up the grp2idx function used here.

As a side note, MATLAB provides a lot of already implemented functions that you can use. Unless stated otherwise for the sake of your training, you will never have to rewrite everything from scratch. A good machine learning expert, and by extension a good programmer, always reuses existing code and searches for optimized functions to solve the problem at hand. If you don't know them, here is a list of websites you should bookmark for that purpose:

MATLAB documentation: this is an obvious one, and you should already have visited it at least once.
Stack Overflow: the Doctissimo of programming, minus the dumb questions and health advice. Note that Stack Exchange, which hosts Stack Overflow, also includes similar forums dedicated to specific disciplines such as mathematics.
Kaggle: originally a competition website for data scientists and machine learning experts, it now contains a lot of sample code to use and data to try your algorithms on.
Medium: slowly becoming one of the biggest news websites, in tech and science but not only. Just look for machine learning articles if you are bored or out of ideas to try. It also hosts a website dedicated to data scientists, towardsdatascience.

Alright, enough links. Let us go back to manipulating data. Now it is time for visualization.

Part 4 Data visualization

Visualization is probably the most important part, and somehow the most often forgotten. Understanding the relations between data and extracting patterns just by looking at numbers is tedious; figures and plots make this much easier. First let us load our prepared data:
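How the prepared data were saved in the first session is not shown here. As a stand-in, the sketch below rebuilds X and y from the bundled fisheriris sample, assuming that X holds the four measurements and y the numerical species index from Part 3.

```matlab
% Assumption: X and y are rebuilt from fisheriris rather than loaded
% from the file prepared in the first session.
load fisheriris          % defines meas and species
X = meas;                % feature matrix, one row per flower
y = grp2idx(species);    % numerical label of each flower (1, 2 or 3)
```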

Before going further, we need a better understanding of the dimensions of the arrays.

Find the function that outputs the dimensions of each array

This function should output 150 4 for X and 150 1 for y. This already gives you some insight into the data: X contains 150 samples with 4 features each, as you already know. Let us say we want to visualize the features of all the samples, per flower category. Because plots are easier to read in 2D, we will visualize the features pairwise, i.e. by selecting two features and plotting them against each other on the x- and y-axis, respectively. In geometric terms, we consider the feature vector x as the four-dimensional coordinates of each sample, and then project each sample's coordinates onto a plane spanned by two coordinate axes.

First we create a vector of strings containing the label of each feature, and an index array containing all the combinations of two integers among four. MATLAB provides the function combnk for this purpose.
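A minimal sketch of this step follows. It swaps in the base MATLAB function nchoosek, which needs no toolbox; combnk(1:4, 2) yields the same six pairs, possibly in a different row order. The label strings assume the column order of the fisheriris sample, and featureNames and pairs are illustrative names.

```matlab
% Assumption: the columns of X follow the fisheriris order
featureNames = {'Sepal length', 'Sepal width', 'Petal length', 'Petal width'};

% One row per pair of feature indices: 4 choose 2 = 6 rows
pairs = nchoosek(1:4, 2);
disp(pairs)
```

Each row of pairs then selects the two columns of X to plot against each other.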

The next step is to extract the groups of flowers based on the y vector. You already did that last week.

Extract all the features corresponding to each type of flower and store the results in Xsetosa, Xversicolor and Xvirginica

You can now use the following commands to plot
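The original plotting commands are not shown; the sketch below is one way to plot the first two features, with one marker style per species. It assumes the fisheriris data and the index convention 1 = setosa, 2 = versicolor, 3 = virginica; the marker choices are arbitrary.

```matlab
load fisheriris                      % assumption: data from fisheriris
X = meas;
y = grp2idx(species);

% Logical indexing keeps the rows belonging to one species
Xsetosa     = X(y == 1, :);
Xversicolor = X(y == 2, :);
Xvirginica  = X(y == 3, :);

% Plot feature 1 against feature 2, one marker style per species
figure
hold on
plot(Xsetosa(:, 1),     Xsetosa(:, 2),     'ro')
plot(Xversicolor(:, 1), Xversicolor(:, 2), 'g+')
plot(Xvirginica(:, 1),  Xvirginica(:, 2),  'b*')
hold off
xlabel('Sepal length')
ylabel('Sepal width')
legend('setosa', 'versicolor', 'virginica')
```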

With that, we can visualize the whole dataset, one pair of features at a time.

Write a function that plots the 6 figures obtained by taking the features two at a time

Your function should provide the following result:

Interestingly, by looking at those plots you could define some logical rules that tell you how to classify the flowers, i.e. how to find their species only by looking at the features.

Can you find at least one of those rules?

Another useful figure is the histogram, which counts the number of samples falling in each range of values of the selected feature. For example, the following commands show the histogram of each species for the first feature.
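The commands themselves are missing from this copy. A possible version, assuming the fisheriris data and the histogram function (available since R2014b), is sketched here:

```matlab
load fisheriris                      % assumption: data from fisheriris
X = meas;
y = grp2idx(species);

% Overlay one histogram per species for the first feature
figure
hold on
for k = 1:3
    histogram(X(y == k, 1))          % default bars are semi-transparent
end
hold off
xlabel('Sepal length')
ylabel('Number of samples')
legend('setosa', 'versicolor', 'virginica')
```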

Write a function that plots the histograms of all the features

This function should output the following:

This is again very useful. As we can see, the setosa species has a distinctive petal length and width compared to the other species. This means that we can easily train a classifier that predicts whether a flower belongs to the setosa species by looking only at those features, or even at a single one of them.
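To make the idea concrete, here is a sketch of a hand-written rule on a single feature. It is not a trained classifier: the threshold is a hypothetical value picked by eye from the histograms, and the data are assumed to come from fisheriris.

```matlab
load fisheriris                      % assumption: data from fisheriris
petalLength = meas(:, 3);

threshold = 2.5;                     % hypothetical value, chosen by eye
predictedSetosa = petalLength < threshold;
actualSetosa    = strcmp(species, 'setosa');

% Fraction of flowers for which the rule agrees with the true label
accuracy = mean(predictedSetosa == actualSetosa);
disp(accuracy)
```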

Another way to look at this is to use boxplots. They are interesting because they give some insight into whether or not the differences between species on a specific feature could be significant.
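The feature used in the original example is not known, so this sketch picks petal length. It assumes the fisheriris data and the boxplot function from the Statistics and Machine Learning Toolbox.

```matlab
load fisheriris                  % assumption: data from fisheriris

% One box per species for the third feature (petal length)
figure
boxplot(meas(:, 3), species)
ylabel('Petal length')
```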

Here we can use the species array directly, as it contains the labels as strings. If you use the y array, you will get 1 2 3 as labels.

Write a function that shows the boxplots of all the features

This function should output the following:

Boxplots give an idea of whether the differences between groups are significant. Despite having different means, overlapping boxplots might suggest that the differences are actually not significant. Here we can see that setosa flowers have a petal width and petal length significantly different from those of the two other species, as the boxplots are well separated. Therefore, we can be confident that a classifier relying on those two features will be quite reliable at identifying setosa flowers.

Voilà! We have covered the basics of data visualization. Obviously, what we did is valid only when the y vector contains categorical data. Throughout the course you will also see examples where the y vector contains continuous numerical values. Visualization in that case requires particular care that we will not cover today.

This concludes the exercise session. In these parts we have covered the basics of handling data and visualizing it. You now have all the basic tools in your hands to implement machine learning algorithms and start your journey towards being a machine learning developer.

In the next course, you will have your first graded assignment on Principal Component Analysis.
