File 01-descriptive.py


Michel Bierlaire

Wed Aug 7 18:10:49 2024




Before using a data file for modeling purposes, it is important to
collect some information about its content. The objective of this lab is to extract some descriptive statistics
from a database with choice data using the package `pandas`.

We introduce some examples using the file `swissmetro.dat`. 

We first import `pandas`, together with the display and plotting helpers used below.

In [None]:
import pandas as pd
from IPython.core.display_functions import display
from matplotlib import pyplot as plt


The data file is available at
[http://transp-or.epfl.ch/data/swissmetro.dat](http://transp-or.epfl.ch/data/swissmetro.dat).

The description of the columns of the file is
available [here](http://transp-or.epfl.ch/documents/technicalReports/CS_SwissmetroDescription.pdf).

Read the file. For future laboratories, it is advised to download the file and store it on your local disk. Here,
we will obtain it from its URL.

In [None]:
data_file = 'http://transp-or.epfl.ch/data/swissmetro.dat'
swissmetro = pd.read_csv(data_file, sep='\t')
display(swissmetro)
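The advice to keep a local copy of the file can be sketched as follows. This is only one possible approach: the helper name `load_tab_separated` and its arguments are hypothetical and not part of the lab material. It reads the local file if it exists, and otherwise downloads it once and stores it on disk.

```python
from pathlib import Path

import pandas as pd


def load_tab_separated(local_path: str, url: str) -> pd.DataFrame:
    """Read a tab-separated file from disk, downloading it once if needed.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = Path(local_path)
    if not path.exists():
        # First call: fetch the file from the URL and keep a local copy.
        pd.read_csv(url, sep='\t').to_csv(path, sep='\t', index=False)
    # Subsequent calls read the local copy only.
    return pd.read_csv(path, sep='\t')
```

With this helper, the cell above could be written as `swissmetro = load_tab_separated('swissmetro.dat', data_file)`, where the local file name is an arbitrary choice.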



- The database contains 10728 rows of data, each corresponding to one
observation in the sample.
- It contains 28 columns, corresponding to
the available variables.

The list of columns is reported below.

In [None]:
display(swissmetro.columns)
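The row and column counts quoted above do not have to be read off the displayed table: they are available through the `shape` attribute of the data frame. The sketch below uses a small made-up table (not the Swissmetro data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy table standing in for the real data file.
toy = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'PURPOSE': [1, 1, 3, 3],
    'CHOICE': [2, 1, 2, 2],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_columns = toy.shape
print(f'{n_rows} rows, {n_columns} columns')

# The column names as a plain Python list.
print(list(toy.columns))
```

Applied to the real data, `swissmetro.shape` should give the two figures listed above.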


If we look at the column `ID`, we observe that it contains 1192
unique values, corresponding to the 1192 individuals who
participated in the survey. Each of these respondents was asked to
perform 9 choice exercises, for a total of 10728 observations (the
number of rows in the file).

In [None]:

display(swissmetro['ID'].unique())
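Displaying the unique values does not directly give their number. A minimal sketch of how the count of individuals and the number of observations per individual could be checked is given below, again on a made-up table rather than the real file.

```python
import pandas as pd

# Hypothetical toy table: 2 individuals with 3 observations each.
toy = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'CHOICE': [2, 1, 2, 3, 3, 1],
})

# Number of distinct individuals.
n_individuals = toy['ID'].nunique()

# Number of rows (observations) recorded for each individual.
obs_per_individual = toy.groupby('ID').size()

# True if every individual has the same number of observations.
balanced = obs_per_individual.nunique() == 1

print(n_individuals)
print(obs_per_individual)
print(balanced)
```

On the Swissmetro file, the same calls on `swissmetro['ID']` can be used to verify the figures of 1192 individuals and 9 observations each.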


If we look at the column `PURPOSE`, corresponding to the trip
purpose, it contains a total of 9 unique values, numbered from 1 to
9. 

In [None]:
display(swissmetro['PURPOSE'].unique())


To better understand the distribution of these values, we
can calculate the frequency of each value, sorted here in decreasing
order of frequency.

In [None]:
display(swissmetro['PURPOSE'].value_counts())
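Absolute counts are sometimes easier to interpret as shares of the sample, or sorted by code rather than by frequency. The sketch below illustrates both variants of `value_counts` on a made-up series (the values are not taken from the data file).

```python
import pandas as pd

# Hypothetical toy series of trip purpose codes.
purpose = pd.Series([1, 1, 3, 3, 3, 4], name='PURPOSE')

# Relative frequencies (they sum to 1), sorted by decreasing frequency.
shares = purpose.value_counts(normalize=True)

# Absolute frequencies, sorted by the code itself instead of by frequency.
by_code = purpose.value_counts().sort_index()

print(shares)
print(by_code)
```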


The histogram of this distribution is also useful.

In [None]:
_ = swissmetro['PURPOSE'].value_counts().plot(title='PURPOSE', kind='bar')
plt.show()


We do the same for the `CHOICE` variable.

In [None]:
display(swissmetro['CHOICE'].value_counts())


And the corresponding histogram.

In [None]:
_ = swissmetro['CHOICE'].value_counts().plot(title='CHOICE', kind='bar')
plt.show()


If we look at the column `INCOME`, we note that it is also
coded as a discrete variable, with 5 unique values, distributed as follows.

In [None]:
display(swissmetro['INCOME'].value_counts())



And we can represent the histogram using horizontal bars. 

In [None]:
_ = swissmetro['INCOME'].value_counts().plot(title='INCOME', kind='barh')
plt.show()


If we look at a continuous variable, such as `TRAIN_TT`,
representing the travel time by train, we are interested in statistics
such as the mean, the standard deviation, the minimum and maximum
values, as well as some quantiles.

In [None]:

display(swissmetro['TRAIN_TT'].describe())



It is interesting to note that 75% of the values are less than or equal
to 209, while the maximum is 1049. 
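The meaning of the 75% quantile can be checked numerically. The sketch below uses a made-up series with one large value (not the actual travel times) and verifies that about three quarters of the observations lie at or below the quantile reported by `describe`.

```python
import pandas as pd

# Hypothetical toy travel times, with one unusually large value.
travel_time = pd.Series([60, 90, 120, 150, 600], name='TRAIN_TT')

# The 75% quantile, computed directly.
q75 = travel_time.quantile(0.75)

# The same quantity as reported in the summary table.
q75_from_describe = travel_time.describe()['75%']

# Share of observations less than or equal to the quantile.
share_below = (travel_time <= q75).mean()

print(q75, q75_from_describe, share_below)
```

With only five observations the share is 80% rather than exactly 75%, which is expected on such a small sample.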

A histogram can also be plotted.

In [None]:
_ = swissmetro['TRAIN_TT'].hist()
plt.show()



A similar analysis of the variable `SM_CO` provides the
following statistics.

In [None]:
display(swissmetro['SM_CO'].describe())
_ = swissmetro['SM_CO'].hist()
plt.show()


The histogram may be made more readable by using a log scale for the counts.

In [None]:
_ = swissmetro['SM_CO'].hist(log=True)
plt.show()


It is also interesting to investigate the correlation between
two variables.

In [None]:
display(swissmetro['TRAIN_TT'].corr(swissmetro['TRAIN_CO']))
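When more than two variables are of interest, the pairwise call above can be replaced by a full correlation matrix. The sketch below is an illustration on made-up numbers (not the Swissmetro values), assuming that the default Pearson correlation is the one wanted.

```python
import pandas as pd

# Hypothetical toy table with three continuous variables.
toy = pd.DataFrame({
    'TRAIN_TT': [60, 90, 120, 150],
    'TRAIN_CO': [20, 35, 40, 60],
    'SM_CO': [30, 45, 55, 80],
})

# Correlation between two specific columns (Pearson by default).
pairwise = toy['TRAIN_TT'].corr(toy['TRAIN_CO'])

# Correlation matrix for all columns at once.
corr_matrix = toy.corr()

print(pairwise)
print(corr_matrix)
```

Each off-diagonal entry of the matrix coincides with the corresponding pairwise call, and the diagonal entries are equal to 1.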



The correlation can also be illustrated using a scatter plot.

In [None]:
_ = swissmetro.plot(kind='scatter', x='TRAIN_TT', y='TRAIN_CO', color='r')
plt.show()


Now, you are asked to perform a similar analysis of the file
[http://transp-or.epfl.ch/data/optima.dat](http://transp-or.epfl.ch/data/optima.dat). The description of the data is
available [here](http://transp-or.epfl.ch/documents/technicalReports/CS_OptimaDescription.pdf).