
File: 02-outlier.py


Michel Bierlaire

Tue Aug 13 11:01:40 2024



In [None]:

import pandas as pd
import biogeme.database as db
import biogeme.biogeme as bio
from IPython.core.display_functions import display
from biogeme.expressions import Beta, Variable
from biogeme.models import loglogit, logit


The objective of this laboratory is to illustrate outlier analysis. We use the Optima case study, which concerns
transportation mode choice in Switzerland.

Read the data

In [None]:
df = pd.read_csv('optima.dat', sep='\t')
display(df)


Prepare the data for Biogeme

In [None]:
database = db.Database('optima', df)


Identification of the relevant variables.

In [None]:
Choice = Variable('Choice')
Weight = Variable('Weight')
age = Variable('age')
Gender = Variable('Gender')
TimePT = Variable('TimePT')
TimeCar = Variable('TimeCar')
TripPurpose = Variable('TripPurpose')
MarginalCostPT = Variable('MarginalCostPT')
CostCarCHF = Variable('CostCarCHF')
distance_km = Variable('distance_km')
Education = Variable('Education')
LangCode = Variable('LangCode')
OccupStat = Variable('OccupStat')
NbTransf = Variable('NbTransf')
Income = Variable('Income')
WaitingTimePT = Variable('WaitingTimePT')
CarAvail = Variable('CarAvail')
Subscription = Variable('Subscription')
GenAbST = Variable('GenAbST')
OwnHouse = Variable('OwnHouse')
NbBicy = Variable('NbBicy')


Removing some incorrectly coded data.

In [None]:
exclude = (
    (age == -1)
    + (Choice == -1)
    + (CostCarCHF < 0)
    + (Income == -1)
    + (CarAvail == 3) * (Choice == 1)
) > 0
database.remove(exclude)
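
The exclusion expression relies on arithmetic over 0/1 indicators: a sum of indicators is positive when at least one
condition holds (logical OR), and a product equals 1 only when both conditions hold (logical AND). The plain-Python
sketch below mimics this logic for a single observation. The function `is_excluded` and the sample rows are
hypothetical illustrations; they are not part of Biogeme or of the Optima data.

```python
def is_excluded(row: dict) -> bool:
    """Mimic the exclusion rule on one observation, using 0/1 indicators."""
    indicators = [
        int(row['age'] == -1),
        int(row['Choice'] == -1),
        int(row['CostCarCHF'] < 0),
        int(row['Income'] == -1),
        # Product of two indicators: 1 only if both conditions hold.
        int(row['CarAvail'] == 3) * int(row['Choice'] == 1),
    ]
    # Sum of indicators: positive if at least one condition holds.
    return sum(indicators) > 0


# Hypothetical observations, made up for illustration.
valid_row = {'age': 40, 'Choice': 0, 'CostCarCHF': 5.2, 'Income': 3, 'CarAvail': 1}
car_without_car = {'age': 40, 'Choice': 1, 'CostCarCHF': 5.2, 'Income': 3, 'CarAvail': 3}

print(is_excluded(valid_row))
print(is_excluded(car_without_car))
```

The second row is rejected because alternative 1 (car) is reported as chosen although `CarAvail == 3`, the value that
the availability conditions of the model treat as "car not available".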



We first estimate a simple model, with alternative-specific time coefficients and a generic cost coefficient:
\begin{align*}
V_\text{PT} &= \text{Cte}_\text{PT} + \beta_{t, \text{PT}} \text{Time}_\text{PT}
+ \beta_c \text{Cost}_\text{PT}, \\
V_\text{Car} &= \text{Cte}_\text{Car} + \beta_{t, \text{Car}} \text{Time}_\text{Car}
+ \beta_c \text{Cost}_\text{Car}, \\
V_\text{SM} &= \beta_d \text{distance}.
\end{align*}

Parameters to be estimated.

In [None]:
ASC_PT = Beta('ASC_PT', 0, None, None, 0)
ASC_CAR = Beta('asc_car', 0, None, None, 0)
BETA_TIME_PT = Beta('beta_time_pt', 0.0, None, None, 0)
BETA_TIME_CAR = Beta('beta_time_car', 0.0, None, None, 0)
BETA_COST = Beta('BETA_COST', 0, None, None, 0)
BETA_DISTANCE = Beta('BETA_DISTANCE', 0, None, None, 0)


Utility functions

In [None]:
V_PT = ASC_PT + BETA_TIME_PT * TimePT + BETA_COST * MarginalCostPT
V_CAR = ASC_CAR + BETA_TIME_CAR * TimeCar + BETA_COST * CostCarCHF
V_SM = BETA_DISTANCE * distance_km

V = {0: V_PT, 1: V_CAR, 2: V_SM}


Availability conditions.

In [None]:
av = {0: 1, 1: CarAvail != 3, 2: 1}


Estimation of the parameters

In [None]:
logprob = loglogit(V, av, Choice)
biogeme = bio.BIOGEME(database, logprob)
biogeme.modelName = 'logit_optima_base'
results = biogeme.estimate()


General statistics.

In [None]:
print(results.print_general_statistics())


Estimated parameters

In [None]:
display(results.get_estimated_parameters())



We now simulate the estimated model to obtain the contribution of each observation to the likelihood function, that
is, the probability that the model assigns to the alternative that was actually chosen.

In [None]:
prob_chosen = logit(V, av, Choice)
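
For reference, the quantity computed for each observation can be sketched in plain Python: the logit probability of
the chosen alternative, where the sum in the denominator runs over the available alternatives only. The function
`logit_probability` and the numerical utilities below are hypothetical and only illustrate the formula; in the
notebook, Biogeme's `logit` expression plays this role.

```python
import math


def logit_probability(utilities: dict, availability: dict, chosen: int) -> float:
    """Logit probability of `chosen` among the available alternatives."""
    # Assumption of this sketch: an unavailable alternative has probability 0.
    if not availability[chosen]:
        return 0.0
    # Subtract the largest available utility for numerical stability.
    v_max = max(v for alt, v in utilities.items() if availability[alt])
    exp_v = {
        alt: math.exp(v - v_max)
        for alt, v in utilities.items()
        if availability[alt]
    }
    return exp_v[chosen] / sum(exp_v.values())


# Made-up utilities for alternatives 0 (PT), 1 (car) and 2 (SM).
utilities = {0: -1.0, 1: 0.5, 2: -2.0}
all_available = {0: 1, 1: 1, 2: 1}
no_car = {0: 1, 1: 0, 2: 1}

p_car = logit_probability(utilities, all_available, chosen=1)
p_pt_no_car = logit_probability(utilities, no_car, chosen=0)
print(f'P(car) = {p_car:.3f}, P(PT | no car) = {p_pt_no_car:.3f}')
```

The logarithm of this probability is the contribution of the observation to the log likelihood, which is the quantity
that `loglogit` provides during estimation.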


We define a dictionary with the formulas to be simulated. Here, there is only one.

In [None]:
simulate = {'Prob. chosen': prob_chosen}


We perform the simulation.

In [None]:
biosim = bio.BIOGEME(database, simulate)
betas_values = results.get_beta_values()
sim_results = biosim.simulate(the_beta_values=betas_values)
display(sim_results)



Consider as outliers all observations for which the predicted probability of the chosen alternative is less than 10%.

- Investigate the data in order to understand why the model is performing so poorly on those observations.
- Use your conclusions to improve the model specification.
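
A possible first step is sketched below: flag the observations whose simulated probability falls below the
threshold, so that they can be compared with the rest of the sample. The data frame is a small made-up stand-in for
`sim_results`, and the names `THRESHOLD`, `sim_results_example`, `is_outlier` and `outliers` are introduced here for
illustration only.

```python
import pandas as pd

# Outlier threshold on the probability of the chosen alternative (10%).
THRESHOLD = 0.1

# Made-up stand-in for the simulation results; values are not real estimates.
sim_results_example = pd.DataFrame(
    {'Prob. chosen': [0.62, 0.04, 0.35, 0.09, 0.81]},
    index=[10, 11, 12, 13, 14],
)

# Boolean mask identifying the poorly predicted observations.
is_outlier = sim_results_example['Prob. chosen'] < THRESHOLD

# Keep the outliers, worst predictions first.
outliers = sim_results_example[is_outlier].sort_values('Prob. chosen')

print(f'{len(outliers)} outlier(s) out of {len(sim_results_example)}')
print(outliers)
```

In the notebook, the same filter would be applied to `sim_results`. Assuming its index matches the rows of the data
used for estimation, the index of `outliers` can then be used to look up the corresponding observations and inspect
variables such as travel times, costs and distance.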