
File: 04-wooldridge.py


Michel Bierlaire

Sun Aug 25 18:45:36 2024



In [None]:

import pandas as pd
import biogeme.biogeme_logging as blog
from IPython.core.display_functions import display
from biogeme.biogeme import BIOGEME
from biogeme.database import Database
from biogeme.expressions import (
    Beta,
    Variable,
    bioDraws,
    PanelLikelihoodTrajectory,
    MonteCarlo,
    log,
    Expression,
)
from biogeme.models import logit


Consider the estimation of the dynamic choice model with panel effects that was performed in the previous laboratory,
and reported below for reference. The objective of this laboratory is to address the "initial condition problem"
using Wooldridge's method.

As the estimation time may be long, we ask Biogeme to report the details of the iterations.

In [None]:
logger = blog.get_screen_logger(level=blog.INFO)


**Tip:**<div class="alert alert-block alert-info">Start with a low number of draws until the script works well.
Then increase the number of draws (to 10000, say) and run the script overnight.
</div>

In [None]:
number_of_draws = 10
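
The reason a low number of draws is only suitable for debugging can be illustrated with a toy simulation. The sketch below does not use Biogeme. It averages a logit probability over normal draws of a constant, repeats the exercise 50 times, and reports how much the simulated value varies from one repetition to the next. The function `simulated_probability`, the toy integrand and the seed are assumptions made for this illustration only.

```python
import math
import random
from statistics import mean, stdev


def simulated_probability(n_draws, rng):
    """Average a logit probability over normal draws of the constant."""
    # Toy integrand (assumption): P(smoking) with V = c and c ~ N(0, 1).
    return mean(
        1.0 / (1.0 + math.exp(-rng.gauss(0.0, 1.0))) for _ in range(n_draws)
    )


rng = random.Random(0)
spread = {}
for n_draws in (10, 1000):
    # Repeat the simulation 50 times to see how much the estimate moves.
    estimates = [simulated_probability(n_draws, rng) for _ in range(50)]
    spread[n_draws] = stdev(estimates)
    print(f'{n_draws:>5} draws: std. dev. of the estimate = {spread[n_draws]:.4f}')
```

The simulation noise shrinks as the number of draws grows, which is why the final estimation should use many more draws than the debugging runs.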


# Dynamic Choice Models with Panel Effects

We again analyze the smoking behavior of individuals as a function of their age and the price of tobacco, using
synthetic data. We develop a model that predicts, for every year, the probability that an individual smokes.

## Data

We postulate a true model for the data generation process. It is a mixture of logit models where the utility
associated with "not smoking" is
\begin{equation}
U_{0nt}= \varepsilon_{0nt}
\end{equation}
and the utility associated with "smoking" is
\begin{equation}
U_{1nt}= \beta_{nt} y_{n,t-1} + \beta^p_{nt} P_{t} + c_n + \varepsilon_{1nt},
\end{equation}
where

- $\beta_{nt} = 10$,

- $y_{n,t-1}=1$ if $n$ is smoking at time $t-1$, $0$ otherwise,

- $\beta^p_{nt} = -0.1$,

- $P_t$ is the price of cigarettes at time $t$,

- $c_n$ is an individual-specific constant that captures the a priori, intrinsic attraction of each individual
towards smoking. It is assumed to be normally distributed in the population, with zero mean and standard deviation
50: $N(0, 50^2)$.
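
To make the data generation process concrete, here is a minimal sketch that simulates the choices of one individual under the model above, using only the Python standard library. The price path, the initial state `y_initial`, the seed and the function name `simulate_trajectory` are assumptions made for illustration; they are not taken from the file `smoking55.dat`.

```python
import math
import random


def simulate_trajectory(prices, y_initial, rng,
                        beta_last_year=10.0, coef_price=-0.1,
                        cte_mean=0.0, cte_std=50.0):
    """Simulate one individual's yearly smoking choices (1 = smoking)."""
    # Individual-specific constant c_n, drawn once and kept for all years.
    c_n = rng.gauss(cte_mean, cte_std)
    choices = []
    y_previous = y_initial
    for price in prices:
        # Systematic utility of smoking; the utility of not smoking is 0.
        v_smoking = beta_last_year * y_previous + coef_price * price + c_n
        # Binary logit probability, written to avoid overflow in exp().
        if v_smoking >= 0:
            prob_smoking = 1.0 / (1.0 + math.exp(-v_smoking))
        else:
            prob_smoking = math.exp(v_smoking) / (1.0 + math.exp(v_smoking))
        y_current = 1 if rng.random() < prob_smoking else 0
        choices.append(y_current)
        y_previous = y_current
    return c_n, choices


rng = random.Random(2024)
# Hypothetical price path, one value per simulated year.
prices = [20.0 + year for year in range(11)]
c_n, choices = simulate_trajectory(prices, y_initial=0, rng=rng)
print(f'c_n = {c_n:.2f}, choices = {choices}')
```

Because the standard deviation of $c_n$ is large compared with the other terms of the utility, most simulated individuals either smoke every year or never smoke.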

## True value of the parameters

In [None]:

true_parameters = pd.DataFrame(
    {'Value': [-0.1, 10, 0, 50]},
    index=['coef_price', 'beta_last_year', 'cte_mean', 'cte_std'],
)



## Data

We observe every individual only from the age of 45 to the age of 55.

In [None]:
df = pd.read_table('smoking55.dat', sep=',')
display(df)



Different values for age

In [None]:
print(df['Age'].unique())



The data contains the following columns:

- the age of the individual,
- the price of the cigarettes,
- a variable that is 1 if the individual is smoking, 0 otherwise,
- a variable that is 1 if the individual was smoking last year, 0 otherwise,
- a unique id for each individual,
- a variable that is 1 if the individual was smoking at the age of 45, at the beginning of the observation period.

In [None]:
database = Database('smoking55', df)



Variables

In [None]:
Price = Variable('Price')
Smoking = Variable('Smoking')
LastYear = Variable('LastYear')
Smoking45 = Variable('Smoking45')



We declare that the data set contains panel data.

In [None]:
database.panel('Id')



## Estimation procedure

The following procedure estimates the choice model, and returns the estimated parameters in a Pandas format. If the
model happens to have been already estimated, the estimation results are read from the pickle file and reported.

In [None]:
def estimate(
    the_logprob: Expression, the_name: str, the_database: Database
) -> pd.DataFrame:
    """Estimates the choice model (or reads the estimation results from
    file if recycle is True), and returns the estimated parameters
    in a Pandas format.

    :param the_logprob: formula for the log likelihood function
    :param the_name: name of the model (important for the output files)
    :param the_database: database
    :return: estimated values of the parameters and statistics.
    """
    biogeme = BIOGEME(
        the_database,
        the_logprob,
        number_of_draws=number_of_draws,
    )
    biogeme.modelName = the_name
    results = biogeme.estimate()
    print(results.print_general_statistics())
    pandas_results = results.get_estimated_parameters()
    return pandas_results



## Dynamic model with serial correlation

In the previous laboratory, we estimated a dynamic model with panel effects to account for serial correlation.

In [None]:
cte_mean = Beta('cte_mean', 0, None, None, 0)
cte_std = Beta('cte_std', 1, None, None, 0)
cte = cte_mean + cte_std * bioDraws('agent', 'NORMAL_ANTI')
coef_price = Beta('coef_price', 0, None, None, 0)
beta_last_year = Beta('beta_last_year', 0, None, None, 0)



Model specification

In [None]:
V_s = beta_last_year * LastYear + coef_price * Price + cte
V_ns = 0
V = {0: V_ns, 1: V_s}
obsprob = logit(V, None, Smoking)
condprobIndiv = PanelLikelihoodTrajectory(obsprob)
logprob = log(MonteCarlo(condprobIndiv))
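
The expression `log(MonteCarlo(PanelLikelihoodTrajectory(obsprob)))` can be read as follows: for each draw of the agent effect, multiply the logit probabilities of the observed choices of an individual over all years, then average these products over the draws and take the logarithm. The pure-Python sketch below mimics this calculation for one individual. It is not Biogeme's implementation: the function names are hypothetical, the observations are made up, and plain standard normal draws are used instead of the antithetic draws requested by `'NORMAL_ANTI'`.

```python
import math
import random


def logit_prob_smoking(v_smoking):
    """Binary logit probability of smoking when V_ns = 0 (overflow-safe)."""
    if v_smoking >= 0:
        return 1.0 / (1.0 + math.exp(-v_smoking))
    return math.exp(v_smoking) / (1.0 + math.exp(v_smoking))


def simulated_loglikelihood(observations, beta_last_year, coef_price,
                            cte_mean, cte_std, draws):
    """Log of the simulated likelihood of one individual's trajectory.

    observations: list of (last_year, price, smoking) tuples, one per year.
    draws: standard normal draws, each shared by all years of the individual.
    """
    total = 0.0
    for xi in draws:
        # Same value of the agent effect for every year of the trajectory.
        cte = cte_mean + cte_std * xi
        trajectory_prob = 1.0
        for last_year, price, smoking in observations:
            v_s = beta_last_year * last_year + coef_price * price + cte
            p = logit_prob_smoking(v_s)
            trajectory_prob *= p if smoking == 1 else 1.0 - p
        total += trajectory_prob
    # Average over the draws, then take the logarithm.
    return math.log(total / len(draws))


rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(10)]
# Hypothetical observations: (LastYear, Price, Smoking) for three years.
observations = [(0, 20.0, 1), (1, 21.0, 1), (1, 22.0, 0)]
loglik = simulated_loglikelihood(observations, 1.0, -0.1, 0.0, 1.0, draws)
print(f'simulated log likelihood = {loglik:.4f}')
```

The key point is that each draw is shared by all observations of the same individual, which is what induces serial correlation across years.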


Estimation

In [None]:
r_serial_dynamic = estimate(logprob, 'dynamic_model_serial', database)
display(r_serial_dynamic)


### Comparison of the estimates

In [None]:
summary = pd.concat(
    [
        true_parameters['Value'],
        r_serial_dynamic['Value'],
    ],
    axis=1,
)
summary.columns = ['True', 'Dynamic + serial']
summary = summary.fillna('')
display(summary)


We observe here the "initial condition problem". Although the model specification is correct (it is the same model
as the data generation process), the true values of the parameters are not recovered. This is because the first
observed choice, that is, whether an individual is smoking at the age of 45, is strongly correlated with the agent
effect, which creates endogeneity. One visible consequence is the positive price coefficient.
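
The correlation between the first observed choice and the agent effect can be illustrated with a small simulation of the true model, written with the Python standard library only. The unobserved years before the age of 45, the starting age of 20, the price path and the seed are assumptions made for this sketch; they are not part of the data set.

```python
import math
import random
from statistics import mean

rng = random.Random(45)
smokers_45, non_smokers_45 = [], []
for _ in range(2000):
    c_n = rng.gauss(0.0, 50.0)  # agent effect, true distribution N(0, 50^2)
    y = 0  # assumption: nobody smokes before the hypothetical starting age
    for age in range(20, 46):  # unobserved years, up to the age of 45
        price = 10.0 + 0.5 * (age - 20)  # hypothetical price path
        v = 10.0 * y - 0.1 * price + c_n
        # Overflow-safe binary logit probability of smoking.
        if v >= 0:
            prob = 1.0 / (1.0 + math.exp(-v))
        else:
            prob = math.exp(v) / (1.0 + math.exp(v))
        y = 1 if rng.random() < prob else 0
    # y is now the choice at the age of 45, the first observed one.
    (smokers_45 if y == 1 else non_smokers_45).append(c_n)

print(f'mean c_n, smokers at 45:     {mean(smokers_45):.1f}')
print(f'mean c_n, non-smokers at 45: {mean(non_smokers_45):.1f}')
```

Individuals who smoke at 45 have, on average, a much larger $c_n$ than those who do not, so treating the choice at 45 as exogenous conflicts with the assumption that the agent effect is drawn independently of the explanatory variables.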

Estimate the parameters using Wooldridge's method to address the endogeneity issue.