FE example
Note
PyTwoWay includes two classes for estimating the AKM model and its bias corrections: FEEstimator to estimate without controls, and FEControlEstimator to estimate with controls.

FEEstimator takes advantage of the structure of the AKM model without controls to optimize estimation speed, and is considerably faster than FEControlEstimator for this case. However, the cost of this optimization is that the FEEstimator class is unable to estimate the model with control variables.
Hint
If you want to estimate a one-way fixed effect model, you can fill in the i column with all 1s, and the estimated alpha_i will be the intercept.
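To see why this works, here is a pure-NumPy sketch (the data is simulated purely for illustration; nothing here uses PyTwoWay): with a single worker id, the two-way model y = alpha_i + psi_j collapses to an intercept plus firm effects, so the recovered "worker effect" is the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_firms = 500, 4
j = rng.integers(0, n_firms, size=n)          # firm id of each observation
psi_true = np.array([0.0, 0.5, -0.3, 1.2])    # firm effects, firm 0 normalized to 0
alpha_true = 2.0                              # the common "worker effect" (intercept)
y = alpha_true + psi_true[j] + rng.normal(scale=0.1, size=n)

# Design matrix: an intercept column (every observation shares worker id 1)
# plus dummies for firms 1..3 (firm 0 is the normalized reference)
X = np.column_stack([np.ones(n)] + [(j == k).astype(float) for k in range(1, n_firms)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat = coef[0]                            # recovered intercept, close to 2.0
psi_hat = np.concatenate([[0.0], coef[1:]])    # recovered firm effects
```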
Warning
If you are using a large dataset (100 million observations+), it is recommended to switch your solver to AMG or switch to either the Jacobi or V-Cycle preconditioner. However, regardless of the size of your dataset, it is a good idea to try out the different solvers and preconditioners to see which works best for your particular data.
[1]:
# Add PyTwoWay to system path (do not run this)
# import sys
# sys.path.append('../../..')
Import the PyTwoWay package

Make sure to install it using pip install pytwoway.
[2]:
import pandas as pd
import pytwoway as tw
import bipartitepandas as bpd
FE WITHOUT CONTROLS
First, check out parameter options
Do this by running:

- FE - tw.fe_params().describe_all()
- Cleaning - bpd.clean_params().describe_all()
- Simulating - bpd.sim_params().describe_all()

Alternatively, run x_params().keys() to view all the keys for a parameter dictionary, then x_params().describe(key) to get a description for a single key.
Second, set parameter choices
Hint
If you just want to retrieve the firm and worker effects from the OLS estimation, set 'feonly': True and 'attach_fe_estimates': True in your FE parameters dictionary.

If you want the OLS estimates to be linked to the original firm and worker ids, when initializing your BipartitePandas DataFrame set track_id_changes=True, then run df = bdf.original_ids() after fitting the estimator to extract a Pandas DataFrame with the original ids attached.
Hint
If you want to retrieve the vectors of the firm and worker effects from the OLS estimation, the estimated psi vector (firm effects) can be accessed via the class attribute .psi_hat, and the estimated alpha vector (worker effects) can be accessed via the class attribute .alpha_hat. Because the first firm is normalized to 0, you will need to prepend a 0 to the psi vector for it to include all firm effects.
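For example, with a toy stand-in for .psi_hat (the values below are made up; with a fitted estimator you would use fe_estimator.psi_hat), the full firm-effect vector can be built like this:

```python
import numpy as np

# Toy stand-in for the fitted .psi_hat attribute; the first firm's effect
# is normalized to 0 and is therefore omitted from the estimated vector
psi_hat = np.array([0.7, -0.2, 1.1])

# Prepend the normalized 0 so that entry k is the effect of firm k
psi_full = np.concatenate([[0.0], psi_hat])
# psi_full is now [0.0, 0.7, -0.2, 1.1]
```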
Note
We set copy=False in clean_params to avoid unnecessary copies (although this may modify the original dataframe).
[3]:
# FE
fe_params = tw.fe_params(
{
'he': True,
'ncore': 8
}
)
# Cleaning
clean_params = bpd.clean_params(
{
'connectedness': 'leave_out_spell',
'collapse_at_connectedness_measure': True,
'drop_single_stayers': True,
'drop_returns': 'returners',
'copy': False
}
)
# Simulating
sim_params = bpd.sim_params(
{
'n_workers': 1000,
'firm_size': 5,
'alpha_sig': 2, 'w_sig': 2,
'c_sort': 1.5, 'c_netw': 1.5,
'p_move': 0.1
}
)
Third, extract data (we simulate for the example)
BipartitePandas contains the class SimBipartite which we use here to simulate a bipartite network. If you have your own data, you can import it during this step. Load it as a Pandas DataFrame and then convert it into a BipartitePandas DataFrame in the next step.
[4]:
sim_data = bpd.SimBipartite(sim_params).simulate()
Fourth, prepare data
This is exactly how you should prepare real data prior to running the FE estimator.

- First, we convert the data into a BipartitePandas DataFrame
- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set collapse_at_connectedness_measure=True.

Further details on BipartitePandas can be found in the package documentation, available here.
Note
Since leave-one-out connectedness is not maintained after data is collapsed at the spell/match level, if you set collapse_at_connectedness_measure=False, then data must be cleaned WITHOUT taking the leave-one-out set, collapsed at the spell/match level, and then finally the largest leave-one-out connected set can be computed.
[5]:
# Convert into BipartitePandas DataFrame
bdf = bpd.BipartiteDataFrame(sim_data)
# Clean and collapse
bdf = bdf.clean(clean_params)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
checking required columns and datatypes
sorting rows
generating 'm' column
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
making 'j' ids contiguous
sorting columns
resetting index
Fifth, initialize and run the estimator
[ ]:
# Initialize FE estimator
fe_estimator = tw.FEEstimator(bdf, fe_params)
# Fit FE estimator
fe_estimator.fit()
Finally, investigate the results
Results correspond to:

- y: income (outcome) column
- eps: residual
- psi: firm effects
- alpha: worker effects
- fe: plug-in (biased) estimate
- ho: homoskedastic-corrected estimate
- he: heteroskedastic-corrected estimate
Warning
If you notice variability between estimations for the HO- and HE-corrected results, this is because there are approximations in the estimation that depend on randomization. Increasing the number of draws for the approximations (ndraw_trace_sigma_2 and ndraw_trace_ho for the HO correction, and ndraw_trace_he and ndraw_lev_he for the HE correction) will increase the stability of the results between estimations.
Note
The particular variance that is estimated is controlled through the parameter 'Q_var' and the covariance that is estimated is controlled through the parameter 'Q_cov'.

By default, the variance is var(psi) and the covariance is cov(psi, alpha). The default estimates don’t include var(alpha), but if you don’t include controls, var(alpha) can be computed as the residual from var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps).
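As a concrete illustration of this decomposition, the plug-in var(alpha) implied by the summary below can be backed out by hand (pure Python; the numbers are copied from the reported summary, and the same arithmetic applies to the _ho and _he columns):

```python
# Values copied from the fe_estimator.summary output below
summary = {
    'var(y)': 6.52609861225365,
    'var(eps)_fe': 0.29744517651253743,
    'var(psi)_fe': 2.3121637422731225,
    'cov(psi, alpha)_fe': -0.24878106304549624,
}

# var(alpha) as the residual of the variance decomposition:
# var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps)
var_alpha_fe = (
    summary['var(y)']
    - summary['var(psi)_fe']
    - 2 * summary['cov(psi, alpha)_fe']
    - summary['var(eps)_fe']
)
print(var_alpha_fe)  # roughly 4.414
```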
[7]:
fe_estimator.summary
[7]:
{'var(y)': 6.52609861225365,
'var(eps)_fe': 0.29744517651253743,
'var(eps)_ho': 3.606426490650048,
'var(eps)_he': 3.494319592157181,
'var(psi)_fe': 2.3121637422731225,
'var(psi)_ho': 0.6031785843717348,
'var(psi)_he': 0.8520073462916467,
'cov(psi, alpha)_fe': -0.24878106304549624,
'cov(psi, alpha)_ho': 1.3194284319766338,
'cov(psi, alpha)_he': 0.9519246905053926}
FE WITH CONTROLS
Hint
Check out how to add custom columns to your BipartitePandas DataFrame here! If you don’t add custom columns properly, they may not be handled as you want and/or expect during data cleaning and estimation!
First, check out parameter options
Do this by running:

- FE with controls - tw.fecontrol_params().describe_all()
- Cleaning - bpd.clean_params().describe_all()
- Simulating - tw.sim_blm_params().describe_all(), tw.sim_categorical_control_params().describe_all(), and tw.sim_continuous_control_params().describe_all()

Alternatively, run x_params().keys() to view all the keys for a parameter dictionary, then x_params().describe(key) to get a description for a single key.
Second, set parameter choices
Hint
If you just want to retrieve the firm and worker effects from the OLS estimation, set 'feonly': True and 'attach_fe_estimates': True in your FE parameters dictionary.

If you want the OLS estimates to be linked to the original firm and worker ids, when initializing your BipartitePandas DataFrame set track_id_changes=True, then run df = bdf.original_ids() after fitting the estimator to extract a Pandas DataFrame with the original ids attached.
Hint
If you want to retrieve the estimated parameter vectors from the OLS estimation, each covariate’s parameter vector can be accessed via the class attribute .gamma_hat_dict. For categorical variables, the normalized type will automatically be included in this vector (with value 0).
Note
We control which variances and covariances to estimate through the parameters Q_var and Q_cov. Multiple variances/covariances can be estimated by setting Q_var and/or Q_cov to be a list of variances/covariances, and the variances/covariances of sums of covariates can be estimated by inputting a list of the covariates to sum.
Note
We set copy=False in clean_params to avoid unnecessary copies (although this may modify the original dataframe).
[8]:
# FE
fecontrol_params = tw.fecontrol_params(
{
'he': True,
'categorical_controls': 'cat_control',
'continuous_controls': 'cts_control',
'Q_var': [
tw.Q.VarCovariate('psi'),
tw.Q.VarCovariate('alpha'),
tw.Q.VarCovariate('cat_control'),
tw.Q.VarCovariate('cts_control'),
tw.Q.VarCovariate(['psi', 'alpha']),
tw.Q.VarCovariate(['cat_control', 'cts_control'])
],
'Q_cov': [
tw.Q.CovCovariate('psi', 'alpha'),
tw.Q.CovCovariate('cat_control', 'cts_control'),
tw.Q.CovCovariate(['psi', 'alpha'], ['cat_control', 'cts_control'])
],
'ncore': 8
}
)
# Cleaning
clean_params = bpd.clean_params(
{
'connectedness': 'leave_out_spell',
'collapse_at_connectedness_measure': True,
'drop_single_stayers': True,
'drop_returns': 'returners',
'copy': False
}
)
# Simulating
nl = 3
nk = 4
n_control = 2
sim_cat_params = tw.sim_categorical_control_params({
'n': n_control,
'worker_type_interaction': False,
'stationary_A': True, 'stationary_S': True
})
sim_cts_params = tw.sim_continuous_control_params({
'worker_type_interaction': False,
'stationary_A': True, 'stationary_S': True
})
sim_blm_params = tw.sim_blm_params({
'nl': nl,
'nk': nk,
'categorical_controls': {
'cat_control': sim_cat_params
},
'continuous_controls': {
'cts_control': sim_cts_params
},
'stationary_A': True, 'stationary_S': True,
'linear_additive': True
})
Third, extract data (we simulate for the example)
PyTwoWay contains the class SimBLM which we use here to simulate a bipartite network with controls. If you have your own data, you can import it during this step. Load it as a Pandas DataFrame and then convert it into a BipartitePandas DataFrame in the next step.
[9]:
blm_true = tw.SimBLM(sim_blm_params)
sim_data = blm_true.simulate(return_parameters=False)
jdata, sdata = sim_data['jdata'], sim_data['sdata']
sim_data = pd.concat([jdata, sdata]).rename(
    {'g': 'j', 'j': 'g'}, axis=1, allow_optional=True, allow_required=True
)[
    ['i', 'j1', 'j2', 'y1', 'y2',
     'cat_control1', 'cat_control2', 'cts_control1', 'cts_control2']
].construct_artificial_time(is_sorted=True, copy=False)
Fourth, prepare data
This is exactly how you should prepare real data prior to running the FE estimator.

- First, we convert the data into a BipartitePandas DataFrame
- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set collapse_at_connectedness_measure=True.
- Third, we convert the data to long format, since the simulated data is in event study format

Further details on BipartitePandas can be found in the package documentation, available here.
Note
Since leave-one-out connectedness is not maintained after data is collapsed at the spell/match level, if you set collapse_at_connectedness_measure=False, then data must be cleaned WITHOUT taking the leave-one-out set, collapsed at the spell/match level, and then finally the largest leave-one-out connected set can be computed.
[10]:
# Convert into BipartitePandas DataFrame
bdf = bpd.BipartiteDataFrame(sim_data)
# Clean and collapse
bdf = bdf.clean(clean_params)
# Convert to long format
bdf = bdf.to_long(is_sorted=True, copy=False)
checking required columns and datatypes
converting data to long format
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
checking required columns and datatypes
sorting rows
generating 'm' column
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
sorting columns
resetting index
converting data back to event study format
Fifth, initialize and run the estimator
[ ]:
# Initialize FE estimator
fe_estimator = tw.FEControlEstimator(bdf, fecontrol_params)
# Fit FE estimator
fe_estimator.fit()
Finally, investigate the results
Results correspond to:

- y: income (outcome) column
- eps: residual
- psi: firm effects
- alpha: worker effects
- cat_control: categorical control
- cts_control: continuous control
- fe: plug-in (biased) estimate
- ho: homoskedastic-corrected estimate
- he: heteroskedastic-corrected estimate
Warning
If you notice variability between estimations for the HO- and HE-corrected results, this is because there are approximations in the estimation that depend on randomization. Increasing the number of draws for the approximations (ndraw_trace_sigma_2 and ndraw_trace_ho for the HO correction, and ndraw_trace_he and ndraw_lev_he for the HE correction) will increase the stability of the results between estimations.
Note
The particular variance that is estimated is controlled through the parameter 'Q_var' and the covariance that is estimated is controlled through the parameter 'Q_cov'.

By default, the variance is var(psi) and the covariance is cov(psi, alpha). The default estimates don’t include var(alpha), but if you don’t include controls, var(alpha) can be computed as the residual from var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps). In this example, var(alpha) is instead estimated directly, because it was included in the Q_var list.
[12]:
fe_estimator.summary
[12]:
{'var(y)': 0.680298596161059,
'var(eps)_fe': 0.2156307297689422,
'var(eps)_ho': 0.44231944567988146,
'var(eps)_he': 0.43588153776248767,
'var(alpha)_fe': 0.20701243646676126,
'var(alpha)_ho': -0.018774697829931963,
'var(alpha)_he': -0.023690618333218116,
'var(cat_control + cts_control)_fe': 0.32694285092523184,
'var(cat_control + cts_control)_ho': 0.311843602331667,
'var(cat_control + cts_control)_he': 0.3265315176982076,
'var(cat_control)_fe': 0.149689434536445,
'var(cat_control)_ho': 0.1650575130340637,
'var(cat_control)_he': 0.11870237720576068,
'var(cts_control)_fe': 0.19348330565838853,
'var(cts_control)_ho': 0.1624089944895367,
'var(cts_control)_he': 0.22477306005469888,
'var(psi + alpha)_fe': 0.22607769787400603,
'var(psi + alpha)_ho': -0.005739238232576577,
'var(psi + alpha)_he': 0.002075155760591446,
'var(psi)_fe': 0.04872564142089923,
'var(psi)_ho': 0.04919870321819487,
'var(psi)_he': 0.045543836373029785,
'cov(cat_control, cts_control)_fe': -0.00811494463480081,
'cov(cat_control, cts_control)_ho': -0.00875862700372186,
'cov(cat_control, cts_control)_he': -0.008041963642879787,
'cov(psi + alpha, cat_control + cts_control)_fe': -0.04417634129890746,
'cov(psi + alpha, cat_control + cts_control)_ho': -0.03797455238506746,
'cov(psi + alpha, cat_control + cts_control)_he': -0.03803269383182242,
'cov(psi, alpha)_fe': -0.01483019000682723,
'cov(psi, alpha)_ho': -0.021436896763781826,
'cov(psi, alpha)_he': -0.009319726497861622}