FE example
Note
PyTwoWay includes two classes for estimating the AKM model and its bias corrections: FEEstimator to estimate without controls, and FEControlEstimator to estimate with controls.

FEEstimator takes advantage of the structure of the AKM model without controls to optimize estimation speed, and is considerably faster than FEControlEstimator for this case. However, the cost of this optimization is that the FEEstimator class is unable to estimate the model with control variables.
Hint
If you want to estimate a one-way fixed effect model, you can fill in the i column with all 1s, and the estimated alpha_i will be the intercept.
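To see why this works, here is a pure-NumPy sketch (the data is simulated purely for illustration; nothing here uses PyTwoWay): with a single worker id, the two-way model y = alpha_i + psi_j collapses to an intercept plus firm effects, so the recovered "worker effect" is the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_firms = 500, 4
j = rng.integers(0, n_firms, size=n)          # firm id of each observation
psi_true = np.array([0.0, 0.5, -0.3, 1.2])    # firm effects, firm 0 normalized to 0
alpha_true = 2.0                              # the common "worker effect" (intercept)
y = alpha_true + psi_true[j] + rng.normal(scale=0.1, size=n)

# Design matrix: an intercept column (every observation shares worker id 1)
# plus dummies for firms 1..3 (firm 0 is the normalized reference)
X = np.column_stack([np.ones(n)] + [(j == k).astype(float) for k in range(1, n_firms)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat = coef[0]                            # recovered intercept, close to 2.0
psi_hat = np.concatenate([[0.0], coef[1:]])    # recovered firm effects
```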
Warning
If you are using a large dataset (100 million observations+), it is recommended to switch your solver to AMG or switch to either the Jacobi or V-Cycle preconditioner. However, regardless of the size of your dataset, it is a good idea to try out the different solvers and preconditioners to see which works best for your particular data.
[1]:
# Add PyTwoWay to system path (do not run this)
# import sys
# sys.path.append('../../..')
Import the PyTwoWay package

Make sure to install it using pip install pytwoway.
[2]:
import pandas as pd
import pytwoway as tw
import bipartitepandas as bpd
FE WITHOUT CONTROLS
First, check out parameter options
Do this by running:

- FE - tw.fe_params().describe_all()
- Cleaning - bpd.clean_params().describe_all()
- Simulating - bpd.sim_params().describe_all()

Alternatively, run x_params().keys() to view all the keys for a parameter dictionary, then x_params().describe(key) to get a description for a single key.
Second, set parameter choices
Hint
If you just want to retrieve the firm and worker effects from the OLS estimation, set 'feonly': True and 'attach_fe_estimates': True in your FE parameters dictionary.

If you want the OLS estimates to be linked to the original firm and worker ids, when initializing your BipartitePandas DataFrame set track_id_changes=True, then run df = bdf.original_ids() after fitting the estimator to extract a Pandas DataFrame with the original ids attached.
Hint
If you want to retrieve the vectors of the firm and worker effects from the OLS estimation, the estimated psi vector (firm effects) can be accessed via the class attribute .psi_hat, and the estimated alpha vector (worker effects) can be accessed via the class attribute .alpha_hat. Because the first firm is normalized to 0, you will need to prepend a 0 to the psi vector for it to include all firm effects.
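For example, with a toy stand-in for .psi_hat (the values below are made up; with a fitted estimator you would use fe_estimator.psi_hat), the full firm-effect vector can be built like this:

```python
import numpy as np

# Toy stand-in for the fitted .psi_hat attribute; the first firm's effect
# is normalized to 0 and is therefore omitted from the estimated vector
psi_hat = np.array([0.7, -0.2, 1.1])

# Prepend the normalized 0 so that entry k is the effect of firm k
psi_full = np.concatenate([[0.0], psi_hat])
# psi_full is now [0.0, 0.7, -0.2, 1.1]
```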
Note
We set copy=False in clean_params to avoid unnecessary copies (although this may modify the original dataframe).
[3]:
# FE
fe_params = tw.fe_params(
{
'he': True,
'ncore': 8
}
)
# Cleaning
clean_params = bpd.clean_params(
{
'connectedness': 'leave_out_spell',
'collapse_at_connectedness_measure': True,
'drop_single_stayers': True,
'drop_returns': 'returners',
'copy': False
}
)
# Simulating
sim_params = bpd.sim_params(
{
'n_workers': 1000,
'firm_size': 5,
'alpha_sig': 2, 'w_sig': 2,
'c_sort': 1.5, 'c_netw': 1.5,
'p_move': 0.1
}
)
Third, extract data (we simulate for the example)
BipartitePandas contains the class SimBipartite which we use here to simulate a bipartite network. If you have your own data, you can import it during this step. Load it as a Pandas DataFrame and then convert it into a BipartitePandas DataFrame in the next step.
[4]:
sim_data = bpd.SimBipartite(sim_params).simulate()
Fourth, prepare data
This is exactly how you should prepare real data prior to running the FE estimator.

- First, we convert the data into a BipartitePandas DataFrame
- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set collapse_at_connectedness_measure=True.

Further details on BipartitePandas can be found in the package documentation, available here.
Note
Since leave-one-out connectedness is not maintained after data is collapsed at the spell/match level, if you set collapse_at_connectedness_measure=False, then data must be cleaned WITHOUT taking the leave-one-out set, collapsed at the spell/match level, and then finally the largest leave-one-out connected set can be computed.
[5]:
# Convert into BipartitePandas DataFrame
bdf = bpd.BipartiteDataFrame(sim_data)
# Clean and collapse
bdf = bdf.clean(clean_params)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
checking required columns and datatypes
sorting rows
generating 'm' column
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
making 'j' ids contiguous
sorting columns
resetting index
Fifth, initialize and run the estimator
[ ]:
# Initialize FE estimator
fe_estimator = tw.FEEstimator(bdf, fe_params)
# Fit FE estimator
fe_estimator.fit()
Finally, investigate the results
Results correspond to:

- y: income (outcome) column
- eps: residual
- psi: firm effects
- alpha: worker effects
- fe: plug-in (biased) estimate
- ho: homoskedastic-corrected estimate
- he: heteroskedastic-corrected estimate
Warning
If you notice variability between estimations for the HO- and HE-corrected results, this is because there are approximations in the estimation that depend on randomization. Increasing the number of draws for the approximations (ndraw_trace_sigma_2 and ndraw_trace_ho for the HO correction, and ndraw_trace_he and ndraw_lev_he for the HE correction) will increase the stability of the results between estimations.
Note
The particular variance that is estimated is controlled through the parameter 'Q_var' and the covariance that is estimated is controlled through the parameter 'Q_cov'.

By default, the variance is var(psi) and the covariance is cov(psi, alpha). The default estimates don’t include var(alpha), but if you don’t include controls, var(alpha) can be computed as the residual from var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps).
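As a concrete illustration of this decomposition, the plug-in var(alpha) implied by the summary below can be backed out by hand (pure Python; the numbers are copied from the reported summary, and the same arithmetic applies to the _ho and _he columns):

```python
# Values copied from the fe_estimator.summary output below
summary = {
    'var(y)': 6.52609861225365,
    'var(eps)_fe': 0.29744517651253743,
    'var(psi)_fe': 2.3121637422731225,
    'cov(psi, alpha)_fe': -0.24878106304549624,
}

# var(alpha) as the residual of the variance decomposition:
# var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps)
var_alpha_fe = (
    summary['var(y)']
    - summary['var(psi)_fe']
    - 2 * summary['cov(psi, alpha)_fe']
    - summary['var(eps)_fe']
)
print(var_alpha_fe)  # roughly 4.414
```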
[7]:
fe_estimator.summary
[7]:
{'var(y)': 6.52609861225365,
'var(eps)_fe': 0.29744517651253743,
'var(eps)_ho': 3.606426490650048,
'var(eps)_he': 3.494319592157181,
'var(psi)_fe': 2.3121637422731225,
'var(psi)_ho': 0.6031785843717348,
'var(psi)_he': 0.8520073462916467,
'cov(psi, alpha)_fe': -0.24878106304549624,
'cov(psi, alpha)_ho': 1.3194284319766338,
'cov(psi, alpha)_he': 0.9519246905053926}
FE WITH CONTROLS
Hint
Check out how to add custom columns to your BipartitePandas DataFrame here! If you don’t add custom columns properly, they may not be handled as you want and/or expect during data cleaning and estimation!
First, check out parameter options
Do this by running:

- FE with controls - tw.fecontrol_params().describe_all()
- Cleaning - bpd.clean_params().describe_all()
- Simulating - tw.sim_blm_params().describe_all(), tw.sim_categorical_control_params().describe_all(), and tw.sim_continuous_control_params().describe_all()

Alternatively, run x_params().keys() to view all the keys for a parameter dictionary, then x_params().describe(key) to get a description for a single key.
Second, set parameter choices
Hint
If you just want to retrieve the firm and worker effects from the OLS estimation, set 'feonly': True and 'attach_fe_estimates': True in your FE parameters dictionary.

If you want the OLS estimates to be linked to the original firm and worker ids, when initializing your BipartitePandas DataFrame set track_id_changes=True, then run df = bdf.original_ids() after fitting the estimator to extract a Pandas DataFrame with the original ids attached.
Hint
If you want to retrieve the estimated parameter vectors from the OLS estimation, each covariate’s parameter vector can be accessed via the class attribute .gamma_hat_dict. For categorical variables, the normalized type will automatically be included in this vector (with value 0).
Note
We control which variances and covariances to estimate through the parameters Q_var and Q_cov. Multiple variances/covariances can be estimated by setting Q_var and/or Q_cov to be a list of variances/covariances, and the variances/covariances of sums of covariates can be estimated by inputting a list of the covariates to sum.
Note
We set copy=False in clean_params to avoid unnecessary copies (although this may modify the original dataframe).
[8]:
# FE
fecontrol_params = tw.fecontrol_params(
{
'he': True,
'categorical_controls': 'cat_control',
'continuous_controls': 'cts_control',
'Q_var': [
tw.Q.VarCovariate('psi'),
tw.Q.VarCovariate('alpha'),
tw.Q.VarCovariate('cat_control'),
tw.Q.VarCovariate('cts_control'),
tw.Q.VarCovariate(['psi', 'alpha']),
tw.Q.VarCovariate(['cat_control', 'cts_control'])
],
'Q_cov': [
tw.Q.CovCovariate('psi', 'alpha'),
tw.Q.CovCovariate('cat_control', 'cts_control'),
tw.Q.CovCovariate(['psi', 'alpha'], ['cat_control', 'cts_control'])
],
'ncore': 8
}
)
# Cleaning
clean_params = bpd.clean_params(
{
'connectedness': 'leave_out_spell',
'collapse_at_connectedness_measure': True,
'drop_single_stayers': True,
'drop_returns': 'returners',
'copy': False
}
)
# Simulating
nl = 3
nk = 4
n_control = 2
sim_cat_params = tw.sim_categorical_control_params({
'n': n_control,
'worker_type_interaction': False,
'stationary_A': True, 'stationary_S': True
})
sim_cts_params = tw.sim_continuous_control_params({
'worker_type_interaction': False,
'stationary_A': True, 'stationary_S': True
})
sim_blm_params = tw.sim_blm_params({
'nl': nl,
'nk': nk,
'categorical_controls': {
'cat_control': sim_cat_params
},
'continuous_controls': {
'cts_control': sim_cts_params
},
'stationary_A': True, 'stationary_S': True,
'linear_additive': True
})
Third, extract data (we simulate for the example)
PyTwoWay contains the class SimBLM which we use here to simulate a bipartite network with controls. If you have your own data, you can import it during this step. Load it as a Pandas DataFrame and then convert it into a BipartitePandas DataFrame in the next step.
[9]:
blm_true = tw.SimBLM(sim_blm_params)
sim_data = blm_true.simulate(return_parameters=False)
jdata, sdata = sim_data['jdata'], sim_data['sdata']
sim_data = pd.concat([jdata, sdata]).rename(
    {'g': 'j', 'j': 'g'}, axis=1, allow_optional=True, allow_required=True
)[
    ['i', 'j1', 'j2', 'y1', 'y2',
     'cat_control1', 'cat_control2', 'cts_control1', 'cts_control2']
].construct_artificial_time(is_sorted=True, copy=False)
Fourth, prepare data
This is exactly how you should prepare real data prior to running the FE estimator.

- First, we convert the data into a BipartitePandas DataFrame
- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set collapse_at_connectedness_measure=True.
- Third, we convert the data to long format, since the simulated data is in event study format

Further details on BipartitePandas can be found in the package documentation, available here.
Note
Since leave-one-out connectedness is not maintained after data is collapsed at the spell/match level, if you set collapse_at_connectedness_measure=False, then data must be cleaned WITHOUT taking the leave-one-out set, collapsed at the spell/match level, and then finally the largest leave-one-out connected set can be computed.
[10]:
# Convert into BipartitePandas DataFrame
bdf = bpd.BipartiteDataFrame(sim_data)
# Clean and collapse
bdf = bdf.clean(clean_params)
# Convert to long format
bdf = bdf.to_long(is_sorted=True, copy=False)
checking required columns and datatypes
converting data to long format
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
checking required columns and datatypes
sorting rows
generating 'm' column
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
sorting columns
resetting index
converting data back to event study format
Fifth, initialize and run the estimator
[ ]:
# Initialize FE estimator
fe_estimator = tw.FEControlEstimator(bdf, fecontrol_params)
# Fit FE estimator
fe_estimator.fit()
Finally, investigate the results
Results correspond to:

- y: income (outcome) column
- eps: residual
- psi: firm effects
- alpha: worker effects
- cat_control: categorical control
- cts_control: continuous control
- fe: plug-in (biased) estimate
- ho: homoskedastic-corrected estimate
- he: heteroskedastic-corrected estimate
Warning
If you notice variability between estimations for the HO- and HE-corrected results, this is because there are approximations in the estimation that depend on randomization. Increasing the number of draws for the approximations (ndraw_trace_sigma_2 and ndraw_trace_ho for the HO correction, and ndraw_trace_he and ndraw_lev_he for the HE correction) will increase the stability of the results between estimations.
Note
The particular variance that is estimated is controlled through the parameter 'Q_var' and the covariance that is estimated is controlled through the parameter 'Q_cov'.

By default, the variance is var(psi) and the covariance is cov(psi, alpha). The default estimates don’t include var(alpha), but if you don’t include controls, var(alpha) can be computed as the residual from var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps). In this example, var(alpha) is instead estimated directly, because it was included in the Q_var list.
[12]:
fe_estimator.summary
[12]:
{'var(y)': 0.680298596161059,
'var(eps)_fe': 0.2156307297689422,
'var(eps)_ho': 0.44231944567988146,
'var(eps)_he': 0.43588153776248767,
'var(alpha)_fe': 0.20701243646676126,
'var(alpha)_ho': -0.018774697829931963,
'var(alpha)_he': -0.023690618333218116,
'var(cat_control + cts_control)_fe': 0.32694285092523184,
'var(cat_control + cts_control)_ho': 0.311843602331667,
'var(cat_control + cts_control)_he': 0.3265315176982076,
'var(cat_control)_fe': 0.149689434536445,
'var(cat_control)_ho': 0.1650575130340637,
'var(cat_control)_he': 0.11870237720576068,
'var(cts_control)_fe': 0.19348330565838853,
'var(cts_control)_ho': 0.1624089944895367,
'var(cts_control)_he': 0.22477306005469888,
'var(psi + alpha)_fe': 0.22607769787400603,
'var(psi + alpha)_ho': -0.005739238232576577,
'var(psi + alpha)_he': 0.002075155760591446,
'var(psi)_fe': 0.04872564142089923,
'var(psi)_ho': 0.04919870321819487,
'var(psi)_he': 0.045543836373029785,
'cov(cat_control, cts_control)_fe': -0.00811494463480081,
'cov(cat_control, cts_control)_ho': -0.00875862700372186,
'cov(cat_control, cts_control)_he': -0.008041963642879787,
'cov(psi + alpha, cat_control + cts_control)_fe': -0.04417634129890746,
'cov(psi + alpha, cat_control + cts_control)_ho': -0.03797455238506746,
'cov(psi + alpha, cat_control + cts_control)_he': -0.03803269383182242,
'cov(psi, alpha)_fe': -0.01483019000682723,
'cov(psi, alpha)_ho': -0.021436896763781826,
'cov(psi, alpha)_he': -0.009319726497861622}