\n",
"\n",
"**Note**\n",
"\n",
"`PyTwoWay` includes two classes for estimating the AKM model and its bias corrections: `FEEstimator` to estimate without controls, and `FEControlEstimator` to estimate with controls.\n",
"\n",
"`FEEstimator` takes advantage of the structure of the AKM model without controls to optimize estimation speed, and is considerably faster than `FEControlEstimator` for this purpose. However, the cost of this optimization is that the `FEEstimator` class is unable to estimate the model with control variables.\n",
"\n",
"**Hint**\n",
"\n",
"If you want to estimate a one-way fixed effect model, you can fill in the `i` column with all `1`s, and the estimated `alpha_i` will be the intercept.\n",
"\n",
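"For example, a minimal sketch with a toy Pandas DataFrame (hypothetical values, before conversion to a BipartitePandas DataFrame):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy worker-firm panel (made-up values)\n",
"df = pd.DataFrame({\n",
"    'j': [0, 1, 0, 2],          # firm ids\n",
"    'y': [1.0, 2.0, 1.5, 3.0],  # wages\n",
"    't': [0, 1, 2, 3]           # time periods\n",
"})\n",
"\n",
"# Collapse the worker dimension to a single pseudo-worker,\n",
"# so the estimated alpha_i acts as the intercept\n",
"df['i'] = 1\n",
"```\n",
"\n",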
"**Warning**\n",
"\n",
"If you are using a large dataset (100+ million observations), it is recommended to switch your solver to `AMG` or to switch to either the `Jacobi` or `V-Cycle` preconditioner. Regardless of the size of your dataset, however, it is a good idea to try out the different solvers and preconditioners to see which works best for your particular data.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Add PyTwoWay to system path (do not run this)\n",
"# import sys\n",
"# sys.path.append('../../..')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import the PyTwoWay package\n",
"\n",
"Make sure to install it using `pip install pytwoway`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T23:38:19.123052Z",
"start_time": "2021-01-15T23:38:18.565950Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pytwoway as tw\n",
"import bipartitepandas as bpd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## FE WITHOUT CONTROLS\n",
"\n",
"## First, check out parameter options\n",
"\n",
"Do this by running:\n",
"\n",
"- FE - `tw.fe_params().describe_all()`\n",
"\n",
"- Cleaning - `bpd.clean_params().describe_all()`\n",
"\n",
"- Simulating - `bpd.sim_params().describe_all()`\n",
"\n",
"Alternatively, run `x_params().keys()` to view all the keys for a parameter dictionary, then `x_params().describe(key)` to get a description for a single key."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Second, set parameter choices\n",
"\n",
"**Hint**\n",
"\n",
"If you just want to retrieve the firm and worker effects from the OLS estimation, set `'feonly': True` and `'attach_fe_estimates': True` in your FE parameters dictionary.\n",
"\n",
"If you want the OLS estimates to be linked to the original firm and worker ids, when initializing your BipartitePandas DataFrame set `track_id_changes=True`, then run `df = bdf.original_ids()` after fitting the estimator to extract a Pandas DataFrame with the original ids attached.\n",
"\n",
"**Hint**\n",
"\n",
"If you want to retrieve the vectors of the firm and worker effects from the OLS estimation, the estimated `psi` vector (firm effects) can be accessed via the class attribute `.psi_hat`, and the estimated `alpha` vector (worker effects) can be accessed via the class attribute `.alpha_hat`. Because the first firm is normalized to `0`, you will need to prepend a `0` to the `psi` vector for it to include all firm effects.\n",
"\n",
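"A sketch of prepending the normalized firm effect (made-up numbers, plain Python):\n",
"\n",
"```python\n",
"# Hypothetical estimated firm effects for firms 1..3\n",
"# (firm 0 is the normalized firm, so it is missing here)\n",
"psi_hat = [0.8, -0.2, 1.1]\n",
"\n",
"# Prepend the normalized firm's 0 so the vector covers all firms\n",
"psi_full = [0.0] + list(psi_hat)\n",
"print(psi_full)  # [0.0, 0.8, -0.2, 1.1]\n",
"```\n",
"\n",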
"**Note**\n",
"\n",
"We set `copy=False` in `clean_params` to avoid unnecessary copies (although this may modify the original dataframe).\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# FE\n",
"fe_params = tw.fe_params(\n",
" {\n",
" 'he': True,\n",
" 'ncore': 8\n",
" }\n",
")\n",
"# Cleaning\n",
"clean_params = bpd.clean_params(\n",
" {\n",
" 'connectedness': 'leave_out_spell',\n",
" 'collapse_at_connectedness_measure': True,\n",
" 'drop_single_stayers': True,\n",
" 'drop_returns': 'returners',\n",
" 'copy': False\n",
" }\n",
")\n",
"# Simulating\n",
"sim_params = bpd.sim_params(\n",
" {\n",
" 'n_workers': 1000,\n",
" 'firm_size': 5,\n",
" 'alpha_sig': 2, 'w_sig': 2,\n",
" 'c_sort': 1.5, 'c_netw': 1.5,\n",
" 'p_move': 0.1\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Third, extract data (we simulate for the example)\n",
"\n",
"`BipartitePandas` contains the class `SimBipartite` which we use here to simulate a bipartite network. If you have your own data, you can import it during this step. Load it as a `Pandas DataFrame` and then convert it into a `BipartitePandas DataFrame` in the next step."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"sim_data = bpd.SimBipartite(sim_params).simulate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fourth, prepare data\n",
"\n",
"This is exactly how you should prepare real data prior to running the FE estimator.\n",
"\n",
"- First, we convert the data into a `BipartitePandas DataFrame`\n",
"\n",
"- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set `collapse_at_connectedness_measure=True`.\n",
"\n",
"Further details on `BipartitePandas` can be found in the package documentation, available [here](https://tlamadon.github.io/bipartitepandas/).\n",
"\n",
"**Note**\n",
"\n",
"Since leave-one-out connectedness is not maintained once the data is collapsed at the spell/match level, if you set `collapse_at_connectedness_measure=False`, the data must first be cleaned WITHOUT taking the leave-one-out set, then collapsed at the spell/match level, and only then can the largest leave-one-out connected set be computed.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"checking required columns and datatypes\n",
"sorting rows\n",
"dropping NaN observations\n",
"generating 'm' column\n",
"keeping highest paying job for i-t (worker-year) duplicates (how='max')\n",
"dropping workers who leave a firm then return to it (how='returners')\n",
"making 'i' ids contiguous\n",
"making 'j' ids contiguous\n",
"computing largest connected set (how=None)\n",
"sorting columns\n",
"resetting index\n",
"checking required columns and datatypes\n",
"sorting rows\n",
"generating 'm' column\n",
"computing largest connected set (how='leave_out_observation')\n",
"making 'i' ids contiguous\n",
"making 'j' ids contiguous\n",
"sorting columns\n",
"resetting index\n"
]
}
],
"source": [
"# Convert into BipartitePandas DataFrame\n",
"bdf = bpd.BipartiteDataFrame(sim_data)\n",
"# Clean and collapse\n",
"bdf = bdf.clean(clean_params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fifth, initialize and run the estimator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize FE estimator\n",
"fe_estimator = tw.FEEstimator(bdf, fe_params)\n",
"# Fit FE estimator\n",
"fe_estimator.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finally, investigate the results\n",
"\n",
"Results correspond to:\n",
"\n",
"- `y`: income (outcome) column\n",
"- `eps`: residual\n",
"- `psi`: firm effects\n",
"- `alpha`: worker effects\n",
"- `fe`: plug-in (biased) estimate\n",
"- `ho`: homoskedastic-corrected estimate\n",
"- `he`: heteroskedastic-corrected estimate\n",
"\n",
"**Warning**\n",
"\n",
"If you notice variability between estimations for the HO- and HE-corrected results, this is because the estimation relies on approximations that depend on randomization. Increasing the number of draws for the approximations (`ndraw_trace_sigma_2` and `ndraw_trace_ho` for the HO correction, and `ndraw_trace_he` and `ndraw_lev_he` for the HE correction) will increase the stability of the results between estimations.\n",
"\n",
"**Note**\n",
"\n",
"The particular variance that is estimated is controlled through the parameter `'Q_var'` and the covariance that is estimated is controlled through the parameter `'Q_cov'`.\n",
"\n",
"By default, the variance is `var(psi)` and the covariance is `cov(psi, alpha)`. The default estimates don't include `var(alpha)`, but if you don't include controls, `var(alpha)` can be computed as the residual from `var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps)`.\n",
"\n",
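"For instance, a back-of-the-envelope sketch of this residual computation (made-up variance numbers, not real estimates):\n",
"\n",
"```python\n",
"# Hypothetical estimated moments\n",
"var_y = 8.0      # var(y)\n",
"var_psi = 1.5    # var(psi)\n",
"cov_pa = 0.75    # cov(psi, alpha)\n",
"var_eps = 2.0    # var(eps)\n",
"\n",
"# var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps)\n",
"var_alpha = var_y - var_psi - 2 * cov_pa - var_eps\n",
"print(var_alpha)  # 3.0\n",
"```\n",
"\n",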
"**Hint**\n",
"\n",
"Check out how to add custom columns to your BipartitePandas DataFrame [here](https://tlamadon.github.io/bipartitepandas/notebooks/custom_columns.html#)! If you don't add custom columns properly, they may not be handled as you want or expect during data cleaning and estimation.\n",
"\n",
"## FE WITH CONTROLS\n",
"\n",
"## First, check out parameter options\n",
"\n",
"Do this by running:\n",
"\n",
"- FE with controls - `tw.fecontrol_params().describe_all()`\n",
"\n",
"- Cleaning - `bpd.clean_params().describe_all()`\n",
"\n",
"- Simulating - `tw.sim_blm_params().describe_all()`, `tw.sim_categorical_control_params().describe_all()`, and `tw.sim_continuous_control_params().describe_all()`\n",
"\n",
"Alternatively, run `x_params().keys()` to view all the keys for a parameter dictionary, then `x_params().describe(key)` to get a description for a single key."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Second, set parameter choices\n",
"\n",
"**Hint**\n",
"\n",
"If you just want to retrieve the firm and worker effects from the OLS estimation, set `'feonly': True` and `'attach_fe_estimates': True` in your FE parameters dictionary.\n",
"\n",
"If you want the OLS estimates to be linked to the original firm and worker ids, when initializing your BipartitePandas DataFrame set `track_id_changes=True`, then run `df = bdf.original_ids()` after fitting the estimator to extract a Pandas DataFrame with the original ids attached.\n",
"\n",
"**Hint**\n",
"\n",
"If you want to retrieve the estimated parameter vectors from the OLS estimation, each covariate's parameter vector can be accessed via the class attribute `.gamma_hat_dict`. For categorical variables, the normalized type will automatically be included in this vector (with value 0).\n",
"\n",
"**Note**\n",
"\n",
"We control which variances and covariances to estimate through the parameters `Q_var` and `Q_cov`. Multiple variances/covariances can be estimated by setting `Q_var` and/or `Q_cov` to be a list of variances/covariances, and the variances/covariances of sums of covariates can be estimated by inputting a list of the covariates to sum.\n",
"\n",
"**Note**\n",
"\n",
"We set `copy=False` in `clean_params` to avoid unnecessary copies (although this may modify the original dataframe).\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# FE\n",
"fecontrol_params = tw.fecontrol_params(\n",
" {\n",
" 'he': True,\n",
" 'categorical_controls': 'cat_control',\n",
" 'continuous_controls': 'cts_control',\n",
" 'Q_var': [\n",
" tw.Q.VarCovariate('psi'),\n",
" tw.Q.VarCovariate('alpha'),\n",
" tw.Q.VarCovariate('cat_control'),\n",
" tw.Q.VarCovariate('cts_control'),\n",
" tw.Q.VarCovariate(['psi', 'alpha']),\n",
" tw.Q.VarCovariate(['cat_control', 'cts_control'])\n",
" ],\n",
" 'Q_cov': [\n",
" tw.Q.CovCovariate('psi', 'alpha'),\n",
" tw.Q.CovCovariate('cat_control', 'cts_control'),\n",
" tw.Q.CovCovariate(['psi', 'alpha'], ['cat_control', 'cts_control'])\n",
" ],\n",
" 'ncore': 8\n",
" }\n",
")\n",
"# Cleaning\n",
"clean_params = bpd.clean_params(\n",
" {\n",
" 'connectedness': 'leave_out_spell',\n",
" 'collapse_at_connectedness_measure': True,\n",
" 'drop_single_stayers': True,\n",
" 'drop_returns': 'returners',\n",
" 'copy': False\n",
" }\n",
")\n",
"# Simulating\n",
"nl = 3\n",
"nk = 4\n",
"n_control = 2\n",
"sim_cat_params = tw.sim_categorical_control_params({\n",
" 'n': n_control,\n",
" 'worker_type_interaction': False,\n",
" 'stationary_A': True, 'stationary_S': True\n",
"})\n",
"sim_cts_params = tw.sim_continuous_control_params({\n",
" 'worker_type_interaction': False,\n",
" 'stationary_A': True, 'stationary_S': True\n",
"})\n",
"sim_blm_params = tw.sim_blm_params({\n",
" 'nl': nl,\n",
" 'nk': nk,\n",
" 'categorical_controls': {\n",
" 'cat_control': sim_cat_params\n",
" },\n",
" 'continuous_controls': {\n",
" 'cts_control': sim_cts_params\n",
" },\n",
" 'stationary_A': True, 'stationary_S': True,\n",
" 'linear_additive': True\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Third, extract data (we simulate for the example)\n",
"\n",
"`PyTwoWay` contains the class `SimBLM` which we use here to simulate a bipartite network with controls. If you have your own data, you can import it during this step. Load it as a `Pandas DataFrame` and then convert it into a `BipartitePandas DataFrame` in the next step."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"blm_true = tw.SimBLM(sim_blm_params)\n",
"sim_data = blm_true.simulate(return_parameters=False)\n",
"jdata, sdata = sim_data['jdata'], sim_data['sdata']\n",
"sim_data = pd.concat([jdata, sdata]).rename({'g': 'j', 'j': 'g'}, axis=1, allow_optional=True, allow_required=True)[['i', 'j1', 'j2', 'y1', 'y2', 'cat_control1', 'cat_control2', 'cts_control1', 'cts_control2']].construct_artificial_time(is_sorted=True, copy=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fourth, prepare data\n",
"\n",
"This is exactly how you should prepare real data prior to running the FE estimator.\n",
"\n",
"- First, we convert the data into a `BipartitePandas DataFrame`\n",
"\n",
"- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set `collapse_at_connectedness_measure=True`.\n",
"\n",
"- Third, we convert the data to long format, since the simulated data is in event study format\n",
"\n",
"Further details on `BipartitePandas` can be found in the package documentation, available [here](https://tlamadon.github.io/bipartitepandas/).\n",
"\n",
"**Note**\n",
"\n",
"Since leave-one-out connectedness is not maintained once the data is collapsed at the spell/match level, if you set `collapse_at_connectedness_measure=False`, the data must first be cleaned WITHOUT taking the leave-one-out set, then collapsed at the spell/match level, and only then can the largest leave-one-out connected set be computed.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"checking required columns and datatypes\n",
"converting data to long format\n",
"checking required columns and datatypes\n",
"sorting rows\n",
"dropping NaN observations\n",
"generating 'm' column\n",
"keeping highest paying job for i-t (worker-year) duplicates (how='max')\n",
"dropping workers who leave a firm then return to it (how='returners')\n",
"making 'i' ids contiguous\n",
"making 'j' ids contiguous\n",
"computing largest connected set (how=None)\n",
"sorting columns\n",
"resetting index\n",
"checking required columns and datatypes\n",
"sorting rows\n",
"generating 'm' column\n",
"computing largest connected set (how='leave_out_observation')\n",
"making 'i' ids contiguous\n",
"sorting columns\n",
"resetting index\n",
"converting data back to event study format\n"
]
}
],
"source": [
"# Convert into BipartitePandas DataFrame\n",
"bdf = bpd.BipartiteDataFrame(sim_data)\n",
"# Clean and collapse\n",
"bdf = bdf.clean(clean_params)\n",
"# Convert to long format\n",
"bdf = bdf.to_long(is_sorted=True, copy=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fifth, initialize and run the estimator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize FE estimator\n",
"fe_estimator = tw.FEControlEstimator(bdf, fecontrol_params)\n",
"# Fit FE estimator\n",
"fe_estimator.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finally, investigate the results\n",
"\n",
"Results correspond to:\n",
"\n",
"- `y`: income (outcome) column\n",
"- `eps`: residual\n",
"- `psi`: firm effects\n",
"- `alpha`: worker effects\n",
"- `cat_control`: categorical control\n",
"- `cts_control`: continuous control\n",
"- `fe`: plug-in (biased) estimate\n",
"- `ho`: homoskedastic-corrected estimate\n",
"- `he`: heteroskedastic-corrected estimate\n",
"\n",
"**Warning**\n",
"\n",
"If you notice variability between estimations for the HO- and HE-corrected results, this is because the estimation relies on approximations that depend on randomization. Increasing the number of draws for the approximations (`ndraw_trace_sigma_2` and `ndraw_trace_ho` for the HO correction, and `ndraw_trace_he` and `ndraw_lev_he` for the HE correction) will increase the stability of the results between estimations.\n",
"\n",
"
\n",
"**Note**\n",
"\n",
"By default, the variance is `var(psi)` and the covariance is `cov(psi, alpha)`. The default estimates don't include `var(alpha)`, but if you don't include controls, `var(alpha)` can be computed as the residual from `var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps)`.\n",
"\n",
"