CRE example

[1]:
# Add PyTwoWay to system path (do not run this)
# import sys
# sys.path.append('../../..')

Import the PyTwoWay package

Make sure to install it using pip install pytwoway.

[2]:
import pytwoway as tw
import bipartitepandas as bpd

First, check out parameter options

Do this by running:

  • CRE - tw.cre_params().describe_all()

  • Clustering - bpd.cluster_params().describe_all()

  • Cleaning - bpd.clean_params().describe_all()

  • Simulating - bpd.sim_params().describe_all()

Alternatively, call .keys() on any of these parameter dictionaries (for example, tw.cre_params().keys()) to list all of its keys, then call .describe(key) on the same dictionary to get the description of a single key.

Second, set parameter choices

Note

We set copy=False in clean_params to avoid unnecessary copies. As a consequence, cleaning may modify the original dataframe in place.

[3]:
# CRE
cre_params = tw.cre_params()
# Clustering
# Use firm-level CDFs of income as the clustering measure
measures = bpd.measures.CDFs()
# Group using k-means
grouping = bpd.grouping.KMeans()
# General clustering
cluster_params = bpd.cluster_params(
    {
        'measures': measures,
        'grouping': grouping
    }
)
# Cleaning
clean_params = bpd.clean_params(
    {
        'connectedness': 'leave_out_spell',
        'collapse_at_connectedness_measure': True,
        'drop_single_stayers': True,
        'drop_returns': 'returners',
        'copy': False
    }
)
# Simulating
sim_params = bpd.sim_params(
    {
        'n_workers': 1000,
        'firm_size': 5,
        'alpha_sig': 2, 'w_sig': 2,
        'c_sort': 1.5, 'c_netw': 1.5,
        'p_move': 0.1
    }
)

Third, extract data (we simulate for the example)

BipartitePandas contains the class SimBipartite which we use here to simulate a bipartite network. If you have your own data, you can import it during this step. Load it as a Pandas DataFrame and then convert it into a BipartitePandas DataFrame in the next step.

[4]:
sim_data = bpd.SimBipartite(sim_params).simulate()

Fourth, prepare data

This is exactly how you should prepare real data prior to running the CRE estimator.

  • First, we convert the data into a BipartitePandas DataFrame

  • Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set collapse_at_connectedness_measure=True.

  • Third, we cluster firms by their wage distributions, to generate firm classes (columns g1 and g2). Alternatively, manually set the columns g1 and g2 to pre-estimated clusters (but make sure to add them correctly!).

  • Fourth, we convert the data into cross-section format
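The spell-level collapse mentioned in the second step above can be sketched in plain Python. This is only an illustration of the idea (one row per worker-firm spell, with the mean wage over the spell), not the BipartitePandas implementation. The record layout and the helper name collapse_spells are hypothetical.

```python
from itertools import groupby
from statistics import mean

# Hypothetical long-format records: (worker id, firm id, period, wage)
records = [
    (0, 0, 1, 10.0),
    (0, 0, 2, 12.0),
    (0, 1, 3, 20.0),
    (1, 1, 1, 15.0),
    (1, 1, 2, 17.0),
]


def collapse_spells(rows):
    """Collapse consecutive observations of a worker at the same firm into
    one row: (worker, firm, first period, last period, mean wage)."""
    # Sort by worker, then by period, so each spell is a contiguous run
    rows = sorted(rows, key=lambda r: (r[0], r[2]))
    spells = []
    # groupby only merges adjacent rows, so a worker who later returns to a
    # firm would start a new spell rather than extend the old one
    for (worker, firm), group in groupby(rows, key=lambda r: (r[0], r[1])):
        group = list(group)
        spells.append(
            (worker, firm, group[0][2], group[-1][2], mean(r[3] for r in group))
        )
    return spells


spells = collapse_spells(records)
for spell in spells:
    print(spell)
```

Worker 0 ends up with two spells (one per firm) and worker 1 with a single spell.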

Further details on BipartitePandas can be found in the package documentation.
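To give some intuition for the clustering step above (firm-level CDFs of income grouped with k-means), here is a minimal pure-Python sketch. It is not how BipartitePandas implements bpd.measures.CDFs or bpd.grouping.KMeans. The toy wages, the cutoffs, the helper names, and the deterministic farthest-first initialization are all assumptions made for illustration.

```python
# Hypothetical wages observed at each firm: firms 0-1 pay low, firms 2-3 pay high
wages_by_firm = {
    0: [0.9, 1.0, 1.1, 1.2],
    1: [0.8, 1.0, 1.1, 1.3],
    2: [2.9, 3.0, 3.1, 3.2],
    3: [2.8, 3.0, 3.1, 3.3],
}
# Hypothetical common wage cutoffs at which every firm's CDF is evaluated
cutoffs = [1.0, 2.0, 3.0]


def firm_cdf(wages, cutoffs):
    """Empirical CDF of one firm's wages, evaluated at each cutoff."""
    return [sum(w <= c for w in wages) / len(wages) for c in cutoffs]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic farthest-first initialization (an illustrative choice)
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


firm_ids = sorted(wages_by_firm)
cdf_vectors = [firm_cdf(wages_by_firm[j], cutoffs) for j in firm_ids]
firm_class = dict(zip(firm_ids, kmeans(cdf_vectors, k=2)))
print(firm_class)
```

Firms with similar wage distributions receive the same class label, which is the role the g1 and g2 columns play in the cleaned data.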

Note

Leave-one-out connectedness is not maintained once data is collapsed at the spell/match level. If you set collapse_at_connectedness_measure=False, you must therefore clean the data WITHOUT taking the leave-one-out set, collapse it at the spell/match level, and only then compute the largest leave-one-out connected set.
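A toy example may help show why the order of operations in this note matters. The sketch below uses a simplified notion of leave-one-observation-out connectedness: all firms must remain linked through workers after any single row is removed. It is not the BipartitePandas algorithm, and the data and function names are hypothetical.

```python
def firms_stay_connected(kept, all_firms):
    """True if every firm in all_firms lies in one connected component of the
    worker-firm graph whose edges are the kept (worker, firm) rows."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for worker, firm in kept:
        # Union the worker node with the firm node
        parent[find(("w", worker))] = find(("f", firm))
    roots = {find(("f", firm)) for firm in all_firms}
    return len(roots) == 1


def leave_one_out_connected(rows):
    """True if the firms stay connected after dropping any single row."""
    all_firms = {firm for _, firm in rows}
    return all(
        firms_stay_connected(rows[:k] + rows[k + 1:], all_firms)
        for k in range(len(rows))
    )


# Hypothetical worker 0, observed twice at firm "A" and then twice at firm "B"
raw = [(0, "A"), (0, "A"), (0, "B"), (0, "B")]
# After collapsing at the spell level there is one row per worker-firm spell
collapsed = [(0, "A"), (0, "B")]

print(leave_one_out_connected(raw))
print(leave_one_out_connected(collapsed))
```

The uncollapsed rows pass the check because each spell is backed by two observations, while the collapsed rows fail it: dropping either spell disconnects the two firms. This is why the leave-one-out set has to be computed on the collapsed data.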

[5]:
# Convert into BipartitePandas DataFrame
bdf = bpd.BipartiteDataFrame(sim_data)
# Clean and collapse
bdf = bdf.clean(clean_params)
# Cluster
bdf = bdf.cluster(cluster_params)
# Convert to cross-section format
bdf_cs = bdf.to_eventstudy(is_sorted=True, copy=False).get_cs(copy=False)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
checking required columns and datatypes
sorting rows
generating 'm' column
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
making 'j' ids contiguous
sorting columns
resetting index

Fifth, initialize and run the estimator

[6]:
# Initialize CRE estimator
cre_estimator = tw.CREEstimator(bdf_cs, cre_params)
# Fit CRE estimator
cre_estimator.fit()

Finally, investigate the results

[7]:
cre_estimator.summary
[7]:
{'var_y': 6.700106866872871,
 'var_bw': 1.5696696168532582,
 'cov_bw': 1.1955111878642923,
 'var_tot': 1.442983307938817,
 'cov_tot': 1.1647823062712872}
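One simple way to read these numbers is to express each component as a fraction of var_y. The sketch below does only that arithmetic, using the values printed above. Reading the _bw keys as between-class components and the _tot keys as totals is an assumption based on the key names, not something stated in this example.

```python
# Values copied from the summary output above
summary = {
    'var_y': 6.700106866872871,
    'var_bw': 1.5696696168532582,
    'cov_bw': 1.1955111878642923,
    'var_tot': 1.442983307938817,
    'cov_tot': 1.1647823062712872,
}

# Each component as a share of the variance of the outcome
shares = {
    key: value / summary['var_y']
    for key, value in summary.items()
    if key != 'var_y'
}

for key, share in shares.items():
    print(f'{key}: {share:.3f}')
```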