Advanced features

Import the BipartitePandas package

Make sure to install it using pip install bipartitepandas.

[1]:
import bipartitepandas as bpd

Get your data ready

For this notebook, we simulate data.

[2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)
i j y t
0 0 66 -0.884222 0
1 0 49 -2.224977 1
2 0 49 -1.711429 2
3 0 115 -0.082440 3
4 0 6 -1.190540 4
... ... ... ... ...
49995 9999 146 0.505200 0
49996 9999 114 -0.095290 1
49997 9999 114 -1.240715 2
49998 9999 114 0.701339 3
49999 9999 114 -0.610246 4

50000 rows × 4 columns

Advanced data cleaning

Hint

Want details on all cleaning parameters? Run bpd.clean_params().describe_all(), or search through bpd.clean_params().keys() for a particular key, and then run bpd.clean_params().describe(key).

Set how to handle worker-year duplicates

Use the parameter i_t_how to customize how worker-year duplicates are handled.

Collapse at the match-level

If you drop the t column, collapsing will automatically collapse at the match level. However, this prevents conversion to event study format (this can be bypassed with the .construct_artificial_time() method, but the data will likely have a meaningless order, rendering the event study uninterpretable).

Avoid unnecessary copies

If you are working with a large dataset, you will want to avoid copies whenever possible: set copy=False.

Avoid unnecessary sorts

If you know your data is sorted by i and t (or, if you aren’t including a t column, just by i), then set is_sorted=True.

Avoid complicated loops

Sometimes workers leave a firm, then return to it (we call these workers returners). Returners can cause computational difficulties because sometimes intermediate firms get dropped (e.g. a worker goes from firm \(A \to B \to A\), but firm \(B\) gets dropped). This turns returners into stayers. This can change the largest connected set of firms, and if data is in collapsed format, requires the data to be recollapsed.

Because of these potential complications, if there are returners, many methods require loops that run until convergence.

These difficulties can be avoided by setting the parameter drop_returns (there are multiple ways to handle returners; they can be seen by running bpd.clean_params().describe('drop_returns')).

Alternative

Another way to handle returners is to drop the t column. Then, sorting will automatically sort by i and j, which eliminates the possibility of returners. However, this prevents conversion to event study format (this can be bypassed with the .construct_artificial_time() method, but the data will likely have a meaningless order, rendering the event study uninterpretable).

Advanced clustering

Install Intel(R) Extension for Scikit-learn

Intel(R) Extension for Scikit-learn (GitHub) can speed up KMeans clustering.
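A minimal sketch of enabling the extension before clustering (assumes the package has been installed separately):

```python
# Requires: pip install scikit-learn-intelex
from sklearnex import patch_sklearn

# Patch scikit-learn so that KMeans (which BipartitePandas uses for
# clustering) dispatches to the accelerated Intel(R) implementation
patch_sklearn()

from sklearn.cluster import KMeans  # now the patched implementation
```

The patch must run before scikit-learn estimators are imported, so place it at the top of your script.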

Advanced dataframe handling

Disable logging

Logging can slow down basic operations on BipartitePandas dataframes (e.g. data cleaning). Set the parameter log=False when constructing your dataframe to turn off logging.

Use method chaining with in-place operations

Unlike standard Pandas, BipartitePandas allows method chaining with in-place operations.

Understand the difference between general columns and subcolumns

Users interact with general columns, while BipartitePandas dataframes display subcolumns. As an example, for event study format, the columns for firm clusters are labeled g1 and g2. These are the subcolumns for general column g. If you want to drop firm clusters from the dataframe, rather than dropping g1 and g2 separately, you must drop the general column g. This paradigm applies throughout BipartitePandas and the documentation will make clear when you should specify general columns.

Bypass restrictions

Sometimes BipartitePandas imposes restrictions that you may want to bypass. While BipartitePandas does not provide an explicit way to disable these restrictions, you can bypass them by converting your data into a Pandas dataframe, running the code that was formerly restricted, then converting it back into a BipartitePandas dataframe.

Simpler constructor

If the columns in your Pandas dataframe are already named correctly, you can simply put the dataframe as a parameter into the BipartitePandas dataframe constructor. Here is an example:

[3]:
bdf = bpd.BipartiteDataFrame(df).clean()
display(bdf)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
i j y t m alpha k l psi
0 0 66 -0.884222 0 1 -0.967422 3 0 -0.348756
1 0 49 -2.224977 1 1 -0.967422 2 0 -0.604585
2 0 49 -1.711429 2 1 -0.967422 2 0 -0.604585
3 0 115 -0.082440 3 2 -0.967422 5 0 0.114185
4 0 6 -1.190540 4 1 -0.967422 0 0 -1.335178
... ... ... ... ... ... ... ... ... ...
49995 9999 146 0.505200 0 1 0.000000 7 2 0.604585
49996 9999 114 -0.095290 1 1 0.000000 5 2 0.114185
49997 9999 114 -1.240715 2 0 0.000000 5 2 0.114185
49998 9999 114 0.701339 3 0 0.000000 5 2 0.114185
49999 9999 114 -0.610246 4 0 0.000000 5 2 0.114185

50000 rows × 9 columns

Restore original ids

To restore original ids, the dataframe must track ids as they change during cleaning; enable this by setting track_id_changes=True.

Notice that in this example we use j / 2, so that j will be modified during data cleaning.

The method .original_ids() will then return a dataframe that merges in the original ids.

[4]:
bdf_original_ids = bpd.BipartiteDataFrame(
    i=df['i'], j=(df['j'] / 2), y=df['y'], t=df['t'],
    track_id_changes=True
).clean()
display(bdf_original_ids.original_ids())
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
i j y t m original_j
0 0 0 -0.884222 0 1 33.0
1 0 1 -2.224977 1 1 24.5
2 0 1 -1.711429 2 1 24.5
3 0 2 -0.082440 3 2 57.5
4 0 3 -1.190540 4 1 3.0
... ... ... ... ... ... ...
49995 9999 109 0.505200 0 1 73.0
49996 9999 28 -0.095290 1 1 57.0
49997 9999 28 -1.240715 2 0 57.0
49998 9999 28 0.701339 3 0 57.0
49999 9999 28 -0.610246 4 0 57.0

50000 rows × 6 columns

Retrieve and update custom column properties

Custom column properties can be retrieved with the method .get_column_properties() and updated with the method .set_column_properties().

[5]:
display(bdf.get_column_properties('alpha'))
bdf = bdf.set_column_properties(
    'alpha', is_categorical=True, dtype='categorical'
)
display(bdf.get_column_properties('alpha'))
{'general_column': 'alpha',
 'subcolumns': 'alpha',
 'dtype': 'float',
 'is_categorical': False,
 'how_collapse': 'mean',
 'long_es_split': True}
{'general_column': 'alpha',
 'subcolumns': 'alpha',
 'dtype': 'categorical',
 'is_categorical': True,
 'how_collapse': 'first',
 'long_es_split': True}

Compare dataframes

Dataframes can be compared using the utility function bpd.util.compare_frames().

[6]:
bpd.util.compare_frames(
    bdf, bdf.iloc[:len(bdf) // 2],
    size_variable='len', operator='geq'
)
[6]:
True

Fill in missing periods as unemployed

The method .fill_missing_periods() (for Long format) will fill in rows for missing intermediate periods.

Hint

Filling in missing periods is a useful way to make sure that .collapse() only collapses over worker-firm spells that cover consecutive periods.

In this example, we drop periods 1-3, then fill them in, setting k (firm type) to become \(-1\) and l (worker type) to become its previous value:

[7]:
bdf_missing = bdf[
    (bdf['t'] == 0) | (bdf['t'] == 4)
].clean()
bdf_fill_missing = bdf_missing.fill_missing_periods(
    {
        'k': -1,
        'l': 'prev'
    }
)
display(bdf_fill_missing)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
i j y t m alpha k l psi
0 0 66 -0.884222 0 1 0 3 0 -0.348756
1 0 -1 <NA> 1 <NA> <NA> -1 0 <NA>
2 0 -1 <NA> 2 <NA> <NA> -1 0 <NA>
3 0 -1 <NA> 3 <NA> <NA> -1 0 <NA>
4 0 6 -1.19054 4 1 0 0 0 -1.335178
... ... ... ... ... ... ... ... ... ...
49995 9999 146 0.5052 0 1 3 7 2 0.604585
49996 9999 -1 <NA> 1 <NA> <NA> -1 2 <NA>
49997 9999 -1 <NA> 2 <NA> <NA> -1 2 <NA>
49998 9999 -1 <NA> 3 <NA> <NA> -1 2 <NA>
49999 9999 114 -0.610246 4 1 3 5 2 0.114185

50000 rows × 9 columns

Extended event studies

BipartitePandas allows you to use Long format data to generate event studies with more than 2 periods.

You can specify:

  • which column signals a transition (e.g. if j is used, a transition is when a worker moves firms)

  • which column(s) should be treated as the event study outcome

  • how many periods before and after the transition should be considered

  • whether the pre- and/or post-trends must be stable, and for which column(s)

We consider an example where j is the transition column, y is the outcome column, and the pre- and post-trends each have length 2 and are required to occur at a single firm. Note that y_f1 is the first observation after the individual moves firms.

[8]:
es_extended = bdf.get_extended_eventstudy(
    transition_col='j', outcomes='y',
    periods_pre=2, periods_post=2,
    stable_pre='j', stable_post='j'
)
display(es_extended)
i t y_l2 y_l1 y_f1 y_f2
0 7 2 2.660179 1.376463 1.965843 2.006426
1 15 2 1.033709 1.605745 2.222907 1.576384
2 28 3 0.814740 2.291087 2.511761 1.557221
3 33 3 2.097549 0.334050 0.727795 0.578112
4 35 2 -2.503758 -0.281162 0.201840 0.086961
... ... ... ... ... ... ...
2543 9988 2 0.062628 -0.286134 -0.183206 -1.367690
2544 9991 3 -3.105353 0.688329 0.432237 -0.346683
2545 9994 2 -0.650689 -3.249975 -2.090252 -2.282304
2546 9995 2 0.597124 -0.159967 1.380550 2.604956
2547 9996 2 -0.621562 -0.356489 -2.629698 -1.638295

2548 rows × 6 columns

Advanced data simulation

For details on all simulation parameters, run bpd.sim_params().describe_all(), or search through bpd.sim_params().keys() for a particular key, and then run bpd.sim_params().describe(key).