Advanced features
Import the BipartitePandas package
Make sure to install it using pip install bipartitepandas.
[1]:
import bipartitepandas as bpd
Get your data ready
For this notebook, we simulate data.
[2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)
| | i | j | y | t |
|---|---|---|---|---|
0 | 0 | 66 | -0.884222 | 0 |
1 | 0 | 49 | -2.224977 | 1 |
2 | 0 | 49 | -1.711429 | 2 |
3 | 0 | 115 | -0.082440 | 3 |
4 | 0 | 6 | -1.190540 | 4 |
... | ... | ... | ... | ... |
49995 | 9999 | 146 | 0.505200 | 0 |
49996 | 9999 | 114 | -0.095290 | 1 |
49997 | 9999 | 114 | -1.240715 | 2 |
49998 | 9999 | 114 | 0.701339 | 3 |
49999 | 9999 | 114 | -0.610246 | 4 |
50000 rows × 4 columns
Advanced data cleaning
Hint
Want details on all cleaning parameters? Run bpd.clean_params().describe_all(), or search through bpd.clean_params().keys() for a particular key, and then run bpd.clean_params().describe(key).
Set how to handle worker-year duplicates
Use the parameter i_t_how to customize how worker-year duplicates are handled.
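A minimal sketch (run the describe call first to see which options are supported; 'max' is the default, per the cleaning log below):

[ ]:
# list the supported options for handling worker-year duplicates
bpd.clean_params().describe('i_t_how')

# keep the highest-paying job for each worker-year
bdf = bdf.clean(bpd.clean_params({'i_t_how': 'max'}))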
Collapse at the match-level
If you drop the t column, collapsing will automatically collapse at the match level. However, this prevents conversion to event study format (this can be bypassed with the .construct_artificial_time() method, but the data will likely have a meaningless order, rendering the event study uninterpretable).
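A sketch of one way to do this, by constructing the dataframe without a t column in the first place:

[ ]:
# no t column, so .collapse() operates at the match level
bdf_no_t = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y']).clean()
bdf_match = bdf_no_t.collapse()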
Avoid unnecessary copies
If you are working with a large dataset, you will want to avoid copies whenever possible, so set copy=False.
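For example, assuming copy is set through the cleaning parameters:

[ ]:
# clean without copying the underlying data
bdf = bdf.clean(bpd.clean_params({'copy': False}))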
Avoid unnecessary sorts
If you know your data is sorted by i and t (or, if you aren’t including a t column, just by i), then set is_sorted=True.
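For example, again assuming this is set through the cleaning parameters:

[ ]:
# skip re-sorting during cleaning when rows are already ordered by i and t
bdf = bdf.clean(bpd.clean_params({'is_sorted': True}))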
Avoid complicated loops
Sometimes workers leave a firm, then return to it (we call these workers returners). Returners can cause computational difficulties because sometimes intermediate firms get dropped (e.g. a worker goes from firm \(A \to B \to A\), but firm \(B\) gets dropped). This turns returners into stayers. This can change the largest connected set of firms, and if data is in collapsed format, requires the data to be recollapsed.
Because of these potential complications, if there are returners, many methods require loops that run until convergence.
These difficulties can be avoided by setting the parameter drop_returns (there are multiple ways to handle returners; they can be seen by running bpd.clean_params().describe('drop_returns')).
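A sketch (the option name below is an assumption; use one of the values listed by the describe call):

[ ]:
# see the available ways to handle returners
bpd.clean_params().describe('drop_returns')

# e.g. drop all observations for returners (hypothetical option name)
bdf = bdf.clean(bpd.clean_params({'drop_returns': 'returners'}))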
Alternative
Another way to handle returners is to drop the t column. Then the data will automatically be sorted by i and j, which eliminates the possibility of returners. However, this prevents conversion to event study format (this can be bypassed with the .construct_artificial_time() method, but the data will likely have a meaningless order, rendering the event study uninterpretable).
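A sketch of this alternative:

[ ]:
# without a t column, rows sort by i and j, so returners cannot arise
bdf_no_t = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y']).clean()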
Advanced clustering
Install Intel(R) Extension for Scikit-learn
The Intel(R) Extension for Scikit-learn (available on GitHub) can speed up KMeans clustering.
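A minimal sketch (the patch must run before scikit-learn is imported, and the .cluster() call assumes default clustering parameters):

[ ]:
# replace stock scikit-learn estimators with Intel-accelerated versions
from sklearnex import patch_sklearn
patch_sklearn()

# clustering that uses KMeans now runs the accelerated implementation
bdf = bdf.clean().cluster()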
Advanced dataframe handling
Disable logging
Logging can slow down basic operations on BipartitePandas dataframes (e.g. data cleaning). Set the parameter log=False when constructing your dataframe to turn off logging.
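For example:

[ ]:
# construct the dataframe with logging disabled
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], log=False)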
Use method chaining with in-place operations
Unlike standard Pandas, BipartitePandas allows method chaining with in-place operations.
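A sketch of what this allows (assuming .collapse() accepts a copy argument like other methods):

[ ]:
# clean and collapse in a single chain, avoiding intermediate copies
bdf = (
    bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], log=False)
    .clean(bpd.clean_params({'copy': False}))
    .collapse(copy=False)
)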
Understand the difference between general columns and subcolumns
Users interact with general columns, while BipartitePandas dataframes display subcolumns. As an example, in event study format the columns for firm clusters are labeled g1 and g2; these are the subcolumns of the general column g. If you want to drop firm clusters from the dataframe, rather than dropping g1 and g2 separately, you must drop the general column g. This paradigm applies throughout BipartitePandas, and the documentation will make clear when you should specify general columns.
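A sketch (assuming .cluster() and .to_eventstudy() with default arguments, and a Pandas-style .drop() signature):

[ ]:
bdf_g = bdf.clean().cluster()      # adds firm clusters, general column g
bdf_es = bdf_g.to_eventstudy()     # event study format stores subcolumns g1 and g2
bdf_es = bdf_es.drop('g', axis=1)  # drop the general column g, not g1/g2 separately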
Bypass restrictions
Sometimes BipartitePandas imposes restrictions that you may want to bypass. While BipartitePandas does not provide an explicit way to disable these restrictions, you can bypass them by converting your data into a Pandas dataframe, running the code that was formerly restricted, then converting it back into a BipartitePandas dataframe.
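A sketch of the round trip (the conversion back uses the simpler constructor described in the next section):

[ ]:
import pandas as pd

df_plain = pd.DataFrame(bdf)  # plain Pandas copy; restrictions no longer apply
# ... run the formerly restricted code on df_plain here ...
bdf = bpd.BipartiteDataFrame(df_plain).clean()  # convert back and re-clean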
Simpler constructor
If the columns in your Pandas dataframe are already named correctly, you can simply put the dataframe as a parameter into the BipartitePandas dataframe constructor. Here is an example:
[3]:
bdf = bpd.BipartiteDataFrame(df).clean()
display(bdf)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 66 | -0.884222 | 0 | 1 | -0.967422 | 3 | 0 | -0.348756 |
1 | 0 | 49 | -2.224977 | 1 | 1 | -0.967422 | 2 | 0 | -0.604585 |
2 | 0 | 49 | -1.711429 | 2 | 1 | -0.967422 | 2 | 0 | -0.604585 |
3 | 0 | 115 | -0.082440 | 3 | 2 | -0.967422 | 5 | 0 | 0.114185 |
4 | 0 | 6 | -1.190540 | 4 | 1 | -0.967422 | 0 | 0 | -1.335178 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 146 | 0.505200 | 0 | 1 | 0.000000 | 7 | 2 | 0.604585 |
49996 | 9999 | 114 | -0.095290 | 1 | 1 | 0.000000 | 5 | 2 | 0.114185 |
49997 | 9999 | 114 | -1.240715 | 2 | 0 | 0.000000 | 5 | 2 | 0.114185 |
49998 | 9999 | 114 | 0.701339 | 3 | 0 | 0.000000 | 5 | 2 | 0.114185 |
49999 | 9999 | 114 | -0.610246 | 4 | 0 | 0.000000 | 5 | 2 | 0.114185 |
50000 rows × 9 columns
Restore original ids
To restore original ids, the dataframe needs to track ids as they change; enable this by setting track_id_changes=True when constructing the dataframe.
Notice that in this example we use j / 2, so that j will be modified during data cleaning.
The method .original_ids() will then return a dataframe that merges in the original ids.
[4]:
bdf_original_ids = bpd.BipartiteDataFrame(
i=df['i'], j=(df['j'] / 2), y=df['y'], t=df['t'],
track_id_changes=True
).clean()
display(bdf_original_ids.original_ids())
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | original_j |
|---|---|---|---|---|---|---|
0 | 0 | 0 | -0.884222 | 0 | 1 | 33.0 |
1 | 0 | 1 | -2.224977 | 1 | 1 | 24.5 |
2 | 0 | 1 | -1.711429 | 2 | 1 | 24.5 |
3 | 0 | 2 | -0.082440 | 3 | 2 | 57.5 |
4 | 0 | 3 | -1.190540 | 4 | 1 | 3.0 |
... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 109 | 0.505200 | 0 | 1 | 73.0 |
49996 | 9999 | 28 | -0.095290 | 1 | 1 | 57.0 |
49997 | 9999 | 28 | -1.240715 | 2 | 0 | 57.0 |
49998 | 9999 | 28 | 0.701339 | 3 | 0 | 57.0 |
49999 | 9999 | 28 | -0.610246 | 4 | 0 | 57.0 |
50000 rows × 6 columns
Retrieve and update custom column properties
Custom column properties can be retrieved with the method .get_column_properties() and updated with the method .set_column_properties().
[5]:
display(bdf.get_column_properties('alpha'))
bdf = bdf.set_column_properties(
'alpha', is_categorical=True, dtype='categorical'
)
display(bdf.get_column_properties('alpha'))
{'general_column': 'alpha',
'subcolumns': 'alpha',
'dtype': 'float',
'is_categorical': False,
'how_collapse': 'mean',
'long_es_split': True}
{'general_column': 'alpha',
'subcolumns': 'alpha',
'dtype': 'categorical',
'is_categorical': True,
'how_collapse': 'first',
'long_es_split': True}
Compare dataframes
Dataframes can be compared using the utility function bpd.util.compare_frames().
[6]:
bpd.util.compare_frames(
bdf, bdf.iloc[:len(bdf) // 2],
size_variable='len', operator='geq'
)
[6]:
True
Fill in missing periods as unemployed
The method .fill_missing_periods() (for Long format) will fill in rows for missing intermediate periods.
Hint
Filling in missing periods is a useful way to make sure that .collapse() only collapses worker-firm spells that cover consecutive periods.
In this example, we drop periods 1-3, then fill them in, setting k (firm type) to \(-1\) and l (worker type) to its previous value:
[7]:
bdf_missing = bdf[
(bdf['t'] == 0) | (bdf['t'] == 4)
].clean()
bdf_fill_missing = bdf_missing.fill_missing_periods(
{
'k': -1,
'l': 'prev'
}
)
display(bdf_fill_missing)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 66 | -0.884222 | 0 | 1 | 0 | 3 | 0 | -0.348756 |
1 | 0 | -1 | <NA> | 1 | <NA> | <NA> | -1 | 0 | <NA> |
2 | 0 | -1 | <NA> | 2 | <NA> | <NA> | -1 | 0 | <NA> |
3 | 0 | -1 | <NA> | 3 | <NA> | <NA> | -1 | 0 | <NA> |
4 | 0 | 6 | -1.19054 | 4 | 1 | 0 | 0 | 0 | -1.335178 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 146 | 0.5052 | 0 | 1 | 3 | 7 | 2 | 0.604585 |
49996 | 9999 | -1 | <NA> | 1 | <NA> | <NA> | -1 | 2 | <NA> |
49997 | 9999 | -1 | <NA> | 2 | <NA> | <NA> | -1 | 2 | <NA> |
49998 | 9999 | -1 | <NA> | 3 | <NA> | <NA> | -1 | 2 | <NA> |
49999 | 9999 | 114 | -0.610246 | 4 | 1 | 3 | 5 | 2 | 0.114185 |
50000 rows × 9 columns
Extended event studies
BipartitePandas allows you to use Long format data to generate event studies with more than 2 periods.
You can specify:
- which column signals a transition (e.g. if j is used, a transition is when a worker moves firms)
- which column(s) should be treated as the event study outcome
- how many periods before and after the transition should be considered
- whether the pre- and/or post-trends must be stable, and for which column(s)
We consider an example where j is the transition column, y is the outcome column, and the pre- and post-trends have length 2 and are required to be at the same firm. Note that y_f1 is the first observation after the individual moves firms.
[8]:
es_extended = bdf.get_extended_eventstudy(
transition_col='j', outcomes='y',
periods_pre=2, periods_post=2,
stable_pre='j', stable_post='j'
)
display(es_extended)
| | i | t | y_l2 | y_l1 | y_f1 | y_f2 |
|---|---|---|---|---|---|---|
0 | 7 | 2 | 2.660179 | 1.376463 | 1.965843 | 2.006426 |
1 | 15 | 2 | 1.033709 | 1.605745 | 2.222907 | 1.576384 |
2 | 28 | 3 | 0.814740 | 2.291087 | 2.511761 | 1.557221 |
3 | 33 | 3 | 2.097549 | 0.334050 | 0.727795 | 0.578112 |
4 | 35 | 2 | -2.503758 | -0.281162 | 0.201840 | 0.086961 |
... | ... | ... | ... | ... | ... | ... |
2543 | 9988 | 2 | 0.062628 | -0.286134 | -0.183206 | -1.367690 |
2544 | 9991 | 3 | -3.105353 | 0.688329 | 0.432237 | -0.346683 |
2545 | 9994 | 2 | -0.650689 | -3.249975 | -2.090252 | -2.282304 |
2546 | 9995 | 2 | 0.597124 | -0.159967 | 1.380550 | 2.604956 |
2547 | 9996 | 2 | -0.621562 | -0.356489 | -2.629698 | -1.638295 |
2548 rows × 6 columns
Advanced data simulation
For details on all simulation parameters, run bpd.sim_params().describe_all(), or search through bpd.sim_params().keys() for a particular key, and then run bpd.sim_params().describe(key).