Advanced features
Import the BipartitePandas package
Make sure to install it using pip install bipartitepandas.
[1]:
import bipartitepandas as bpd
Get your data ready
For this notebook, we simulate data.
[2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)
| | i | j | y | t |
|---|---|---|---|---|
0 | 0 | 66 | -0.884222 | 0 |
1 | 0 | 49 | -2.224977 | 1 |
2 | 0 | 49 | -1.711429 | 2 |
3 | 0 | 115 | -0.082440 | 3 |
4 | 0 | 6 | -1.190540 | 4 |
... | ... | ... | ... | ... |
49995 | 9999 | 146 | 0.505200 | 0 |
49996 | 9999 | 114 | -0.095290 | 1 |
49997 | 9999 | 114 | -1.240715 | 2 |
49998 | 9999 | 114 | 0.701339 | 3 |
49999 | 9999 | 114 | -0.610246 | 4 |
50000 rows × 4 columns
Advanced data cleaning
Hint
Want details on all cleaning parameters? Run bpd.clean_params().describe_all(), or search through bpd.clean_params().keys() for a particular key, and then run bpd.clean_params().describe(key).
Set how to handle worker-year duplicates
Use the parameter i_t_how to customize how worker-year duplicates are handled.
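A minimal sketch (run the describe call first to see which options are supported; 'max' is the default, per the cleaning log below):

[ ]:
# list the supported options for handling worker-year duplicates
bpd.clean_params().describe('i_t_how')

# keep the highest-paying job for each worker-year
bdf = bdf.clean(bpd.clean_params({'i_t_how': 'max'}))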
Collapse at the match-level
If you drop the t column, collapsing will automatically collapse at the match level. However, this prevents conversion to event study format (this can be bypassed with the .construct_artificial_time() method, but the data will likely have a meaningless order, rendering the event study uninterpretable).
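A sketch of one way to do this, by constructing the dataframe without a t column in the first place:

[ ]:
# no t column, so .collapse() operates at the match level
bdf_no_t = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y']).clean()
bdf_match = bdf_no_t.collapse()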
Avoid unnecessary copies
If you are working with a large dataset, you will want to avoid copies whenever possible, so set copy=False.
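For example, assuming copy is set through the cleaning parameters:

[ ]:
# clean without copying the underlying data
bdf = bdf.clean(bpd.clean_params({'copy': False}))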
Avoid unnecessary sorts
If you know your data is sorted by i and t (or, if you aren’t including a t column, just by i), then set is_sorted=True.
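For example, again assuming this is set through the cleaning parameters:

[ ]:
# skip re-sorting during cleaning when rows are already ordered by i and t
bdf = bdf.clean(bpd.clean_params({'is_sorted': True}))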
Avoid complicated loops
Sometimes workers leave a firm, then return to it (we call these workers returners). Returners can cause computational difficulties because sometimes intermediate firms get dropped (e.g. a worker goes from firm \(A \to B \to A\), but firm \(B\) gets dropped). This turns returners into stayers. This can change the largest connected set of firms, and if data is in collapsed format, requires the data to be recollapsed.
Because of these potential complications, if there are returners, many methods require loops that run until convergence.
These difficulties can be avoided by setting the parameter drop_returns (there are multiple ways to handle returners; they can be seen by running bpd.clean_params().describe('drop_returns')).
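A sketch (the option name below is an assumption; use one of the values listed by the describe call):

[ ]:
# see the available ways to handle returners
bpd.clean_params().describe('drop_returns')

# e.g. drop all observations for returners (hypothetical option name)
bdf = bdf.clean(bpd.clean_params({'drop_returns': 'returners'}))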
Alternative
Another way to handle returners is to drop the t column. Then the data will automatically be sorted by i and j, which eliminates the possibility of returners. However, this prevents conversion to event study format (this can be bypassed with the .construct_artificial_time() method, but the data will likely have a meaningless order, rendering the event study uninterpretable).
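A sketch of this alternative:

[ ]:
# without a t column, rows sort by i and j, so returners cannot arise
bdf_no_t = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y']).clean()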
Advanced clustering
Install Intel(R) Extension for Scikit-learn
The Intel(R) Extension for Scikit-learn (available on GitHub) can speed up KMeans clustering.
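A minimal sketch (the patch must run before scikit-learn is imported, and the .cluster() call assumes default clustering parameters):

[ ]:
# replace stock scikit-learn estimators with Intel-accelerated versions
from sklearnex import patch_sklearn
patch_sklearn()

# clustering that uses KMeans now runs the accelerated implementation
bdf = bdf.clean().cluster()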
Advanced dataframe handling
Disable logging
Logging can slow down basic operations on BipartitePandas dataframes (e.g. data cleaning). Set the parameter log=False when constructing your dataframe to turn off logging.
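For example:

[ ]:
# construct the dataframe with logging disabled
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], log=False)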
Use method chaining with in-place operations
Unlike standard Pandas, BipartitePandas allows method chaining with in-place operations.
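A sketch of what this allows (assuming .collapse() accepts a copy argument like other methods):

[ ]:
# clean and collapse in a single chain, avoiding intermediate copies
bdf = (
    bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t'], log=False)
    .clean(bpd.clean_params({'copy': False}))
    .collapse(copy=False)
)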
Understand the difference between general columns and subcolumns
Users interact with general columns, while BipartitePandas dataframes display subcolumns. As an example, in event study format the columns for firm clusters are labeled g1 and g2; these are the subcolumns of the general column g. If you want to drop firm clusters from the dataframe, rather than dropping g1 and g2 separately, you must drop the general column g. This paradigm applies throughout BipartitePandas, and the documentation will make clear when you should specify general columns.
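A sketch (assuming .cluster() and .to_eventstudy() with default arguments, and a Pandas-style .drop() signature):

[ ]:
bdf_g = bdf.clean().cluster()      # adds firm clusters, general column g
bdf_es = bdf_g.to_eventstudy()     # event study format stores subcolumns g1 and g2
bdf_es = bdf_es.drop('g', axis=1)  # drop the general column g, not g1/g2 separately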
Bypass restrictions
Sometimes BipartitePandas imposes restrictions that you may want to bypass. While BipartitePandas does not provide an explicit way to disable these restrictions, you can bypass them by converting your data into a Pandas dataframe, running the code that was formerly restricted, then converting it back into a BipartitePandas dataframe.
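A sketch of the round trip (the conversion back uses the simpler constructor described in the next section):

[ ]:
import pandas as pd

df_plain = pd.DataFrame(bdf)  # plain Pandas copy; restrictions no longer apply
# ... run the formerly restricted code on df_plain here ...
bdf = bpd.BipartiteDataFrame(df_plain).clean()  # convert back and re-clean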
Simpler constructor
If the columns in your Pandas dataframe are already named correctly, you can simply put the dataframe as a parameter into the BipartitePandas dataframe constructor. Here is an example:
[3]:
bdf = bpd.BipartiteDataFrame(df).clean()
display(bdf)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 66 | -0.884222 | 0 | 1 | -0.967422 | 3 | 0 | -0.348756 |
1 | 0 | 49 | -2.224977 | 1 | 1 | -0.967422 | 2 | 0 | -0.604585 |
2 | 0 | 49 | -1.711429 | 2 | 1 | -0.967422 | 2 | 0 | -0.604585 |
3 | 0 | 115 | -0.082440 | 3 | 2 | -0.967422 | 5 | 0 | 0.114185 |
4 | 0 | 6 | -1.190540 | 4 | 1 | -0.967422 | 0 | 0 | -1.335178 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 146 | 0.505200 | 0 | 1 | 0.000000 | 7 | 2 | 0.604585 |
49996 | 9999 | 114 | -0.095290 | 1 | 1 | 0.000000 | 5 | 2 | 0.114185 |
49997 | 9999 | 114 | -1.240715 | 2 | 0 | 0.000000 | 5 | 2 | 0.114185 |
49998 | 9999 | 114 | 0.701339 | 3 | 0 | 0.000000 | 5 | 2 | 0.114185 |
49999 | 9999 | 114 | -0.610246 | 4 | 0 | 0.000000 | 5 | 2 | 0.114185 |
50000 rows × 9 columns
Restore original ids
To restore original ids, the dataframe needs to track ids as they change; enable this by setting track_id_changes=True when constructing the dataframe.
Notice that in this example we use j / 2, so that j will be modified during data cleaning.
The method .original_ids() will then return a dataframe that merges in the original ids.
[4]:
bdf_original_ids = bpd.BipartiteDataFrame(
i=df['i'], j=(df['j'] / 2), y=df['y'], t=df['t'],
track_id_changes=True
).clean()
display(bdf_original_ids.original_ids())
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | original_j |
|---|---|---|---|---|---|---|
0 | 0 | 0 | -0.884222 | 0 | 1 | 33.0 |
1 | 0 | 1 | -2.224977 | 1 | 1 | 24.5 |
2 | 0 | 1 | -1.711429 | 2 | 1 | 24.5 |
3 | 0 | 2 | -0.082440 | 3 | 2 | 57.5 |
4 | 0 | 3 | -1.190540 | 4 | 1 | 3.0 |
... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 109 | 0.505200 | 0 | 1 | 73.0 |
49996 | 9999 | 28 | -0.095290 | 1 | 1 | 57.0 |
49997 | 9999 | 28 | -1.240715 | 2 | 0 | 57.0 |
49998 | 9999 | 28 | 0.701339 | 3 | 0 | 57.0 |
49999 | 9999 | 28 | -0.610246 | 4 | 0 | 57.0 |
50000 rows × 6 columns
Retrieve and update custom column properties
Custom column properties can be retrieved with the method .get_column_properties() and updated with the method .set_column_properties().
[5]:
display(bdf.get_column_properties('alpha'))
bdf = bdf.set_column_properties(
'alpha', is_categorical=True, dtype='categorical'
)
display(bdf.get_column_properties('alpha'))
{'general_column': 'alpha',
'subcolumns': 'alpha',
'dtype': 'float',
'is_categorical': False,
'how_collapse': 'mean',
'long_es_split': True}
{'general_column': 'alpha',
'subcolumns': 'alpha',
'dtype': 'categorical',
'is_categorical': True,
'how_collapse': 'first',
'long_es_split': True}
Compare dataframes
Dataframes can be compared using the utility function bpd.util.compare_frames().
[6]:
bpd.util.compare_frames(
bdf, bdf.iloc[:len(bdf) // 2],
size_variable='len', operator='geq'
)
[6]:
True
Fill in missing periods as unemployed
The method .fill_missing_periods() (for Long format) will fill in rows for missing intermediate periods.
Hint
Filling in missing periods is a useful way to make sure that .collapse() only collapses worker-firm spells that cover consecutive periods.
In this example, we drop periods 1-3, then fill them in, setting k (firm type) to \(-1\) and l (worker type) to its previous value:
[7]:
bdf_missing = bdf[
(bdf['t'] == 0) | (bdf['t'] == 4)
].clean()
bdf_fill_missing = bdf_missing.fill_missing_periods(
{
'k': -1,
'l': 'prev'
}
)
display(bdf_fill_missing)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 66 | -0.884222 | 0 | 1 | 0 | 3 | 0 | -0.348756 |
1 | 0 | -1 | <NA> | 1 | <NA> | <NA> | -1 | 0 | <NA> |
2 | 0 | -1 | <NA> | 2 | <NA> | <NA> | -1 | 0 | <NA> |
3 | 0 | -1 | <NA> | 3 | <NA> | <NA> | -1 | 0 | <NA> |
4 | 0 | 6 | -1.19054 | 4 | 1 | 0 | 0 | 0 | -1.335178 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 146 | 0.5052 | 0 | 1 | 3 | 7 | 2 | 0.604585 |
49996 | 9999 | -1 | <NA> | 1 | <NA> | <NA> | -1 | 2 | <NA> |
49997 | 9999 | -1 | <NA> | 2 | <NA> | <NA> | -1 | 2 | <NA> |
49998 | 9999 | -1 | <NA> | 3 | <NA> | <NA> | -1 | 2 | <NA> |
49999 | 9999 | 114 | -0.610246 | 4 | 1 | 3 | 5 | 2 | 0.114185 |
50000 rows × 9 columns
Extended event studies
BipartitePandas allows you to use Long format data to generate event studies with more than 2 periods.
You can specify:
- which column signals a transition (e.g. if j is used, a transition is when a worker moves firms)
- which column(s) should be treated as the event study outcome
- how many periods before and after the transition should be considered
- whether the pre- and/or post-trends must be stable, and for which column(s)
We consider an example where j is the transition column, y is the outcome column, and the pre- and post-trends have length 2 and are required to be at the same firm. Note that y_f1 is the first observation after the individual moves firms.
[8]:
es_extended = bdf.get_extended_eventstudy(
transition_col='j', outcomes='y',
periods_pre=2, periods_post=2,
stable_pre='j', stable_post='j'
)
display(es_extended)
| | i | t | y_l2 | y_l1 | y_f1 | y_f2 |
|---|---|---|---|---|---|---|
0 | 7 | 2 | 2.660179 | 1.376463 | 1.965843 | 2.006426 |
1 | 15 | 2 | 1.033709 | 1.605745 | 2.222907 | 1.576384 |
2 | 28 | 3 | 0.814740 | 2.291087 | 2.511761 | 1.557221 |
3 | 33 | 3 | 2.097549 | 0.334050 | 0.727795 | 0.578112 |
4 | 35 | 2 | -2.503758 | -0.281162 | 0.201840 | 0.086961 |
... | ... | ... | ... | ... | ... | ... |
2543 | 9988 | 2 | 0.062628 | -0.286134 | -0.183206 | -1.367690 |
2544 | 9991 | 3 | -3.105353 | 0.688329 | 0.432237 | -0.346683 |
2545 | 9994 | 2 | -0.650689 | -3.249975 | -2.090252 | -2.282304 |
2546 | 9995 | 2 | 0.597124 | -0.159967 | 1.380550 | 2.604956 |
2547 | 9996 | 2 | -0.621562 | -0.356489 | -2.629698 | -1.638295 |
2548 rows × 6 columns
Advanced data simulation
For details on all simulation parameters, run bpd.sim_params().describe_all(), or search through bpd.sim_params().keys() for a particular key, and then run bpd.sim_params().describe(key).