Custom columns
Why do we care?
Specifying custom columns correctly in BipartitePandas is very important. This ensures custom columns interact properly with classes and methods - otherwise, conversions between formats are likely to drop these columns, and method calls may not apply to these columns, may apply to them incorrectly, or may raise errors.
Import the BipartitePandas package
Make sure to install it using pip install bipartitepandas.
[1]:
import bipartitepandas as bpd
Get your data ready
For this notebook, we simulate data.
[2]:
df = bpd.SimBipartite().simulate()
display(df)
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | -1.569165 | 0 | 2 | 1 | 0.000000 | -0.908458 |
| 1 | 0 | 17 | 2.442324 | 1 | 2 | 0 | 0.000000 | -1.335178 |
| 2 | 0 | 17 | -1.307551 | 2 | 2 | 0 | 0.000000 | -1.335178 |
| 3 | 0 | 17 | -1.551354 | 3 | 2 | 0 | 0.000000 | -1.335178 |
| 4 | 0 | 13 | -0.789661 | 4 | 2 | 0 | 0.000000 | -1.335178 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 102 | -1.493225 | 0 | 1 | 5 | -0.430727 | 0.114185 |
| 49996 | 9999 | 116 | 2.368321 | 1 | 1 | 5 | -0.430727 | 0.114185 |
| 49997 | 9999 | 76 | -2.070787 | 2 | 1 | 3 | -0.430727 | -0.348756 |
| 49998 | 9999 | 23 | -1.203733 | 3 | 1 | 1 | -0.430727 | -0.908458 |
| 49999 | 9999 | 23 | 0.132797 | 4 | 1 | 1 | -0.430727 | -0.908458 |
50000 rows × 8 columns
Columns

BipartitePandas includes seven pre-defined general columns:

Required

- i: worker id (any type)
- j: firm id (any type)
- y: income (float or int)

Optional

- t: time (int)
- g: firm type (any type)
- w: weight (float or int)
- m: move indicator (int)
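For example, only the required columns are needed to construct a DataFrame. Here is a minimal sketch reusing the simulated data from above (bdf_min is a hypothetical name, not part of the library):

# Only i, j, and y are required; optional and custom columns can be added later
bdf_min = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'])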
Formats
BipartitePandas includes four formats:
- Long - each row gives a single observation
- Collapsed Long - like Long, but employment spells at the same firm are collapsed into a single observation
- Event Study - each row gives two consecutive observations
- Collapsed Event Study - like Event Study, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently:

- Long - i, j, y, t, g, w, m
- Collapsed Long - i, j, y, t1, t2, g, w, m
- Event Study - i, j1, j2, y1, y2, t1, t2, g1, g2, w1, w2, m
- Collapsed Event Study - i, j1, j2, y1, y2, t11, t12, t21, t22, g1, g2, w1, w2, m
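These formats connect through conversion methods. The following sketch uses .collapse() and .to_eventstudy() as they appear later in this notebook; chaining them in this order is one illustration, not the only way to reach each format:

# Moving between the four formats (t is needed to collapse spells)
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t']).clean()
bdf_collapsed = bdf.collapse()            # Long -> Collapsed Long
bdf_es = bdf.to_eventstudy()              # Long -> Event Study
bdf_ces = bdf_collapsed.to_eventstudy()   # Collapsed Long -> Collapsed Event Study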
Constructing DataFrames
Our simulated data is in Long format, but includes columns that aren’t pre-defined. How do we construct a Long dataframe that includes these columns?
[3]:
bdf_long = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t'],
l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi']
)
display(bdf_long)
| | i | j | y | t | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | -1.569165 | 0 | 0.000000 | 1 | 2 | -0.908458 |
| 1 | 0 | 17 | 2.442324 | 1 | 0.000000 | 0 | 2 | -1.335178 |
| 2 | 0 | 17 | -1.307551 | 2 | 0.000000 | 0 | 2 | -1.335178 |
| 3 | 0 | 17 | -1.551354 | 3 | 0.000000 | 0 | 2 | -1.335178 |
| 4 | 0 | 13 | -0.789661 | 4 | 0.000000 | 0 | 2 | -1.335178 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 102 | -1.493225 | 0 | -0.430727 | 5 | 1 | 0.114185 |
| 49996 | 9999 | 116 | 2.368321 | 1 | -0.430727 | 5 | 1 | 0.114185 |
| 49997 | 9999 | 76 | -2.070787 | 2 | -0.430727 | 3 | 1 | -0.348756 |
| 49998 | 9999 | 23 | -1.203733 | 3 | -0.430727 | 1 | 1 | -0.908458 |
| 49999 | 9999 | 23 | 0.132797 | 4 | -0.430727 | 1 | 1 | -0.908458 |
50000 rows × 8 columns
Are we sure this is Long format? Let’s check the datatype:
[4]:
type(bdf_long)
[4]:
bipartitepandas.bipartitelong.BipartiteLong
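We can also check which columns are registered by inspecting the DataFrame’s .col_reference_dict attribute (the same internal mapping named in the error message at the end of this notebook); a quick sketch:

# Mapping from each registered column, including our custom ones,
# to its underlying column reference(s)
print(bdf_long.col_reference_dict)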
Categorical columns

What if we want to specify that a column should be categorical? Then we should specify custom_categorical_dict!

Note

alpha is a float column, and BipartiteDataFrame automatically sets float columns to collapse by mean. Categorical columns cannot be collapsed by mean, so if we mark alpha as categorical, we must also specify that it should collapse by 'first' ('last' or None also work). In addition, categorical columns must use the datatype 'categorical'.
[5]:
bdf_long = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t'],
l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'],
custom_categorical_dict={'alpha': True},
custom_dtype_dict={'alpha': 'categorical'},
custom_how_collapse_dict={'alpha': 'first'}
).clean()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | -1.569165 | 0 | 1 | 0 | 1 | 2 | -0.908458 |
| 1 | 0 | 17 | 2.442324 | 1 | 1 | 0 | 0 | 2 | -1.335178 |
| 2 | 0 | 17 | -1.307551 | 2 | 0 | 0 | 0 | 2 | -1.335178 |
| 3 | 0 | 17 | -1.551354 | 3 | 1 | 0 | 0 | 2 | -1.335178 |
| 4 | 0 | 13 | -0.789661 | 4 | 1 | 0 | 0 | 2 | -1.335178 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 102 | -1.493225 | 0 | 1 | 1 | 5 | 1 | 0.114185 |
| 49996 | 9999 | 116 | 2.368321 | 1 | 2 | 1 | 5 | 1 | 0.114185 |
| 49997 | 9999 | 76 | -2.070787 | 2 | 2 | 1 | 3 | 1 | -0.348756 |
| 49998 | 9999 | 23 | -1.203733 | 3 | 1 | 1 | 1 | 1 | -0.908458 |
| 49999 | 9999 | 23 | 0.132797 | 4 | 0 | 1 | 1 | 1 | -0.908458 |
50000 rows × 9 columns
Collapsing data

What if, instead of collapsing by mean, we want a column to collapse by 'first', or even to drop when we collapse? Then we should specify custom_how_collapse_dict!
[6]:
bdf_long = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t'],
l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'],
custom_how_collapse_dict={'alpha': None, 'psi': 'first'}
).clean().collapse()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t1 | t2 | w | m | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | -1.569165 | 0 | 0 | 1 | 1 | 1.0 | 2.0 | -0.908458 |
| 1 | 0 | 17 | -0.138861 | 1 | 3 | 3 | 2 | 0.0 | 2.0 | -1.335178 |
| 2 | 0 | 13 | -0.789661 | 4 | 4 | 1 | 1 | 0.0 | 2.0 | -1.335178 |
| 3 | 1 | 52 | -0.653218 | 0 | 0 | 1 | 1 | 2.0 | 1.0 | -0.604585 |
| 4 | 1 | 49 | 0.676861 | 1 | 2 | 2 | 2 | 2.0 | 1.0 | -0.604585 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29820 | 9998 | 52 | -2.618451 | 4 | 4 | 1 | 1 | 2.0 | 0.0 | -0.604585 |
| 29821 | 9999 | 102 | -1.493225 | 0 | 0 | 1 | 1 | 5.0 | 1.0 | 0.114185 |
| 29822 | 9999 | 116 | 2.368321 | 1 | 1 | 1 | 2 | 5.0 | 1.0 | 0.114185 |
| 29823 | 9999 | 76 | -2.070787 | 2 | 2 | 1 | 2 | 3.0 | 1.0 | -0.348756 |
| 29824 | 9999 | 23 | -0.535468 | 3 | 4 | 2 | 1 | 1.0 | 1.0 | -0.908458 |
29825 rows × 10 columns
Warning

Collapsing by 'first', 'last', 'mean', and 'sum' will uncollapse correctly (although information may be lost); any other option (e.g. 'var' or 'std') is not guaranteed to uncollapse correctly.
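As a sketch of what this means in practice (assuming the .uncollapse() method, the inverse of .collapse()):

# Uncollapse the collapsed data: with custom_how_collapse_dict={'psi': 'first'},
# every uncollapsed row in a spell repeats the spell's first psi value, so any
# original within-spell variation in psi is not recovered
bdf_uncollapsed = bdf_long.uncollapse()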
Converting between (collapsed) long and (collapsed) event study formats

What if we don’t want a column to split when converting to event study, or if we want it to drop during the conversion? Then we should specify custom_long_es_split_dict!
[7]:
bdf_long = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t'],
l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'],
custom_long_es_split_dict={'alpha': False, 'psi': None}
).clean().to_eventstudy()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j1 | j2 | y1 | y2 | t1 | t2 | m | alpha | k1 | k2 | l1 | l2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | 17 | -1.569165 | 2.442324 | 0 | 1 | 1 | 0.000000 | 1 | 0 | 2 | 2 |
| 1 | 0 | 17 | 17 | 2.442324 | -1.307551 | 1 | 2 | 0 | 0.000000 | 0 | 0 | 2 | 2 |
| 2 | 0 | 17 | 17 | -1.307551 | -1.551354 | 2 | 3 | 0 | 0.000000 | 0 | 0 | 2 | 2 |
| 3 | 0 | 17 | 13 | -1.551354 | -0.789661 | 3 | 4 | 1 | 0.000000 | 0 | 0 | 2 | 2 |
| 4 | 1 | 52 | 49 | -0.653218 | 1.597527 | 0 | 1 | 1 | -0.430727 | 2 | 2 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40657 | 9998 | 2 | 52 | -2.069815 | -2.618451 | 3 | 4 | 1 | -0.967422 | 0 | 2 | 0 | 0 |
| 40658 | 9999 | 102 | 116 | -1.493225 | 2.368321 | 0 | 1 | 1 | -0.430727 | 5 | 5 | 1 | 1 |
| 40659 | 9999 | 116 | 76 | 2.368321 | -2.070787 | 1 | 2 | 1 | -0.430727 | 5 | 3 | 1 | 1 |
| 40660 | 9999 | 76 | 23 | -2.070787 | -1.203733 | 2 | 3 | 1 | -0.430727 | 3 | 1 | 1 | 1 |
| 40661 | 9999 | 23 | 23 | -1.203733 | 0.132797 | 3 | 4 | 0 | -0.430727 | 1 | 1 | 1 | 1 |
40662 rows × 13 columns
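Converting back follows the same rules; a sketch assuming the .to_long() method (the inverse of .to_eventstudy()):

# Convert the event study back to Long format; psi, which was dropped by
# custom_long_es_split_dict={'psi': None}, is not recovered
bdf_back = bdf_long.to_long()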
Adding custom columns to an instantiated DataFrame

Use the method .add_column() to add custom columns to a DataFrame that has already been instantiated.
[8]:
bdf_long = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t'],
l=df['l'], k=df['k']
)
bdf_long = bdf_long.add_column('alpha', df['alpha'], is_categorical=True, dtype='categorical')
bdf_long = bdf_long.add_column('psi', df['psi'], is_categorical=True, dtype='categorical')
bdf_long = bdf_long.clean()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
making 'psi' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m | alpha | k | l | psi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | -1.569165 | 0 | 1 | 0 | 1 | 2 | 0 |
| 1 | 0 | 17 | 2.442324 | 1 | 1 | 0 | 0 | 2 | 1 |
| 2 | 0 | 17 | -1.307551 | 2 | 0 | 0 | 0 | 2 | 1 |
| 3 | 0 | 17 | -1.551354 | 3 | 1 | 0 | 0 | 2 | 1 |
| 4 | 0 | 13 | -0.789661 | 4 | 1 | 0 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 102 | -1.493225 | 0 | 1 | 1 | 5 | 1 | 4 |
| 49996 | 9999 | 116 | 2.368321 | 1 | 2 | 1 | 5 | 1 | 4 |
| 49997 | 9999 | 76 | -2.070787 | 2 | 2 | 1 | 3 | 1 | 6 |
| 49998 | 9999 | 23 | -1.203733 | 3 | 1 | 1 | 1 | 1 | 0 |
| 49999 | 9999 | 23 | 0.132797 | 4 | 0 | 1 | 1 | 1 | 0 |
50000 rows × 9 columns
Finally, here is what happens if we add custom columns incorrectly: cleaning raises an error. If instead we tried to bypass data cleaning and immediately convert between data formats, the custom columns would be dropped during the conversion.
[9]:
bdf_long = bpd.BipartiteDataFrame(
i=df['i'], j=df['j'], y=df['y'], t=df['t'],
l=df['l'], k=df['k']
)
bdf_long['alpha'] = df['alpha']
bdf_long['psi'] = df['psi']
bdf_long = bdf_long.clean()
checking required columns and datatypes
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [9], in <cell line: 7>()
5 bdf_long['alpha'] = df['alpha']
6 bdf_long['psi'] = df['psi']
----> 7 bdf_long = bdf_long.clean()
File ~/opt/anaconda3/envs/stata-env/lib/python3.9/site-packages/bipartitepandas/bipartitelong.py:129, in BipartiteLong.clean(self, params)
126 params['connectedness'] = None
128 ## Initial cleaning ##
--> 129 frame = super().clean(params)
131 if collapse:
132 ## Collapse then compute largest connected set ##
133 # Update parameters
134 level_dict = {
135 'leave_out_spell': 'spell',
136 'leave_out_match': 'match'
137 }
File ~/opt/anaconda3/envs/stata-env/lib/python3.9/site-packages/bipartitepandas/bipartitelongbase.py:104, in BipartiteLongBase.clean(self, params)
102 if verbose:
103 tqdm.write('checking required columns and datatypes')
--> 104 frame._check_cols()
106 # Next, sort rows
107 self.log('sorting rows', level='info')
File ~/opt/anaconda3/envs/stata-env/lib/python3.9/site-packages/bipartitepandas/bipartitebase.py:1196, in BipartiteBase._check_cols(self)
1194 error_msg = f"{col} is included in the dataframe but is not saved in .col_reference_dict. Please initialize your BipartiteBase object to include this column by setting 'col_reference_dict=your_col_reference_dict'."
1195 self.log(error_msg, level='info')
-> 1196 raise ValueError(error_msg)
ValueError: alpha is included in the dataframe but is not saved in .col_reference_dict. Please initialize your BipartiteBase object to include this column by setting 'col_reference_dict=your_col_reference_dict'.
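As shown in the previous example, the fix is to register the columns with .add_column() before cleaning. A minimal sketch, assuming .add_column()'s default options treat a float column the way the constructor does (collapse by mean):

# Rebuild, registering the custom columns via .add_column() before cleaning
bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k']
)
bdf_long = bdf_long.add_column('alpha', df['alpha'])
bdf_long = bdf_long.add_column('psi', df['psi'])
bdf_long = bdf_long.clean()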