Custom columns

Why do we care?

Specifying custom columns correctly in BipartitePandas is very important. This ensures custom columns interact properly with classes and methods - otherwise, conversions between formats are likely to drop these columns, and method calls may skip them, apply to them incorrectly, or raise errors.

Import the BipartitePandas package

Make sure to install it using pip install bipartitepandas.

[1]:
import bipartitepandas as bpd

Get your data ready

For this notebook, we simulate data.

[2]:
df = bpd.SimBipartite().simulate()
display(df)
i j y t l k alpha psi
0 0 36 -1.569165 0 2 1 0.000000 -0.908458
1 0 17 2.442324 1 2 0 0.000000 -1.335178
2 0 17 -1.307551 2 2 0 0.000000 -1.335178
3 0 17 -1.551354 3 2 0 0.000000 -1.335178
4 0 13 -0.789661 4 2 0 0.000000 -1.335178
... ... ... ... ... ... ... ... ...
49995 9999 102 -1.493225 0 1 5 -0.430727 0.114185
49996 9999 116 2.368321 1 1 5 -0.430727 0.114185
49997 9999 76 -2.070787 2 1 3 -0.430727 -0.348756
49998 9999 23 -1.203733 3 1 1 -0.430727 -0.908458
49999 9999 23 0.132797 4 1 1 -0.430727 -0.908458

50000 rows × 8 columns

Columns

BipartitePandas includes seven pre-defined general columns (a minimal construction sketch follows the list):

Required

  • i: worker id (any type)

  • j: firm id (any type)

  • y: income (float or int)

Optional

  • t: time (int)

  • g: firm type (any type)

  • w: weight (float or int)

  • m: move indicator (int)
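As a quick illustration of the required/optional split, only the required columns are needed to construct a DataFrame. A minimal sketch, reusing the simulated df from above (bdf_min is just an illustrative name):

# A minimal sketch: only the required columns i, j, and y are passed;
# optional columns such as t can be added or omitted freely
bdf_min = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'])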

Formats

BipartitePandas includes four formats:

  • Long - each row gives a single observation

  • Collapsed Long - like Long, but employment spells at the same firm are collapsed into a single observation

  • Event Study - each row gives two consecutive observations

  • Collapsed Event Study - like Event Study, but employment spells at the same firm are collapsed into a single observation

These formats divide general columns differently (a sketch demonstrating this follows the list):

  • Long - i, j, y, t, g, w, m

  • Collapsed Long - i, j, y, t1, t2, g, w, m

  • Event Study - i, j1, j2, y1, y2, t1, t2, g1, g2, w1, w2, m

  • Collapsed Event Study - i, j1, j2, y1, y2, t11, t12, t21, t22, g1, g2, w1, w2, m
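A quick way to see this division in practice - a sketch using only pre-defined columns from our simulated data (exact column order may differ):

# Construct with pre-defined columns only, then check how they divide across formats
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t']).clean()
print(list(bdf.columns))                  # Long: i, j, y, t, m
print(list(bdf.collapse().columns))       # Collapsed Long: i, j, y, t1, t2, w, m
print(list(bdf.to_eventstudy().columns))  # Event Study: i, j1, j2, y1, y2, t1, t2, m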

Constructing DataFrames

Our simulated data is in Long format, but includes columns that aren’t pre-defined. How do we construct a Long dataframe that includes these columns?

[3]:
bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi']
)
display(bdf_long)
i j y t alpha k l psi
0 0 36 -1.569165 0 0.000000 1 2 -0.908458
1 0 17 2.442324 1 0.000000 0 2 -1.335178
2 0 17 -1.307551 2 0.000000 0 2 -1.335178
3 0 17 -1.551354 3 0.000000 0 2 -1.335178
4 0 13 -0.789661 4 0.000000 0 2 -1.335178
... ... ... ... ... ... ... ... ...
49995 9999 102 -1.493225 0 -0.430727 5 1 0.114185
49996 9999 116 2.368321 1 -0.430727 5 1 0.114185
49997 9999 76 -2.070787 2 -0.430727 3 1 -0.348756
49998 9999 23 -1.203733 3 -0.430727 1 1 -0.908458
49999 9999 23 0.132797 4 -0.430727 1 1 -0.908458

50000 rows × 8 columns

Are we sure this is in Long format? Let’s check the type:

[4]:
type(bdf_long)
[4]:
bipartitepandas.bipartitelong.BipartiteLong

Categorical columns

What if we want to specify that a column is categorical? Then we should specify custom_categorical_dict!

Note

alpha is a float column, and BipartiteDataFrame automatically sets float columns to collapse by mean. Categorical columns cannot be collapsed by mean, so if we mark alpha as categorical, we must also specify that it should collapse by first (last and None also work). In addition, categorical columns must use the datatype 'categorical'.

[5]:
bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'],
    custom_categorical_dict={'alpha': True},     # mark alpha as categorical
    custom_dtype_dict={'alpha': 'categorical'},  # categorical columns require this dtype
    custom_how_collapse_dict={'alpha': 'first'}  # categorical columns can't collapse by mean
).clean()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
i j y t m alpha k l psi
0 0 36 -1.569165 0 1 0 1 2 -0.908458
1 0 17 2.442324 1 1 0 0 2 -1.335178
2 0 17 -1.307551 2 0 0 0 2 -1.335178
3 0 17 -1.551354 3 1 0 0 2 -1.335178
4 0 13 -0.789661 4 1 0 0 2 -1.335178
... ... ... ... ... ... ... ... ... ...
49995 9999 102 -1.493225 0 1 1 5 1 0.114185
49996 9999 116 2.368321 1 2 1 5 1 0.114185
49997 9999 76 -2.070787 2 2 1 3 1 -0.348756
49998 9999 23 -1.203733 3 1 1 1 1 -0.908458
49999 9999 23 0.132797 4 0 1 1 1 -0.908458

50000 rows × 9 columns

Collapsing data

What if, instead of collapsing by the mean, we want a column to collapse by first, or even to be dropped when we collapse? Then we should specify custom_how_collapse_dict!

[6]:
bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'],
    # None: drop alpha when collapsing; 'first': keep psi's first value in each spell
    custom_how_collapse_dict={'alpha': None, 'psi': 'first'}
).clean().collapse()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
i j y t1 t2 w m k l psi
0 0 36 -1.569165 0 0 1 1 1.0 2.0 -0.908458
1 0 17 -0.138861 1 3 3 2 0.0 2.0 -1.335178
2 0 13 -0.789661 4 4 1 1 0.0 2.0 -1.335178
3 1 52 -0.653218 0 0 1 1 2.0 1.0 -0.604585
4 1 49 0.676861 1 2 2 2 2.0 1.0 -0.604585
... ... ... ... ... ... ... ... ... ... ...
29820 9998 52 -2.618451 4 4 1 1 2.0 0.0 -0.604585
29821 9999 102 -1.493225 0 0 1 1 5.0 1.0 0.114185
29822 9999 116 2.368321 1 1 1 2 5.0 1.0 0.114185
29823 9999 76 -2.070787 2 2 1 2 3.0 1.0 -0.348756
29824 9999 23 -0.535468 3 4 2 1 1.0 1.0 -0.908458

29825 rows × 10 columns

Warning

Collapsing by first, last, mean, and sum will uncollapse correctly (although information may be lost); any other option (e.g. var or std) is not guaranteed to uncollapse correctly.
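To make the warning concrete, here is a sketch of the round trip it refers to, assuming the collapsed formats expose an .uncollapse() method: y collapses to the spell mean, so the uncollapsed data repeats that mean in every year of the spell, and the original year-by-year values are lost.

# Collapse, then uncollapse (.uncollapse() is assumed here); the format round-trips,
# but each year of a spell now carries the spell mean of y, not the original value
bdf_roundtrip = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t']).clean().collapse()
display(bdf_roundtrip.uncollapse())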

Converting between (collapsed) long and (collapsed) event study formats

What if we don’t want a column to split when converting to event study format, or if we want it to be dropped during the conversion? Then we should specify custom_long_es_split_dict!

[7]:
bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k'], alpha=df['alpha'], psi=df['psi'],
    # False: keep alpha as a single column; None: drop psi during the conversion
    custom_long_es_split_dict={'alpha': False, 'psi': None}
).clean().to_eventstudy()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
i j1 j2 y1 y2 t1 t2 m alpha k1 k2 l1 l2
0 0 36 17 -1.569165 2.442324 0 1 1 0.000000 1 0 2 2
1 0 17 17 2.442324 -1.307551 1 2 0 0.000000 0 0 2 2
2 0 17 17 -1.307551 -1.551354 2 3 0 0.000000 0 0 2 2
3 0 17 13 -1.551354 -0.789661 3 4 1 0.000000 0 0 2 2
4 1 52 49 -0.653218 1.597527 0 1 1 -0.430727 2 2 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
40657 9998 2 52 -2.069815 -2.618451 3 4 1 -0.967422 0 2 0 0
40658 9999 102 116 -1.493225 2.368321 0 1 1 -0.430727 5 5 1 1
40659 9999 116 76 2.368321 -2.070787 1 2 1 -0.430727 5 3 1 1
40660 9999 76 23 -2.070787 -1.203733 2 3 1 -0.430727 3 1 1 1
40661 9999 23 23 -1.203733 0.132797 3 4 0 -0.430727 1 1 1 1

40662 rows × 13 columns
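Converting back works the same way; here is a brief sketch using .to_long() (since psi was dropped on the way to event study format, it does not reappear):

# A sketch: convert the event study back to Long format;
# alpha (never split) returns as a single column, while psi (dropped) does not come back
display(bdf_long.to_long())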

Adding custom columns to an instantiated DataFrame

Use the method .add_column() to add custom columns to a DataFrame that has already been instantiated.

[8]:
bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k']
)
# register alpha and psi as categorical custom columns
bdf_long = bdf_long.add_column('alpha', df['alpha'], is_categorical=True, dtype='categorical')
bdf_long = bdf_long.add_column('psi', df['psi'], is_categorical=True, dtype='categorical')
bdf_long = bdf_long.clean()
display(bdf_long)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
making 'alpha' ids contiguous
making 'psi' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
i j y t m alpha k l psi
0 0 36 -1.569165 0 1 0 1 2 0
1 0 17 2.442324 1 1 0 0 2 1
2 0 17 -1.307551 2 0 0 0 2 1
3 0 17 -1.551354 3 1 0 0 2 1
4 0 13 -0.789661 4 1 0 0 2 1
... ... ... ... ... ... ... ... ... ...
49995 9999 102 -1.493225 0 1 1 5 1 4
49996 9999 116 2.368321 1 2 1 5 1 4
49997 9999 76 -2.070787 2 2 1 3 1 6
49998 9999 23 -1.203733 3 1 1 1 1 0
49999 9999 23 0.132797 4 0 1 1 1 0

50000 rows × 9 columns

Here, we see what happens if we add custom columns incorrectly: in this example, cleaning raises an error. If we instead bypass data cleaning and immediately convert between data formats, the custom columns will be dropped during the conversion (a sketch of that case follows the traceback below).

[9]:
bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k']
)
bdf_long['alpha'] = df['alpha']
bdf_long['psi'] = df['psi']
bdf_long = bdf_long.clean()
checking required columns and datatypes
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [9], in <cell line: 7>()
      5 bdf_long['alpha'] = df['alpha']
      6 bdf_long['psi'] = df['psi']
----> 7 bdf_long = bdf_long.clean()

File ~/opt/anaconda3/envs/stata-env/lib/python3.9/site-packages/bipartitepandas/bipartitelong.py:129, in BipartiteLong.clean(self, params)
    126     params['connectedness'] = None
    128 ## Initial cleaning ##
--> 129 frame = super().clean(params)
    131 if collapse:
    132     ## Collapse then compute largest connected set ##
    133     # Update parameters
    134     level_dict = {
    135         'leave_out_spell': 'spell',
    136         'leave_out_match': 'match'
    137     }

File ~/opt/anaconda3/envs/stata-env/lib/python3.9/site-packages/bipartitepandas/bipartitelongbase.py:104, in BipartiteLongBase.clean(self, params)
    102 if verbose:
    103     tqdm.write('checking required columns and datatypes')
--> 104 frame._check_cols()
    106 # Next, sort rows
    107 self.log('sorting rows', level='info')

File ~/opt/anaconda3/envs/stata-env/lib/python3.9/site-packages/bipartitepandas/bipartitebase.py:1196, in BipartiteBase._check_cols(self)
   1194 error_msg = f"{col} is included in the dataframe but is not saved in .col_reference_dict. Please initialize your BipartiteBase object to include this column by setting 'col_reference_dict=your_col_reference_dict'."
   1195 self.log(error_msg, level='info')
-> 1196 raise ValueError(error_msg)

ValueError: alpha is included in the dataframe but is not saved in .col_reference_dict. Please initialize your BipartiteBase object to include this column by setting 'col_reference_dict=your_col_reference_dict'.
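For completeness, here is a sketch of the second failure mode described above: bypassing .clean() and converting directly. As noted, the unregistered columns should be dropped during the conversion.

bdf_long = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t'],
    l=df['l'], k=df['k']
)
bdf_long['alpha'] = df['alpha']  # assigned directly, so never registered
bdf_es = bdf_long.to_eventstudy()
print('alpha' in bdf_es.columns)  # expected: False, the unregistered column is dropped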