Simple example

Import the BipartitePandas package

If it is not already installed, install it first with pip install bipartitepandas.

[1]:
import bipartitepandas as bpd

Get your data ready

For this notebook, we simulate data, setting firm_size and p_move so that data cleaning is interesting.

[2]:
df = bpd.SimBipartite(
    bpd.sim_params(
        {
            'firm_size': 10,
            'p_move': 0.05
        }
    )
).simulate()
display(df)
          i    j          y  t  l  k      alpha        psi
0         0  416  -0.935824  0  2  4   0.000000  -0.114185
1         0  416   0.903535  1  2  4   0.000000  -0.114185
2         0  416  -0.466674  2  2  4   0.000000  -0.114185
3         0  416   0.163563  3  2  4   0.000000  -0.114185
4         0  416   0.602699  4  2  4   0.000000  -0.114185
...
49995  9999  344  -0.781453  0  1  3  -0.430727  -0.348756
49996  9999  344   0.008461  1  1  3  -0.430727  -0.348756
49997  9999  344  -0.959677  2  1  3  -0.430727  -0.348756
49998  9999  344   0.068173  3  1  3  -0.430727  -0.348756
49999  9999  344  -0.733225  4  1  3  -0.430727  -0.348756

50000 rows × 8 columns

Columns

BipartitePandas includes seven pre-defined general columns:

Required

  • i: worker id (any type)

  • j: firm id (any type)

  • y: income (float or int)

Optional

  • t: time (int)

  • g: firm type (any type)

  • w: weight (float or int)

  • m: move indicator (int)
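To make the move indicator m concrete, here is a minimal pure-Python sketch. It assumes one plausible definition: for each observation, m counts how many of the worker's adjacent observations (previous and next in time) are at a different firm. The helper name add_move_indicator is hypothetical, and the definition BipartitePandas actually uses may differ.

```python
def add_move_indicator(rows):
    """Attach a move indicator to (i, j, t) observations.

    Assumed definition (not confirmed by the library docs): m is the number
    of the worker's neighbouring observations, previous and next in time,
    that are at a different firm.
    """
    # Sort by worker id, then by time, so neighbours are adjacent in the list
    rows = sorted(rows, key=lambda r: (r[0], r[2]))
    out = []
    for idx, (i, j, t) in enumerate(rows):
        m = 0
        # Previous observation belongs to the same worker but another firm
        if idx > 0 and rows[idx - 1][0] == i and rows[idx - 1][1] != j:
            m += 1
        # Next observation belongs to the same worker but another firm
        if idx + 1 < len(rows) and rows[idx + 1][0] == i and rows[idx + 1][1] != j:
            m += 1
        out.append((i, j, t, m))
    return out


# Worker 0 moves from firm 'A' to firm 'B' between t=1 and t=2; worker 1 never moves
rows = [(0, 'A', 0), (0, 'A', 1), (0, 'B', 2), (1, 'C', 0), (1, 'C', 1)]
print([m for (_, _, _, m) in add_move_indicator(rows)])  # [0, 1, 1, 0, 0]
```

Under this definition, a worker who never changes firms has m equal to 0 on every observation.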

Constructing DataFrames

To construct a BipartiteDataFrame, pass the required columns (plus any optional columns you want to include) as keyword arguments.

[3]:
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)
          i    j          y  t
0         0  416  -0.935824  0
1         0  416   0.903535  1
2         0  416  -0.466674  2
3         0  416   0.163563  3
4         0  416   0.602699  4
...
49995  9999  344  -0.781453  0
49996  9999  344   0.008461  1
49997  9999  344  -0.959677  2
49998  9999  344   0.068173  3
49999  9999  344  -0.733225  4

50000 rows × 4 columns

Now that we have our dataframe, let’s check out some summary statistics:

[4]:
bdf.summary()
format: 'BipartiteLong'
number of workers: 10000
number of firms: 999
number of observations: 50000
mean wage: 0.008291372114450858
median wage: 0.010194526935672465
min wage: -5.6066321450713374
max wage: 6.086789605058163
var(wage): 2.64936416420792
no NaN values: False
no duplicates: False
i-t (worker-year) observations unique (None if t column(s) not included): False
no returns (None if not yet computed): None
contiguous 'i' ids (None if not included): False
contiguous 'j' ids (None if not included): False
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): None

Let’s clean our data and make sure the result is leave-one-observation-out connected:

Hint

Want details on all cleaning parameters? Run bpd.clean_params().describe_all(), or search through bpd.clean_params().keys() for a particular key, and then run bpd.clean_params().describe(key).

[5]:
bdf = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation'
        }
    )
)
display(bdf)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
making 'j' ids contiguous
sorting columns
resetting index
          i    j          y  t  m
0         0    0   1.306939  0  0
1         0    0  -0.005591  1  0
2         0    0  -0.192813  2  0
3         0    0   2.537212  3  0
4         0    0   1.756664  4  0
...
44576  8962  797  -0.781453  0  0
44577  8962  797   0.008461  1  0
44578  8962  797  -0.959677  2  0
44579  8962  797   0.068173  3  0
44580  8962  797  -0.733225  4  0

44581 rows × 5 columns
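Two of the cleaning steps logged above can be sketched in plain Python: keeping the highest-paying job among i-t duplicates (how='max') and making ids contiguous. This only illustrates the ideas and is not the library's implementation. The helper names keep_highest_paying and make_contiguous are hypothetical, and numbering ids in order of first appearance is an assumption.

```python
def keep_highest_paying(rows):
    """For each (worker, time) pair, keep the observation with the largest y.

    rows: iterable of (i, j, y, t) tuples.
    """
    best = {}
    for i, j, y, t in rows:
        key = (i, t)
        # Keep this row if it is the first for (i, t) or pays more than the current best
        if key not in best or y > best[key][2]:
            best[key] = (i, j, y, t)
    # Return rows ordered by worker, then time
    return sorted(best.values(), key=lambda r: (r[0], r[3]))


def make_contiguous(ids):
    """Relabel arbitrary ids as 0, 1, ..., n-1 in order of first appearance (assumed order)."""
    mapping = {}
    for x in ids:
        # len(mapping) is evaluated before insertion, so a new id gets the next free integer
        mapping.setdefault(x, len(mapping))
    return [mapping[x] for x in ids], mapping


# Worker 7 has two observations at t=0; only the higher-paying one survives
rows = [(7, 416, 1.0, 0), (7, 344, 2.5, 0), (7, 416, 0.3, 1), (9, 344, -1.0, 0)]
cleaned = keep_highest_paying(rows)
print(cleaned)  # [(7, 344, 2.5, 0), (7, 416, 0.3, 1), (9, 344, -1.0, 0)]

new_j, j_map = make_contiguous([r[1] for r in cleaned])
print(new_j)  # [0, 1, 0]
```

Contiguous ids are remade after the connected set is computed because dropping workers and firms can leave gaps in the numbering, which is why those steps appear twice in the log.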

We can check how the summary statistics changed:

[6]:
bdf.summary()
format: 'BipartiteLong'
number of workers: 8963
number of firms: 874
number of observations: 44581
mean wage: 0.0043933052888950955
median wage: 0.005410443676386101
min wage: -5.6066321450713374
max wage: 6.086789605058163
var(wage): 2.658089932617942
no NaN values: True
no duplicates: True
i-t (worker-year) observations unique (None if t column(s) not included): True
no returns (None if not yet computed): True
contiguous 'i' ids (None if not included): True
contiguous 'j' ids (None if not included): True
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): 'leave_out_observation'
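To illustrate what connectedness means here, the sketch below finds the largest connected set of firms in the worker-firm bipartite graph using union-find. It implements plain connectedness only, which is weaker than the 'leave_out_observation' option used above. The function name largest_connected_firms is hypothetical, and measuring component size by the number of firms is an assumption.

```python
from collections import defaultdict


def largest_connected_firms(pairs):
    """Return the largest set of firms linked through shared workers.

    pairs: iterable of (worker, firm) observations. Size is measured by the
    number of firms (an assumption made for this sketch).
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Each observation is an edge between a worker node and a firm node
    for i, j in pairs:
        union(('worker', i), ('firm', j))

    # Group firm nodes by the root of their component
    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == 'firm':
            groups[find(node)].add(node[1])

    return max(groups.values(), key=len) if groups else set()


# Firms A, B, C are linked by workers 0 and 1; firms D, E are linked by worker 3
pairs = [(0, 'A'), (0, 'B'), (1, 'B'), (1, 'C'), (2, 'D'), (3, 'D'), (3, 'E')]
print(sorted(largest_connected_firms(pairs)))  # ['A', 'B', 'C']
```

Leave-one-observation-out connectedness additionally requires that the set remain connected after any single observation is dropped. That check is not implemented in this sketch.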