Simple example
Import the BipartitePandas package
Make sure to install it using pip install bipartitepandas.
[1]:
import bipartitepandas as bpd
Get your data ready
For this notebook, we simulate data (we set parameters to make data cleaning interesting).
[2]:
df = bpd.SimBipartite(
    bpd.sim_params(
        {
            'firm_size': 10,
            'p_move': 0.05
        }
    )
).simulate()
display(df)
 | i | j | y | t | l | k | alpha | psi |
---|---|---|---|---|---|---|---|---|
0 | 0 | 416 | -0.935824 | 0 | 2 | 4 | 0.000000 | -0.114185 |
1 | 0 | 416 | 0.903535 | 1 | 2 | 4 | 0.000000 | -0.114185 |
2 | 0 | 416 | -0.466674 | 2 | 2 | 4 | 0.000000 | -0.114185 |
3 | 0 | 416 | 0.163563 | 3 | 2 | 4 | 0.000000 | -0.114185 |
4 | 0 | 416 | 0.602699 | 4 | 2 | 4 | 0.000000 | -0.114185 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 344 | -0.781453 | 0 | 1 | 3 | -0.430727 | -0.348756 |
49996 | 9999 | 344 | 0.008461 | 1 | 1 | 3 | -0.430727 | -0.348756 |
49997 | 9999 | 344 | -0.959677 | 2 | 1 | 3 | -0.430727 | -0.348756 |
49998 | 9999 | 344 | 0.068173 | 3 | 1 | 3 | -0.430727 | -0.348756 |
49999 | 9999 | 344 | -0.733225 | 4 | 1 | 3 | -0.430727 | -0.348756 |
50000 rows × 8 columns
Columns
BipartitePandas includes seven pre-defined general columns:
Required
i: worker id (any type)
j: firm id (any type)
y: income (float or int)
Optional
t: time (int)
g: firm type (any type)
w: weight (float or int)
m: move indicator (int)
Constructing DataFrames
How do we construct a dataframe? Just use the required columns (plus any optional columns you want to include)!
[3]:
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)
 | i | j | y | t |
---|---|---|---|---|
0 | 0 | 416 | -0.935824 | 0 |
1 | 0 | 416 | 0.903535 | 1 |
2 | 0 | 416 | -0.466674 | 2 |
3 | 0 | 416 | 0.163563 | 3 |
4 | 0 | 416 | 0.602699 | 4 |
... | ... | ... | ... | ... |
49995 | 9999 | 344 | -0.781453 | 0 |
49996 | 9999 | 344 | 0.008461 | 1 |
49997 | 9999 | 344 | -0.959677 | 2 |
49998 | 9999 | 344 | 0.068173 | 3 |
49999 | 9999 | 344 | -0.733225 | 4 |
50000 rows × 4 columns
Now that we have our dataframe, let’s check out some summary statistics:
[4]:
bdf.summary()
format: 'BipartiteLong'
number of workers: 10000
number of firms: 999
number of observations: 50000
mean wage: 0.008291372114450858
median wage: 0.010194526935672465
min wage: -5.6066321450713374
max wage: 6.086789605058163
var(wage): 2.64936416420792
no NaN values: False
no duplicates: False
i-t (worker-year) observations unique (None if t column(s) not included): False
no returns (None if not yet computed): None
contiguous 'i' ids (None if not included): False
contiguous 'j' ids (None if not included): False
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): None
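To make a few of these entries concrete, here is a minimal sketch that computes similar checks with plain pandas on a made-up toy table. It is not how bdf.summary() is implemented; the toy values, the checks dictionary, and the reading of "contiguous" as "ids are exactly 0, 1, ..., n - 1" are assumptions for illustration.

```python
import pandas as pd

# Toy long-format data using the column names from above (values are made up).
toy = pd.DataFrame({
    'i': [0, 0, 0, 2, 2],
    'j': [10, 10, 11, 11, 11],
    'y': [1.0, 1.5, 0.5, 2.0, None],
    't': [0, 0, 1, 0, 1],
})

checks = {
    'number of workers': toy['i'].nunique(),
    'number of firms': toy['j'].nunique(),
    'number of observations': len(toy),
    'no NaN values': not toy.isna().any().any(),
    'no duplicates': not toy.duplicated().any(),
    # A worker should appear at most once per period.
    'i-t observations unique': not toy.duplicated(subset=['i', 't']).any(),
    # Assumption: ids count as contiguous if they are exactly 0, 1, ..., n - 1.
    "contiguous 'i' ids": set(toy['i']) == set(range(toy['i'].nunique())),
}

for name, value in checks.items():
    print(f'{name}: {value}')
```

In this toy table, worker 0 appears twice in period 0 and one wage is missing, so several checks fail, much like the uncleaned simulated data above.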
Let’s clean our data and make sure the result is leave-one-observation-out connected.
Hint
Want details on all cleaning parameters? Run bpd.clean_params().describe_all(), or search through bpd.clean_params().keys() for a particular key, and then run bpd.clean_params().describe(key).
[5]:
bdf = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation'
        }
    )
)
display(bdf)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how='leave_out_observation')
making 'i' ids contiguous
making 'j' ids contiguous
sorting columns
resetting index
 | i | j | y | t | m |
---|---|---|---|---|---|
0 | 0 | 0 | 1.306939 | 0 | 0 |
1 | 0 | 0 | -0.005591 | 1 | 0 |
2 | 0 | 0 | -0.192813 | 2 | 0 |
3 | 0 | 0 | 2.537212 | 3 | 0 |
4 | 0 | 0 | 1.756664 | 4 | 0 |
... | ... | ... | ... | ... | ... |
44576 | 8962 | 797 | -0.781453 | 0 | 0 |
44577 | 8962 | 797 | 0.008461 | 1 | 0 |
44578 | 8962 | 797 | -0.959677 | 2 | 0 |
44579 | 8962 | 797 | 0.068173 | 3 | 0 |
44580 | 8962 | 797 | -0.733225 | 4 | 0 |
44581 rows × 5 columns
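Some of the steps in the cleaning log can be mimicked with plain pandas. The sketch below is an illustration on made-up data, not the library's implementation: it sorts rows, drops NaN observations, keeps the highest-paying job for each worker-period pair, and relabels ids as 0, 1, 2, .... It skips the 'm' column, the handling of returns, and the connectedness step, and the toy and cleaned names are hypothetical.

```python
import pandas as pd

# Toy data: worker 7 holds two jobs in period 0, and worker 3 has a missing wage.
toy = pd.DataFrame({
    'i': [7, 7, 7, 3, 3],
    'j': [50, 60, 50, 60, 60],
    'y': [1.0, 2.5, 0.5, None, 2.0],
    't': [0, 0, 1, 0, 1],
})

# Sort rows by worker and time.
cleaned = toy.sort_values(['i', 't']).reset_index(drop=True)

# Drop NaN observations.
cleaned = cleaned.dropna()

# Keep the highest-paying job for each (worker, period) pair.
cleaned = cleaned.loc[cleaned.groupby(['i', 't'])['y'].idxmax()]

# Relabel worker and firm ids as 0, 1, 2, ... in order of first appearance.
cleaned = cleaned.assign(
    i=pd.factorize(cleaned['i'])[0],
    j=pd.factorize(cleaned['j'])[0],
).reset_index(drop=True)

print(cleaned)
```

Three rows survive here: the missing-wage row is dropped, and worker 7's lower-paying job in period 0 is discarded in favor of the one paying 2.5.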
We can check how the summary statistics changed:
[6]:
bdf.summary()
format: 'BipartiteLong'
number of workers: 8963
number of firms: 874
number of observations: 44581
mean wage: 0.0043933052888950955
median wage: 0.005410443676386101
min wage: -5.6066321450713374
max wage: 6.086789605058163
var(wage): 2.658089932617942
no NaN values: True
no duplicates: True
i-t (worker-year) observations unique (None if t column(s) not included): True
no returns (None if not yet computed): True
contiguous 'i' ids (None if not included): True
contiguous 'j' ids (None if not included): True
contiguous 'g' ids (None if not included): None
connectedness (None if ignoring connectedness): 'leave_out_observation'
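For intuition about the connectedness entry, the sketch below finds the largest ordinary connected set of firms in a toy worker-firm graph using breadth-first search. Leave-one-observation-out connectedness is stricter (roughly, the set should stay connected when any single observation is removed), and this sketch does not implement it. The function name, the toy data, and the choice to measure a component's size by its number of firms are all assumptions made for illustration.

```python
from collections import defaultdict, deque

def largest_connected_firms(observations):
    """Return the firms in the largest connected component of the worker-firm graph."""
    # Build an undirected bipartite graph; tag nodes so worker and firm ids cannot clash.
    graph = defaultdict(set)
    for worker, firm in observations:
        graph[('i', worker)].add(('j', firm))
        graph[('j', firm)].add(('i', worker))

    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from this node.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        firms = {name for kind, name in component if kind == 'j'}
        # Assumption: component size is measured by the number of firms.
        if len(firms) > len(best):
            best = firms
    return best

# (worker, firm) pairs: firms A and B are linked by worker 0; C, D, E by workers 3 and 4.
toy_observations = [
    (0, 'A'), (0, 'B'), (1, 'B'),
    (2, 'C'), (3, 'C'), (3, 'D'), (4, 'D'), (4, 'E'),
]
print(sorted(largest_connected_firms(toy_observations)))  # ['C', 'D', 'E']
```

Firms are linked only through workers observed at more than one firm, which is why a low move probability such as the p_move of 0.05 used in the simulation can leave some firms outside the largest connected set.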