Connected sets

Import the BipartitePandas package

Make sure to install it using pip install bipartitepandas.

[1]:
import bipartitepandas as bpd

Get your data ready

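For this notebook, we simulate data. We set a small 'firm_size' and a low 'p_move' so that the stricter connectedness options actually drop observations, which makes the connected sets interesting.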

[2]:
df = bpd.SimBipartite(
    bpd.sim_params(
        {
            'firm_size': 10,
            'p_move': 0.05
        }
    )
).simulate()
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)
i j y t
0 0 164 -1.255266 0
1 0 164 0.395999 1
2 0 164 -1.273443 2
3 0 164 -1.404883 3
4 0 164 -0.783187 4
... ... ... ... ...
49995 9999 786 -0.510697 0
49996 9999 786 1.994076 1
49997 9999 786 2.570977 2
49998 9999 786 0.535605 3
49999 9999 786 -0.202017 4

50000 rows × 4 columns

Computing connected sets

There are eight connectedness options:

  • None

  • Connected

  • Strongly connected

  • Leave-out-observation

  • Leave-out-spell

  • Leave-out-match

  • Leave-out-worker

  • Leave-out-firm

These are specified in the cleaning parameters dictionary under the key 'connectedness'. We will demonstrate 'connectedness' = None and 'connectedness' = 'leave_out_observation'.
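To make the 'connected' option concrete, here is a minimal sketch in plain Python. It is independent of BipartitePandas and is not the package's implementation. It treats workers and firms as nodes of a graph, links each worker to every firm they are observed at, and keeps the observations that fall in the largest component. The function name largest_connected_set, the toy data, and the choice to rank components by their number of firms are assumptions made for illustration.

```python
from collections import defaultdict


def largest_connected_set(obs):
    """Keep the (worker, firm) rows whose firm is in the largest component.

    Illustrative only: components are ranked by number of firms, which is
    an assumption of this sketch.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each observation links a worker node to a firm node.
    for worker, firm in obs:
        parent[find(('w', worker))] = find(('f', firm))

    # Collect the firms belonging to each component.
    firms_by_root = defaultdict(set)
    for _, firm in obs:
        firms_by_root[find(('f', firm))].add(firm)

    keep = max(firms_by_root.values(), key=len)
    return [(w, f) for w, f in obs if f in keep]


# Toy data: firms A and B are linked by worker 0; firms C, D and E by worker 3.
toy_obs = [(0, 'A'), (0, 'B'), (1, 'B'), (2, 'C'), (3, 'C'), (3, 'D'), (3, 'E')]
print(largest_connected_set(toy_obs))
# [(2, 'C'), (3, 'C'), (3, 'D'), (3, 'E')]
```

The leave-out options are stricter versions of the same idea: the set must stay connected even after removing any one observation, spell, match, worker, or firm.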

Note

Leave-out-spell and leave-out-match are distinguished by workers who leave a firm then return to it.
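The difference is easiest to see on a single hypothetical job history. The sketch below uses plain Python rather than BipartitePandas, and the variable names and data are made up for illustration: a spell is taken to be an unbroken run of periods at one firm, and a match is a distinct worker-firm pair.

```python
from itertools import groupby

# Hypothetical job history for one worker, one firm id per period:
# the worker starts at firm 'A', moves to 'B', then returns to 'A'.
firms_over_time = ['A', 'A', 'B', 'A']

# A spell is an unbroken run of periods at the same firm.
spells = [firm for firm, _ in groupby(firms_over_time)]

# A match is a distinct worker-firm pair, however many spells it spans.
matches = sorted(set(firms_over_time))

print(spells)   # ['A', 'B', 'A']
print(matches)  # ['A', 'B']
```

Dropping one spell at firm 'A' still leaves this worker linked to 'A' through the other spell, whereas dropping the match with 'A' removes both spells at once.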

Note

Stayers who have only a single observation after computing the largest connected set can be dropped by specifying 'drop_single_stayers' = True in your cleaning parameters dictionary.
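As a rough illustration of which rows this option targets, the sketch below filters a toy list of (worker, firm) observations in plain Python. It is not the package's code: the helper name drop_single_observation_stayers and the data are hypothetical, and a stayer is assumed to mean a worker observed at only one firm.

```python
from collections import defaultdict


def drop_single_observation_stayers(obs):
    """Drop workers who are seen at exactly one firm and have exactly one row."""
    firms = defaultdict(set)
    n_obs = defaultdict(int)
    for worker, firm in obs:
        firms[worker].add(firm)
        n_obs[worker] += 1
    return [
        (w, f) for w, f in obs
        if not (len(firms[w]) == 1 and n_obs[w] == 1)
    ]


# Worker 0 is a mover, worker 1 is a stayer with one row,
# worker 2 is a stayer with two rows.
stayer_obs = [(0, 'A'), (0, 'B'), (1, 'A'), (2, 'B'), (2, 'B')]
print(drop_single_observation_stayers(stayer_obs))
# [(0, 'A'), (0, 'B'), (2, 'B'), (2, 'B')]
```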

Warning

Connectedness is not necessarily maintained between non-collapsed and collapsed formats. Therefore, if you plan to use connected, collapsed data, it is recommended to choose the connectedness measure that matches the level at which you would like to collapse your data, and to set 'collapse_at_connectedness_measure' = True in your cleaning parameters dictionary. An example is given below.
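A toy case shows why collapsing can break connectedness. The sketch below is plain Python, not BipartitePandas code, and it uses a simplified brute-force definition that is an assumption of this example: data are leave-out-observation connected if, after deleting any single row, all of the original firms are still linked through shared workers. The function names and data are hypothetical.

```python
from itertools import groupby


def firms_connected(obs, firms):
    """True if all `firms` are linked to each other through shared workers."""
    parent = {f: f for f in firms}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_firm = {}
    for worker, firm in obs:
        if worker in first_firm:
            # A worker seen at two firms links those firms.
            parent[find(firm)] = find(first_firm[worker])
        else:
            first_firm[worker] = firm
    return len({find(f) for f in firms}) == 1


def leave_out_observation_connected(obs):
    """Brute force: delete each row in turn and re-check connectedness."""
    firms = {firm for _, firm in obs}
    return all(
        firms_connected(obs[:k] + obs[k + 1:], firms)
        for k in range(len(obs))
    )


def collapse_spells(obs):
    """Merge consecutive rows of the same worker at the same firm."""
    return [key for key, _ in groupby(obs)]


# Worker 0 spends two periods at firm A, then two at firm B.
# Workers 1 and 2 are stayers at A and B respectively.
long_obs = [(0, 'A'), (0, 'A'), (0, 'B'), (0, 'B'),
            (1, 'A'), (1, 'A'), (2, 'B'), (2, 'B')]
collapsed_obs = collapse_spells(long_obs)

print(leave_out_observation_connected(long_obs))       # True
print(leave_out_observation_connected(collapsed_obs))  # False
```

Before collapsing, worker 0 has two rows at each firm, so deleting any one row leaves the link between A and B intact. After collapsing, each spell is a single row, and deleting either of worker 0's rows cuts the only link.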

'connectedness' = None

[3]:
conn_none = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': None,
            'verbose': False
        }
    )
)
display(conn_none)
i j y t m
0 0 164 -1.255266 0 0
1 0 164 0.395999 1 0
2 0 164 -1.273443 2 0
3 0 164 -1.404883 3 0
4 0 164 -0.783187 4 0
... ... ... ... ... ...
49995 9999 786 -0.510697 0 0
49996 9999 786 1.994076 1 0
49997 9999 786 2.570977 2 0
49998 9999 786 0.535605 3 0
49999 9999 786 -0.202017 4 0

50000 rows × 5 columns

'connectedness' = 'leave_out_observation'

[4]:
conn_loo = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation',
            'verbose': False
        }
    )
)
display(conn_loo)
i j y t m
0 0 0 -1.255266 0 0
1 0 0 0.395999 1 0
2 0 0 -1.273443 2 0
3 0 0 -1.404883 3 0
4 0 0 -0.783187 4 0
... ... ... ... ... ...
44856 9014 876 -0.510697 0 0
44857 9014 876 1.994076 1 0
44858 9014 876 2.570977 2 0
44859 9014 876 0.535605 3 0
44860 9014 876 -0.202017 4 0

44861 rows × 5 columns

Connected sets for collapsed data

As mentioned above, connectedness is not necessarily maintained between non-collapsed and collapsed formats.

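Here we show an example that demonstrates this: collapsing data that was cleaned in non-collapsed format (cell [5]) leaves 10896 rows, whereas cleaning after collapsing (cell [6]) leaves only 10883, so the first result is not leave-out-observation connected at the spell level. We then show how setting 'collapse_at_connectedness_measure' = True in your cleaning parameters dictionary gives the correct result in a single call.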

[5]:
coll_conn_loo_wrong = conn_loo.collapse(level='spell')
display(coll_conn_loo_wrong)
i j y t1 t2 w m
0 0 0 -0.864156 0 4 5 0
1 1 1 -2.076316 0 4 5 0
2 2 2 -0.360893 0 4 5 0
3 3 3 -1.256533 0 4 5 0
4 4 4 1.233623 1 4 4 0
... ... ... ... ... ... ... ...
10891 9011 697 0.423780 0 4 5 0
10892 9012 476 0.385739 0 4 5 0
10893 9013 796 1.546216 0 0 1 1
10894 9013 381 -0.333299 1 4 4 1
10895 9014 876 0.877589 0 4 5 0

10896 rows × 7 columns

[6]:
coll_conn_loo_right_1 = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': None,
            'verbose': False
        }
    )
).collapse(level='spell').clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation',
            'verbose': False
        }
    )
)
display(coll_conn_loo_right_1)
i j y t1 t2 w m
0 0 0 -0.864156 0 4 5.0 0
1 1 1 -2.076316 0 4 5.0 0
2 2 2 -0.360893 0 4 5.0 0
3 3 3 -1.256533 0 4 5.0 0
4 4 4 1.233623 1 4 4.0 0
... ... ... ... ... ... ... ...
10878 8999 696 0.423780 0 4 5.0 0
10879 9000 475 0.385739 0 4 5.0 0
10880 9001 795 1.546216 0 0 1.0 1
10881 9001 380 -0.333299 1 4 4.0 1
10882 9002 875 0.877589 0 4 5.0 0

10883 rows × 7 columns

Simpler code

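Instead of cleaning, collapsing, then cleaning again, we can do it all at once by specifying 'connectedness' = 'leave_out_spell' (or 'leave_out_match') and 'collapse_at_connectedness_measure' = True. Each row of spell-collapsed data is one spell, so the result matches the previous cell (10883 rows).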

[7]:
coll_conn_loo_right_2 = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_spell',
            'collapse_at_connectedness_measure': True,
            'verbose': False
        }
    )
)
display(coll_conn_loo_right_2)
i j y t1 t2 w m
0 0 0 -0.864156 0 4 5 0
1 1 1 -2.076316 0 4 5 0
2 2 2 -0.360893 0 4 5 0
3 3 3 -1.256533 0 4 5 0
4 4 4 1.233623 1 4 4 0
... ... ... ... ... ... ... ...
10878 8999 696 0.423780 0 4 5 0
10879 9000 475 0.385739 0 4 5 0
10880 9001 795 1.546216 0 0 1 1
10881 9001 380 -0.333299 1 4 4 1
10882 9002 875 0.877589 0 4 5 0

10883 rows × 7 columns