Connected sets

Import the BipartitePandas package

If it is not already installed, install it with pip install bipartitepandas.

[1]:

import bipartitepandas as bpd

Get your data ready

For this notebook, we simulate data. We set 'firm_size' and 'p_move' so that the connected sets are interesting, meaning that not every observation falls into the same connected set.

[2]:

df = bpd.SimBipartite(
    bpd.sim_params(
        {
            'firm_size': 10,
            'p_move': 0.05
        }
    )
).simulate()
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
)
display(bdf)

	i	j	y	t
0	0	164	-1.255266	0
1	0	164	0.395999	1
2	0	164	-1.273443	2
3	0	164	-1.404883	3
4	0	164	-0.783187	4
...	...	...	...	...
49995	9999	786	-0.510697	0
49996	9999	786	1.994076	1
49997	9999	786	2.570977	2
49998	9999	786	0.535605	3
49999	9999	786	-0.202017	4

50000 rows × 4 columns

Computing connected sets

There are eight connectedness options:

None
Connected
Strongly connected
Leave-out-observation
Leave-out-spell
Leave-out-match
Leave-out-worker
Leave-out-firm

These are specified in the cleaning parameters dictionary under the key 'connectedness'. We will demonstrate 'connectedness' = None and 'connectedness' = 'leave_out_observation'.
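
To make the plain 'connected' option concrete, the following is a minimal sketch that does not use BipartitePandas. It treats two firms as linked whenever one worker is observed at both, and finds the largest group of linked firms with a union-find structure. The function name largest_connected_firms and the toy data are hypothetical, and the package's own implementation may differ.

```python
from collections import defaultdict


def largest_connected_firms(pairs):
    """Return the largest set of firms linked to each other through shared workers.

    `pairs` is an iterable of (worker_id, firm_id) observations.
    """
    parent = {}

    def find(x):
        # Register x on first sight, then walk up to its root (with path halving)
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_firm = {}
    for worker, firm in pairs:
        find(firm)  # make sure every firm is registered, even if no one moves
        if worker in first_firm:
            # The worker links this firm to the first firm they were seen at
            union(first_firm[worker], firm)
        else:
            first_firm[worker] = firm

    components = defaultdict(set)
    for firm in parent:
        components[find(firm)].add(firm)
    return max(components.values(), key=len)


# Hypothetical toy data: worker 0 moves A -> B, worker 1 moves B -> C,
# and worker 2 stays at D, so D is cut off from the other firms.
toy_pairs = [(0, 'A'), (0, 'B'), (1, 'B'), (1, 'C'), (2, 'D')]
largest = largest_connected_firms(toy_pairs)
print(sorted(largest))  # ['A', 'B', 'C']
```

The leave-out options are stricter: roughly, they require the set to stay connected even after a single observation, spell, match, worker, or firm is removed.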

Note

Leave-out-spell and leave-out-match are distinguished by workers who leave a firm then return to it.
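
As a small illustration of this distinction, the sketch below uses only the Python standard library and a hypothetical firm history for one worker. It assumes that a spell is an uninterrupted run of periods at one firm, and that a match is a distinct worker-firm pair regardless of interruptions.

```python
from itertools import groupby

# Hypothetical firm history for one worker, ordered by period:
# the worker starts at firm A, moves to B, then returns to A.
firm_history = ['A', 'A', 'B', 'A', 'A']

# Spells: consecutive periods at the same firm are grouped together,
# so the return to A starts a new spell.
spells = [firm for firm, _ in groupby(firm_history)]

# Matches: each distinct firm counts once for this worker,
# so both stints at A belong to the same match.
matches = sorted(set(firm_history))

print(spells)   # ['A', 'B', 'A']
print(matches)  # ['A', 'B']
```

For a worker who never returns to a previous firm, the two lists have the same length, so the two options treat that worker identically.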

Note

Stayers who have only a single observation after computing the largest connected set can be dropped by specifying 'drop_single_stayers' = True in your cleaning parameters dictionary.

Warning

Connectedness is not necessarily maintained between non-collapsed and collapsed formats. Therefore, if you plan to use connected, collapsed data, it is recommended to set the connectedness measure at the level at which you would like to collapse your data, and to set 'collapse_at_connectedness_measure' = True in your cleaning parameters dictionary. An example is given below.

'connectedness' = None

[3]:

conn_none = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': None,
            'verbose': False
        }
    )
)
display(conn_none)

	i	j	y	t	m
0	0	164	-1.255266	0	0
1	0	164	0.395999	1	0
2	0	164	-1.273443	2	0
3	0	164	-1.404883	3	0
4	0	164	-0.783187	4	0
...	...	...	...	...	...
49995	9999	786	-0.510697	0	0
49996	9999	786	1.994076	1	0
49997	9999	786	2.570977	2	0
49998	9999	786	0.535605	3	0
49999	9999	786	-0.202017	4	0

50000 rows × 5 columns

'connectedness' = 'leave_out_observation'

[4]:

conn_loo = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation',
            'verbose': False
        }
    )
)
display(conn_loo)

	i	j	y	t	m
0	0	0	-1.255266	0	0
1	0	0	0.395999	1	0
2	0	0	-1.273443	2	0
3	0	0	-1.404883	3	0
4	0	0	-0.783187	4	0
...	...	...	...	...	...
44856	9014	876	-0.510697	0	0
44857	9014	876	1.994076	1	0
44858	9014	876	2.570977	2	0
44859	9014	876	0.535605	3	0
44860	9014	876	-0.202017	4	0

44861 rows × 5 columns

Connected sets for collapsed data

As mentioned above, connectedness is not necessarily maintained between non-collapsed and collapsed formats.
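
For intuition, consider the following toy sketch, which is independent of BipartitePandas. A mover with two observations at each firm still links the firms after any single observation is dropped, but once each spell is collapsed to one row, dropping a single row can break the link. The helper names and data are hypothetical, and the check is a simplified whole-sample test rather than the algorithm the package uses to find the largest leave-out-observation connected set.

```python
from itertools import groupby


def firms_connected(observations):
    """True if every firm can be reached from every other through shared workers."""
    firms = {firm for _, firm in observations}
    if len(firms) <= 1:
        return True
    # Bipartite adjacency: worker nodes on one side, firm nodes on the other
    adj = {}
    for worker, firm in observations:
        adj.setdefault(('w', worker), set()).add(('f', firm))
        adj.setdefault(('f', firm), set()).add(('w', worker))
    start = ('f', next(iter(firms)))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return all(('f', firm) in seen for firm in firms)


def survives_leave_one_out(observations):
    """True if the firms stay connected after dropping any single row.

    Simplifying assumption: a firm that disappears entirely counts as a failure.
    """
    all_firms = {firm for _, firm in observations}
    for k in range(len(observations)):
        remaining = observations[:k] + observations[k + 1:]
        remaining_firms = {firm for _, firm in remaining}
        if remaining_firms != all_firms or not firms_connected(remaining):
            return False
    return True


# Hypothetical (worker, firm) rows, sorted by worker and then by period:
# worker 0 spends two periods at A, then two at B; worker 1 stays at A.
long_format = [(0, 'A'), (0, 'A'), (0, 'B'), (0, 'B'), (1, 'A'), (1, 'A')]

# Collapse consecutive identical (worker, firm) rows into one row per spell
spell_format = [key for key, _ in groupby(long_format)]

print(survives_leave_one_out(long_format))   # True
print(survives_leave_one_out(spell_format))  # False
```

In the collapsed data, worker 0's single row at firm A is the only link between A and B, so removing it disconnects the two firms.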

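Here we show an example that demonstrates this. Collapsing data that was cleaned with 'connectedness' = 'leave_out_observation' before collapsing keeps more rows than computing the leave-out-observation connected set on data that is already collapsed. We then show how setting 'collapse_at_connectedness_measure' = True in your cleaning parameters dictionary gives the correct result in a single call.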

[5]:

coll_conn_loo_wrong = conn_loo.collapse(level='spell')
display(coll_conn_loo_wrong)

	i	j	y	t1	t2	w	m
0	0	0	-0.864156	0	4	5	0
1	1	1	-2.076316	0	4	5	0
2	2	2	-0.360893	0	4	5	0
3	3	3	-1.256533	0	4	5	0
4	4	4	1.233623	1	4	4	0
...	...	...	...	...	...	...	...
10891	9011	697	0.423780	0	4	5	0
10892	9012	476	0.385739	0	4	5	0
10893	9013	796	1.546216	0	0	1	1
10894	9013	381	-0.333299	1	4	4	1
10895	9014	876	0.877589	0	4	5	0

10896 rows × 7 columns

[6]:

coll_conn_loo_right_1 = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': None,
            'verbose': False
        }
    )
).collapse(level='spell').clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_observation',
            'verbose': False
        }
    )
)
display(coll_conn_loo_right_1)

	i	j	y	t1	t2	w	m
0	0	0	-0.864156	0	4	5.0	0
1	1	1	-2.076316	0	4	5.0	0
2	2	2	-0.360893	0	4	5.0	0
3	3	3	-1.256533	0	4	5.0	0
4	4	4	1.233623	1	4	4.0	0
...	...	...	...	...	...	...	...
10878	8999	696	0.423780	0	4	5.0	0
10879	9000	475	0.385739	0	4	5.0	0
10880	9001	795	1.546216	0	0	1.0	1
10881	9001	380	-0.333299	1	4	4.0	1
10882	9002	875	0.877589	0	4	5.0	0

10883 rows × 7 columns

Simpler code

Instead of cleaning, collapsing, then cleaning again, we can do it all at once by specifying 'connectedness' = 'leave_out_spell' (or 'leave_out_match') and 'collapse_at_connectedness_measure' = True.

[7]:

coll_conn_loo_right_2 = bdf.clean(
    bpd.clean_params(
        {
            'connectedness': 'leave_out_spell',
            'collapse_at_connectedness_measure': True,
            'verbose': False
        }
    )
)
display(coll_conn_loo_right_2)

	i	j	y	t1	t2	w	m
0	0	0	-0.864156	0	4	5	0
1	1	1	-2.076316	0	4	5	0
2	2	2	-0.360893	0	4	5	0
3	3	3	-1.256533	0	4	5	0
4	4	4	1.233623	1	4	4	0
...	...	...	...	...	...	...	...
10878	8999	696	0.423780	0	4	5	0
10879	9000	475	0.385739	0	4	5	0
10880	9001	795	1.546216	0	0	1	1
10881	9001	380	-0.333299	1	4	4	1
10882	9002	875	0.877589	0	4	5	0

10883 rows × 7 columns