Connected sets
Import the BipartitePandas package
Make sure to install it first using pip install bipartitepandas.
Get your data ready
For this notebook, we simulate data (we set parameters to make the connected sets interesting).
|       | i    | j   | y         | t   |
|-------|------|-----|-----------|-----|
| 0     | 0    | 164 | -1.255266 | 0   |
| 1     | 0    | 164 | 0.395999  | 1   |
| 2     | 0    | 164 | -1.273443 | 2   |
| 3     | 0    | 164 | -1.404883 | 3   |
| 4     | 0    | 164 | -0.783187 | 4   |
| ...   | ...  | ... | ...       | ... |
| 49995 | 9999 | 786 | -0.510697 | 0   |
| 49996 | 9999 | 786 | 1.994076  | 1   |
| 49997 | 9999 | 786 | 2.570977  | 2   |
| 49998 | 9999 | 786 | 0.535605  | 3   |
| 49999 | 9999 | 786 | -0.202017 | 4   |

50000 rows × 4 columns
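The table above is in long format: one row per worker-period, with worker id i, firm id j, outcome y, and period t. As a minimal sketch of what such data looks like, the following builds a tiny panel using only the standard library. The helper simulate_panel and all of its parameters are hypothetical names chosen for illustration; this is not the package's simulator and it does not reproduce the numbers shown above.

```python
import random


def simulate_panel(n_workers=6, n_firms=3, n_periods=5, p_move=0.3, seed=0):
    """Return a list of (i, j, y, t) rows: worker id, firm id, outcome, period.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_workers):
        j = rng.randrange(n_firms)  # firm in the first period
        for t in range(n_periods):
            if t > 0 and rng.random() < p_move:
                j = rng.randrange(n_firms)  # redraw the firm (may be the same one)
            y = rng.gauss(0.0, 1.0)  # outcome, e.g. a log wage
            rows.append((i, j, y, t))
    return rows


rows = simulate_panel()
print(len(rows))  # 30 rows: n_workers * n_periods
```

Every worker contributes exactly n_periods rows, which is why the simulated table above has 50000 rows for worker ids 0 through 9999 and periods 0 through 4.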
Computing connected sets
There are eight connectedness options:
None
Connected
Strongly connected
Leave-out-observation
Leave-out-spell
Leave-out-match
Leave-out-worker
Leave-out-firm
These are specified in the cleaning parameters dictionary under the key 'connectedness'. We will demonstrate 'connectedness' = None and 'connectedness' = 'leave_out_observation'.
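To make the simplest non-trivial option concrete, here is a minimal sketch of finding the largest connected set of firms, where two firms are linked whenever at least one worker is observed at both. It uses a union-find structure from scratch; the function name largest_connected_set is hypothetical, and measuring a set's size by its number of firms is an assumption of this sketch, not a statement about how the package does it.

```python
from collections import Counter


def largest_connected_set(rows):
    """Keep the (i, j) rows whose firm lies in the largest connected set.

    rows are (worker id, firm id) pairs. Size is counted in firms
    (an assumption made for this sketch).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    first_firm = {}  # first firm seen for each worker
    for i, j in rows:
        find(j)  # register the firm
        if i in first_firm:
            union(first_firm[i], j)  # the worker links these two firms
        else:
            first_firm[i] = j

    firms = {j for _, j in rows}
    sizes = Counter(find(j) for j in firms)
    best_root = max(sizes, key=sizes.get)
    return [(i, j) for i, j in rows if find(j) == best_root]


# Firms 0, 1, 2 are linked by workers 0 and 1; firms 3, 4 by worker 3.
toy = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 3), (2, 3), (3, 4), (3, 3)]
kept = largest_connected_set(toy)
print(kept)  # [(0, 0), (0, 1), (1, 1), (1, 2)]
```

The leave-out options are stricter: they additionally require that the set stays connected after some unit (an observation, spell, match, worker, or firm) is removed.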
Note
Leave-out-spell and leave-out-match are distinguished by workers who leave a firm then return to it.
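The distinction can be seen on a single worker's firm history. The sketch below labels each period with a spell id (an unbroken run at one firm) and a match id (the worker-firm pair, regardless of interruptions). The function spell_and_match_ids is a hypothetical helper written for this note, not part of the package.

```python
def spell_and_match_ids(firms):
    """Label each period of one worker's firm history.

    Returns two lists: spell ids (new id whenever the firm changes)
    and match ids (one id per distinct firm).
    """
    spell_ids, match_ids = [], []
    seen = {}  # firm id -> match id
    spell = -1
    previous = object()  # sentinel that never equals a firm id
    for j in firms:
        if j != previous:
            spell += 1
            previous = j
        spell_ids.append(spell)
        match_ids.append(seen.setdefault(j, len(seen)))
    return spell_ids, match_ids


# The worker is at firm 7, moves to firm 3, then returns to firm 7.
spells, matches = spell_and_match_ids([7, 7, 3, 7])
print(spells)   # [0, 0, 1, 2] -> three spells
print(matches)  # [0, 0, 1, 0] -> two matches
```

For a worker who never returns to a previous firm, the two labelings coincide, so leaving out a spell and leaving out a match remove the same observations.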
Note
Stayers who have only a single observation after computing the largest connected set can be dropped by specifying 'drop_single_stayers' = True in your cleaning parameters dictionary.
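As a rough illustration of what this option removes: a worker with exactly one remaining observation cannot have moved, so that worker is a stayer with a single observation. The sketch below filters such workers from (i, j) rows. The function name is hypothetical and this is not the package's code.

```python
from collections import Counter


def drop_single_observation_workers(rows):
    """Drop workers that appear in exactly one (i, j) row."""
    counts = Counter(i for i, _ in rows)
    return [(i, j) for i, j in rows if counts[i] > 1]


# Worker 1 has a single observation and is dropped.
toy = [(0, 0), (0, 1), (1, 1), (2, 0), (2, 0)]
print(drop_single_observation_workers(toy))
# [(0, 0), (0, 1), (2, 0), (2, 0)]
```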
Warning
Connectedness is not necessarily maintained between non-collapsed and collapsed formats. Therefore, if you plan to use connected, collapsed data, it is recommended to set the connectedness level at the level at which you would like to collapse your data, and to set 'collapse_at_connectedness_measure' = True in your cleaning parameters dictionary. An example is given below.
'connectedness' = None
|       | i    | j   | y         | t   | m   |
|-------|------|-----|-----------|-----|-----|
| 0     | 0    | 164 | -1.255266 | 0   | 0   |
| 1     | 0    | 164 | 0.395999  | 1   | 0   |
| 2     | 0    | 164 | -1.273443 | 2   | 0   |
| 3     | 0    | 164 | -1.404883 | 3   | 0   |
| 4     | 0    | 164 | -0.783187 | 4   | 0   |
| ...   | ...  | ... | ...       | ... | ... |
| 49995 | 9999 | 786 | -0.510697 | 0   | 0   |
| 49996 | 9999 | 786 | 1.994076  | 1   | 0   |
| 49997 | 9999 | 786 | 2.570977  | 2   | 0   |
| 49998 | 9999 | 786 | 0.535605  | 3   | 0   |
| 49999 | 9999 | 786 | -0.202017 | 4   | 0   |

50000 rows × 5 columns
'connectedness' = 'leave_out_observation'
|       | i    | j   | y         | t   | m   |
|-------|------|-----|-----------|-----|-----|
| 0     | 0    | 0   | -1.255266 | 0   | 0   |
| 1     | 0    | 0   | 0.395999  | 1   | 0   |
| 2     | 0    | 0   | -1.273443 | 2   | 0   |
| 3     | 0    | 0   | -1.404883 | 3   | 0   |
| 4     | 0    | 0   | -0.783187 | 4   | 0   |
| ...   | ...  | ... | ...       | ... | ... |
| 44856 | 9014 | 876 | -0.510697 | 0   | 0   |
| 44857 | 9014 | 876 | 1.994076  | 1   | 0   |
| 44858 | 9014 | 876 | 2.570977  | 2   | 0   |
| 44859 | 9014 | 876 | 0.535605  | 3   | 0   |
| 44860 | 9014 | 876 | -0.202017 | 4   | 0   |

44861 rows × 5 columns
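With 'leave_out_observation' the table shrinks from 50000 to 44861 rows, because only observations in the largest leave-out-observation connected set are kept. One reading of that property is that all firms stay linked no matter which single observation is dropped. The sketch below checks this by brute force on toy (i, j) rows. Both function names are hypothetical, the check is far too slow for real data, and it only tests a given set rather than finding the largest one, so it should not be taken as the package's algorithm.

```python
def all_firms_linked(rows, firms):
    """True if every firm in `firms` is reachable from the others
    through workers observed at more than one firm in `rows`."""
    if not firms:
        return True
    firms_of, workers_of = {}, {}
    for i, j in rows:
        firms_of.setdefault(i, set()).add(j)
        workers_of.setdefault(j, set()).add(i)
    start = next(iter(firms))
    seen, stack = {start}, [start]
    while stack:
        j = stack.pop()
        for i in workers_of.get(j, ()):  # workers observed at firm j
            for other in firms_of[i]:    # every firm those workers visit
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen == set(firms)


def survives_dropping_any_observation(rows):
    """Brute-force check: drop each row in turn and re-test the links."""
    firms = {j for _, j in rows}
    return all(
        all_firms_linked(rows[:k] + rows[k + 1:], firms)
        for k in range(len(rows))
    )


# One mover with one observation per firm: any dropped row breaks the link.
fragile = [(0, 0), (0, 1)]
# Two movers between the same firms: one link always survives.
robust = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(survives_dropping_any_observation(fragile))  # False
print(survives_dropping_any_observation(robust))   # True
```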
Connected sets for collapsed data
As mentioned above, connectedness is not necessarily maintained between non-collapsed and collapsed formats.
Here we show an example that demonstrates this, then show how setting 'collapse_at_connectedness_measure' = True in your cleaning parameters dictionary gives the correct result in a single step.
After cleaning and then collapsing:

|       | i    | j   | y         | t1  | t2  | w   | m   |
|-------|------|-----|-----------|-----|-----|-----|-----|
| 0     | 0    | 0   | -0.864156 | 0   | 4   | 5   | 0   |
| 1     | 1    | 1   | -2.076316 | 0   | 4   | 5   | 0   |
| 2     | 2    | 2   | -0.360893 | 0   | 4   | 5   | 0   |
| 3     | 3    | 3   | -1.256533 | 0   | 4   | 5   | 0   |
| 4     | 4    | 4   | 1.233623  | 1   | 4   | 4   | 0   |
| ...   | ...  | ... | ...       | ... | ... | ... | ... |
| 10891 | 9011 | 697 | 0.423780  | 0   | 4   | 5   | 0   |
| 10892 | 9012 | 476 | 0.385739  | 0   | 4   | 5   | 0   |
| 10893 | 9013 | 796 | 1.546216  | 0   | 0   | 1   | 1   |
| 10894 | 9013 | 381 | -0.333299 | 1   | 4   | 4   | 1   |
| 10895 | 9014 | 876 | 0.877589  | 0   | 4   | 5   | 0   |

10896 rows × 7 columns
After cleaning the collapsed data again:

|       | i    | j   | y         | t1  | t2  | w   | m   |
|-------|------|-----|-----------|-----|-----|-----|-----|
| 0     | 0    | 0   | -0.864156 | 0   | 4   | 5.0 | 0   |
| 1     | 1    | 1   | -2.076316 | 0   | 4   | 5.0 | 0   |
| 2     | 2    | 2   | -0.360893 | 0   | 4   | 5.0 | 0   |
| 3     | 3    | 3   | -1.256533 | 0   | 4   | 5.0 | 0   |
| 4     | 4    | 4   | 1.233623  | 1   | 4   | 4.0 | 0   |
| ...   | ...  | ... | ...       | ... | ... | ... | ... |
| 10878 | 8999 | 696 | 0.423780  | 0   | 4   | 5.0 | 0   |
| 10879 | 9000 | 475 | 0.385739  | 0   | 4   | 5.0 | 0   |
| 10880 | 9001 | 795 | 1.546216  | 0   | 0   | 1.0 | 1   |
| 10881 | 9001 | 380 | -0.333299 | 1   | 4   | 4.0 | 1   |
| 10882 | 9002 | 875 | 0.877589  | 0   | 4   | 5.0 | 0   |

10883 rows × 7 columns
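In the collapsed tables above, each row is a spell with a start period t1, an end period t2, and a weight w that appears to count the observations in the spell, while y looks like the spell mean. A minimal sketch of such a collapse, assuming rows sorted by worker and then period, is below. The function collapse_spells is a hypothetical helper, not the package's collapse method. The first worker's y values are taken from the uncollapsed table earlier on this page, and their mean matches the -0.864156 in the first collapsed row.

```python
def collapse_spells(rows):
    """Collapse (i, j, y, t) rows, sorted by worker then period, into
    one (i, j, mean y, t1, t2, w) tuple per spell."""

    def finish(spell):
        i, j, sum_y, t1, t2, w = spell
        return (i, j, sum_y / w, t1, t2, w)

    collapsed = []
    current = None  # running spell: [i, j, sum of y, t1, t2, w]
    for i, j, y, t in rows:
        if current is not None and current[0] == i and current[1] == j:
            current[2] += y  # same worker and firm: extend the spell
            current[4] = t
            current[5] += 1
        else:
            if current is not None:
                collapsed.append(finish(current))
            current = [i, j, y, t, t, 1]  # start a new spell
    if current is not None:
        collapsed.append(finish(current))
    return collapsed


rows = [
    (0, 0, -1.255266, 0), (0, 0, 0.395999, 1), (0, 0, -1.273443, 2),
    (0, 0, -1.404883, 3), (0, 0, -0.783187, 4),
    (1, 2, 1.0, 0), (1, 3, 3.0, 1),  # made-up mover with two spells
]
spells = collapse_spells(rows)
print(len(spells))            # 3
print(round(spells[0][2], 6)) # -0.864156
```

Because collapsing turns several observations into one row, a link that survives dropping one observation may not survive dropping one collapsed row, which is why the second cleaning pass above removes further rows.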
Simpler code
Instead of cleaning, collapsing, then cleaning again, we can do it all at once by specifying 'connectedness' = 'leave_out_spell' (or 'leave_out_match') and 'collapse_at_connectedness_measure' = True.
|       | i    | j   | y         | t1  | t2  | w   | m   |
|-------|------|-----|-----------|-----|-----|-----|-----|
| 0     | 0    | 0   | -0.864156 | 0   | 4   | 5   | 0   |
| 1     | 1    | 1   | -2.076316 | 0   | 4   | 5   | 0   |
| 2     | 2    | 2   | -0.360893 | 0   | 4   | 5   | 0   |
| 3     | 3    | 3   | -1.256533 | 0   | 4   | 5   | 0   |
| 4     | 4    | 4   | 1.233623  | 1   | 4   | 4   | 0   |
| ...   | ...  | ... | ...       | ... | ... | ... | ... |
| 10878 | 8999 | 696 | 0.423780  | 0   | 4   | 5   | 0   |
| 10879 | 9000 | 475 | 0.385739  | 0   | 4   | 5   | 0   |
| 10880 | 9001 | 795 | 1.546216  | 0   | 0   | 1   | 1   |
| 10881 | 9001 | 380 | -0.333299 | 1   | 4   | 4   | 1   |
| 10882 | 9002 | 875 | 0.877589  | 0   | 4   | 5   | 0   |

10883 rows × 7 columns
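For reference, the one-step settings described in this section can be gathered in a single dictionary. The keys and values below are the ones quoted in the text; the variable name one_step_params is arbitrary, and the snippet deliberately does not call the package. Wrapping the dictionary with bpd.clean_params(...) and passing it to the clean method of a BipartiteDataFrame is an assumption about typical usage that is not shown on this page, so check it against the package documentation.

```python
# Settings quoted in the text; nothing here calls the library.
one_step_params = {
    'connectedness': 'leave_out_spell',         # or 'leave_out_match'
    'collapse_at_connectedness_measure': True,  # collapse at that same level
    'drop_single_stayers': True,                # optional, see the earlier note
}

# Assumed usage (not run here): wrap the dictionary with
# bpd.clean_params(...) and pass the result to .clean().
for key, value in one_step_params.items():
    print(f"{key!r} = {value!r}")
```

The resulting table has the same 10883 rows as the clean, collapse, clean sequence shown earlier.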