Clustering

Import the BipartitePandas package

Install it first with pip install bipartitepandas.

[1]:

import bipartitepandas as bpd

Get your data ready

For this notebook, we simulate data.

[2]:

df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
).clean()
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index

	i	j	y	t	m
0	0	92	-0.112951	0	0
1	0	92	0.921270	1	0
2	0	92	-0.733555	2	0
3	0	92	1.084979	3	0
4	0	92	1.404698	4	0
...	...	...	...	...	...
49995	9999	102	2.042254	0	1
49996	9999	175	2.445431	1	2
49997	9999	183	1.524885	2	2
49998	9999	158	2.641006	3	2
49999	9999	159	2.113294	4	1

50000 rows × 5 columns

Basic clustering

Clustering in BipartitePandas sorts firms into groups (clusters) and stores each firm's group label in column g.

Clustering is simple: just run .cluster() and notice the new g column!

[3]:

bdf = bdf.cluster()
display(bdf)

	i	j	y	t	g	m
0	0	92	-0.112951	0	6	0
1	0	92	0.921270	1	6	0
2	0	92	-0.733555	2	6	0
3	0	92	1.084979	3	6	0
4	0	92	1.404698	4	6	0
...	...	...	...	...	...	...
49995	9999	102	2.042254	0	1	1
49996	9999	175	2.445431	1	4	2
49997	9999	183	1.524885	2	2	2
49998	9999	158	2.641006	3	7	2
49999	9999	159	2.113294	4	7	1

50000 rows × 6 columns
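Because g is a firm-level label, every row of a given firm should carry the same value. The short sketch below checks that and counts distinct firms per group. It is plain Python, so it runs without BipartitePandas installed; the (j, g) pairs are copied from the ten rows displayed above, and the variable names are illustrative only.

```python
from collections import defaultdict

# (firm id j, group g) pairs copied from the rows shown in the output above
rows = [
    (92, 6), (92, 6), (92, 6), (92, 6), (92, 6),
    (102, 1), (175, 4), (183, 2), (158, 7), (159, 7),
]

# Each firm must map to exactly one group
group_of_firm = {}
for j, g in rows:
    assert group_of_firm.setdefault(j, g) == g, f"firm {j} has two groups"

# Count distinct firms in each group
firms_by_group = defaultdict(set)
for j, g in rows:
    firms_by_group[g].add(j)

firms_per_group = {g: len(firms) for g, firms in sorted(firms_by_group.items())}
print(firms_per_group)
```

On the full dataframe the same check would run over all 50000 rows rather than the displayed excerpt.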

Advanced clustering

You can list all clustering parameters by running bpd.cluster_params().describe_all(). The rest of this section walks through some of the most important options.

Computing measures and selecting how to group on them

We compute measures using the bpd.measures module, and group on the computed measures using the bpd.grouping module.

Let’s use firm-level income CDFs as our measure, and group using KMeans.

[4]:

measures = bpd.measures.CDFs()
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'measures': measures,
            'grouping': grouping
        }
    )
)
display(bdf)

	i	j	y	t	g	m
0	0	92	-0.112951	0	2	0
1	0	92	0.921270	1	2	0
2	0	92	-0.733555	2	2	0
3	0	92	1.084979	3	2	0
4	0	92	1.404698	4	2	0
...	...	...	...	...	...	...
49995	9999	102	2.042254	0	7	1
49996	9999	175	2.445431	1	1	2
49997	9999	183	1.524885	2	4	2
49998	9999	158	2.641006	3	5	2
49999	9999	159	2.113294	4	5	1

50000 rows × 6 columns
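To build intuition for what "CDFs as the measure, KMeans as the grouping" means, here is a minimal library-independent sketch. It is not the BipartitePandas implementation: the cutoffs, the toy wages, the helper names firm_cdfs and kmeans, and the choice to seed centroids with the first k points are all assumptions made for illustration.

```python
from collections import defaultdict

def firm_cdfs(rows, cutoffs):
    """Map each firm j to its empirical wage CDF evaluated at `cutoffs`."""
    wages = defaultdict(list)
    for j, y in rows:
        wages[j].append(y)
    return {
        j: [sum(y <= c for y in ys) / len(ys) for c in cutoffs]
        for j, ys in wages.items()
    }

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy (firm j, wage y) rows: firms 0 and 1 pay low wages, firms 2 and 3 pay high wages
rows = [
    (0, -1.0), (0, -0.5), (0, 0.0),
    (1, -0.8), (1, -0.2), (1, 0.1),
    (2, 1.0), (2, 1.5), (2, 2.0),
    (3, 0.9), (3, 1.4), (3, 2.2),
]

cdfs = firm_cdfs(rows, cutoffs=[-0.5, 0.5, 1.5])
firms = sorted(cdfs)
labels = kmeans([cdfs[j] for j in firms], k=2)
groups = dict(zip(firms, labels))
print(groups)
```

Firms whose wage distributions look alike end up with similar CDF vectors, so the grouping step puts them in the same cluster.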

We can even group on multiple measures at once by passing a list of measures!

[5]:

measures = [
    bpd.measures.CDFs(),
    bpd.measures.Moments(measures=['mean', 'var'])
]
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'measures': measures,
            'grouping': grouping
        }
    )
)
display(bdf)

	i	j	y	t	g	m
0	0	92	-0.112951	0	4	0
1	0	92	0.921270	1	4	0
2	0	92	-0.733555	2	4	0
3	0	92	1.084979	3	4	0
4	0	92	1.404698	4	4	0
...	...	...	...	...	...	...
49995	9999	102	2.042254	0	0	1
49996	9999	175	2.445431	1	3	2
49997	9999	183	1.524885	2	2	2
49998	9999	158	2.641006	3	7	2
49999	9999	159	2.113294	4	7	1

50000 rows × 6 columns
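One natural way to group on several measures is to concatenate them into a single feature vector per firm before grouping. The sketch below assumes exactly that and is not taken from the BipartitePandas source; the cutoffs, the toy wages, the helper names, and the use of the population variance are assumptions for illustration.

```python
from collections import defaultdict
from statistics import fmean, pvariance

# Toy (firm j, wage y) rows
rows = [(0, -1.0), (0, 0.0), (0, 1.0), (1, 2.0), (1, 2.0), (1, 5.0)]
cutoffs = [0.0, 2.0]

wages = defaultdict(list)
for j, y in rows:
    wages[j].append(y)

def cdf_measure(ys):
    """Empirical CDF of one firm's wages at the fixed cutoffs."""
    return [sum(y <= c for y in ys) / len(ys) for c in cutoffs]

def moment_measure(ys):
    """Mean and (population) variance of one firm's wages."""
    return [fmean(ys), pvariance(ys)]

# One row per firm: CDF values followed by the two moments
features = {j: cdf_measure(ys) + moment_measure(ys) for j, ys in wages.items()}
for j, vec in sorted(features.items()):
    print(j, vec)
```

Measures on very different scales can dominate a distance-based grouping, so in practice it may be worth checking how each measure is scaled before combining them.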

Clustering on subsets of the data - stayers/movers/stays/moves

What if we want our measures to be computed using only movers or only stayers? We can specify stayers_movers. Note that some firms may then not be clustered (for example, firms with no observations in the chosen subset). These firms will have g=pd.NA; set 'dropna': True if you want to drop firms that don’t get clustered.

[6]:

bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'stayers_movers': 'movers'
        }
    )
)
display(bdf)

	i	j	y	t	g	m
0	0	92	-0.112951	0	4	0
1	0	92	0.921270	1	4	0
2	0	92	-0.733555	2	4	0
3	0	92	1.084979	3	4	0
4	0	92	1.404698	4	4	0
...	...	...	...	...	...	...
49995	9999	102	2.042254	0	4	1
49996	9999	175	2.445431	1	0	2
49997	9999	183	1.524885	2	5	2
49998	9999	158	2.641006	3	2	2
49999	9999	159	2.113294	4	2	1

50000 rows × 6 columns
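The sketch below illustrates why restricting to a subset can leave firms without a group. It assumes that a mover is a worker with at least one row where m > 0, and that a move is a single row where m > 0; these definitions, the toy rows, and the variable names are assumptions for illustration, not statements about the library's internals.

```python
# Toy long-format rows: (worker i, firm j, move indicator m)
rows = [
    (0, 10, 0), (0, 10, 0),              # worker 0 stays at firm 10
    (1, 10, 0), (1, 10, 1), (1, 11, 1),  # worker 1 moves from firm 10 to firm 11
    (2, 12, 0), (2, 12, 0),              # worker 2 stays at firm 12
]

# Worker-level subset: every row of any worker who moves at least once
movers = {i for i, _, m in rows if m > 0}
mover_rows = [r for r in rows if r[0] in movers]

# Observation-level subset: only the rows adjacent to a move
move_rows = [r for r in rows if r[2] > 0]

# Firms that never appear in the mover subset have no data to be clustered on
all_firms = {j for _, j, _ in rows}
unclustered = all_firms - {j for _, j, _ in mover_rows}
print(sorted(unclustered))
```

Here firm 12 employs only a stayer, so measures computed on movers alone say nothing about it.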

Clustering on subsets of the data - time

What if we instead want to cluster using only particular periods of the data? We can specify t. Again, some firms may not be clustered (for example, firms with no observations in the chosen periods). These firms will have g=pd.NA; set 'dropna': True if you want to drop firms that don’t get clustered.

[7]:

bdf = bdf.cluster(
    bpd.cluster_params(
        {
            't': [0, 1, 2]
        }
    )
)
display(bdf)

	i	j	y	t	g	m
0	0	92	-0.112951	0	3	0
1	0	92	0.921270	1	3	0
2	0	92	-0.733555	2	3	0
3	0	92	1.084979	3	3	0
4	0	92	1.404698	4	3	0
...	...	...	...	...	...	...
49995	9999	102	2.042254	0	3	1
49996	9999	175	2.445431	1	5	2
49997	9999	183	1.524885	2	4	2
49998	9999	158	2.641006	3	0	2
49999	9999	159	2.113294	4	0	1

50000 rows × 6 columns
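In the output above, rows from periods 3 and 4 still carry a g value, which suggests the label is attached to the firm rather than to individual rows. The sketch below mimics that behaviour in plain Python: firms seen in the chosen periods receive a label on all their rows, and the rest get None, standing in for pd.NA. The toy rows and the labels themselves are hypothetical placeholders for whatever the grouping step would return.

```python
periods = {0, 1, 2}

# Toy (firm j, period t) rows
rows = [(10, 0), (10, 3), (11, 1), (11, 2), (12, 3), (12, 4)]

# Only firms observed in the chosen periods can be clustered
clustered_firms = sorted({j for j, t in rows if t in periods})

# Hypothetical labels: a stand-in for the output of the grouping step
labels = {j: g for g, j in enumerate(clustered_firms)}

# Attach the firm-level label to every row of the firm, whatever its period
labelled = [(j, t, labels.get(j)) for j, t in rows]

# Analogue of dropping unclustered firms: keep only rows with a label
kept = [r for r in labelled if r[2] is not None]
print(labelled)
print(kept)
```

Firm 10 keeps its label on the period-3 row, while firm 12, observed only in periods 3 and 4, is left unlabelled and is removed by the final filter.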