Clustering

Import the BipartitePandas package

Make sure to install it using pip install bipartitepandas.

[1]:
import bipartitepandas as bpd

Get your data ready

For this notebook, we simulate data. The columns are worker id (i), firm id (j), income (y), and time period (t).

[2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
).clean()
display(bdf)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
          i    j          y    t    m
0         0   92  -0.112951    0    0
1         0   92   0.921270    1    0
2         0   92  -0.733555    2    0
3         0   92   1.084979    3    0
4         0   92   1.404698    4    0
...     ...  ...        ...  ...  ...
49995  9999  102   2.042254    0    1
49996  9999  175   2.445431    1    2
49997  9999  183   1.524885    2    2
49998  9999  158   2.641006    3    2
49999  9999  159   2.113294    4    1

50000 rows × 5 columns

Basic clustering

Clustering in BipartitePandas assigns each firm to a group (cluster).

Clustering is simple: just run .cluster(). Notice the new g column, which stores each firm's group!

[3]:
bdf = bdf.cluster()
display(bdf)
          i    j          y    t    g    m
0         0   92  -0.112951    0    6    0
1         0   92   0.921270    1    6    0
2         0   92  -0.733555    2    6    0
3         0   92   1.084979    3    6    0
4         0   92   1.404698    4    6    0
...     ...  ...        ...  ...  ...  ...
49995  9999  102   2.042254    0    1    1
49996  9999  175   2.445431    1    4    2
49997  9999  183   1.524885    2    2    2
49998  9999  158   2.641006    3    7    2
49999  9999  159   2.113294    4    7    1

50000 rows × 6 columns
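
Because groups are assigned to firms, every row that shares a firm id j should carry the same g. The following is a minimal sanity check in plain Python, using only the (j, g) pairs visible in the output above; the variable names are illustrative and not part of BipartitePandas.

```python
# (firm id j, cluster g) pairs copied from the rows displayed above
rows = [
    (92, 6), (92, 6), (92, 6), (92, 6), (92, 6),
    (102, 1), (175, 4), (183, 2), (158, 7), (159, 7),
]

# Collect the set of cluster labels seen for each firm
groups_by_firm = {}
for j, g in rows:
    groups_by_firm.setdefault(j, set()).add(g)

# A firm with more than one label would indicate a problem
inconsistent = {j: gs for j, gs in groups_by_firm.items() if len(gs) > 1}

# Firm -> cluster lookup built from the consistent labels
firm_to_group = {j: next(iter(gs)) for j, gs in groups_by_firm.items()}
```

Several firms can share a cluster: here firms 158 and 159 are both in group 7.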

Advanced clustering

You can inspect all clustering parameters by running bpd.cluster_params().describe_all(). The sections below go through some of the most important options.

Computing measures and selecting how to group on them

We compute measures using the bpd.measures module, and group on the computed measures using the bpd.grouping module.

Let’s use firm-level income CDFs as our measure, and group using KMeans.

[4]:
measures = bpd.measures.CDFs()
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'measures': measures,
            'grouping': grouping
        }
    )
)
display(bdf)
          i    j          y    t    g    m
0         0   92  -0.112951    0    2    0
1         0   92   0.921270    1    2    0
2         0   92  -0.733555    2    2    0
3         0   92   1.084979    3    2    0
4         0   92   1.404698    4    2    0
...     ...  ...        ...  ...  ...  ...
49995  9999  102   2.042254    0    7    1
49996  9999  175   2.445431    1    1    2
49997  9999  183   1.524885    2    4    2
49998  9999  158   2.641006    3    5    2
49999  9999  159   2.113294    4    5    1

50000 rows × 6 columns
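
To build intuition for what "firm-level income CDFs grouped with KMeans" means, here is a conceptual sketch in plain Python. It is not how bpd.measures.CDFs or bpd.grouping.KMeans are implemented; the helper names, the toy wages, and the grid are all made up for illustration. Each firm is summarized by its empirical income CDF evaluated at a few grid points, and firms with similar CDFs are then grouped by a basic k-means loop.

```python
import random

def empirical_cdf(values, grid):
    """Share of values at or below each grid point."""
    n = len(values)
    return [sum(v <= q for v in values) / n for q in grid]

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm on lists of floats; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data (hypothetical): firms 0 and 1 pay low wages, firms 2 and 3 pay high wages
wages_by_firm = {
    0: [-1.2, -0.8, -1.0, -0.9],
    1: [-1.1, -0.7, -1.3, -1.0],
    2: [1.0, 1.4, 0.9, 1.2],
    3: [1.1, 0.8, 1.3, 1.5],
}
grid = [-1.0, 0.0, 1.0]

firms = sorted(wages_by_firm)
features = [empirical_cdf(wages_by_firm[j], grid) for j in firms]
groups = dict(zip(firms, kmeans(features, k=2)))
```

In this toy example the two low-paying firms end up in one group and the two high-paying firms in the other. The numeric labels themselves are arbitrary, which is also why the g values in the notebook change from one clustering run to the next.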

We can even group on multiple measures by passing a list of measures!

[5]:
measures = [
    bpd.measures.CDFs(),
    bpd.measures.Moments(measures=['mean', 'var'])
]
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'measures': measures,
            'grouping': grouping
        }
    )
)
display(bdf)
          i    j          y    t    g    m
0         0   92  -0.112951    0    4    0
1         0   92   0.921270    1    4    0
2         0   92  -0.733555    2    4    0
3         0   92   1.084979    3    4    0
4         0   92   1.404698    4    4    0
...     ...  ...        ...  ...  ...  ...
49995  9999  102   2.042254    0    0    1
49996  9999  175   2.445431    1    3    2
49997  9999  183   1.524885    2    2    2
49998  9999  158   2.641006    3    7    2
49999  9999  159   2.113294    4    7    1

50000 rows × 6 columns
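
One natural way to think about grouping on several measures is that each firm gets a single feature row made by concatenating the individual measures. The sketch below illustrates that idea in plain Python; it is an assumption about the concept, not a description of BipartitePandas internals, and the helper names, the toy incomes, and the choice of population variance are all hypothetical.

```python
from statistics import mean, pvariance

def empirical_cdf(values, grid):
    """Share of values at or below each grid point."""
    n = len(values)
    return [sum(v <= q for v in values) / n for q in grid]

def firm_features(values, grid):
    """CDF values followed by mean and variance, as one combined feature row."""
    return empirical_cdf(values, grid) + [mean(values), pvariance(values)]

wages = [1.0, 2.0, 3.0, 4.0]   # made-up incomes for a single firm
grid = [1.5, 2.5, 3.5]

row = firm_features(wages, grid)
```

Note that the CDF entries lie in [0, 1] while the moments are on the scale of income, so in a hand-rolled version the measures may need rescaling before grouping. Whether and how the library weights or rescales measures is not covered here.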

Clustering on subsets of the data - stayers/movers/stays/moves

What if we want the measures to be computed using only movers or only stayers? We can set stayers_movers. Note that some firms may then not be clustered; these firms will have g=pd.NA (set 'dropna': True if you want to drop firms that are not clustered).

[6]:
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'stayers_movers': 'movers'
        }
    )
)
display(bdf)
          i    j          y    t    g    m
0         0   92  -0.112951    0    4    0
1         0   92   0.921270    1    4    0
2         0   92  -0.733555    2    4    0
3         0   92   1.084979    3    4    0
4         0   92   1.404698    4    4    0
...     ...  ...        ...  ...  ...  ...
49995  9999  102   2.042254    0    4    1
49996  9999  175   2.445431    1    0    2
49997  9999  183   1.524885    2    5    2
49998  9999  158   2.641006    3    2    2
49999  9999  159   2.113294    4    2    1

50000 rows × 6 columns
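
The following plain-Python sketch shows why restricting to movers can leave some firms without a group. It uses made-up rows and assumes that a worker counts as a mover when any of their rows has m > 0; the library's own definition may differ in detail.

```python
# Made-up long-format rows: (worker i, firm j, m), with m > 0 marking a move
rows = [
    (0, 10, 0), (0, 10, 0),   # worker 0 stays at firm 10
    (1, 11, 1), (1, 12, 1),   # worker 1 moves from firm 11 to firm 12
    (2, 12, 0), (2, 12, 0),   # worker 2 stays at firm 12
]

# Workers with at least one move
movers = {i for i, _, m in rows if m > 0}

all_firms = {j for _, j, _ in rows}
# Firms that have at least one observation from a mover
firms_with_movers = {j for i, j, _ in rows if i in movers}

# Firms with no mover observations have nothing to compute a measure on
unclustered = all_firms - firms_with_movers
```

Here firm 10 employs only a stayer, so it is the firm that would be left without a group when clustering on movers only.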

Clustering on subsets of the data - time

Similarly, what if we want to cluster using only particular periods of the data? We can set t to a list of periods. Again, note that some firms may not be clustered; these firms will have g=pd.NA (set 'dropna': True if you want to drop firms that are not clustered).

[7]:
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            't': [0, 1, 2]
        }
    )
)
display(bdf)
          i    j          y    t    g    m
0         0   92  -0.112951    0    3    0
1         0   92   0.921270    1    3    0
2         0   92  -0.733555    2    3    0
3         0   92   1.084979    3    3    0
4         0   92   1.404698    4    3    0
...     ...  ...        ...  ...  ...  ...
49995  9999  102   2.042254    0    3    1
49996  9999  175   2.445431    1    5    2
49997  9999  183   1.524885    2    4    2
49998  9999  158   2.641006    3    0    2
49999  9999  159   2.113294    4    0    1

50000 rows × 6 columns
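
Notice that rows with t=3 and t=4 still receive a group in the output above: the period list limits which observations feed the measures, while the resulting firm label applies to all of that firm's rows. The plain-Python sketch below illustrates this together with the effect of dropping unclustered firms. The rows are made up, None stands in for pd.NA, and the single placeholder label 0 replaces a real clustering step.

```python
# Made-up rows: (firm j, period t)
rows = [(20, 0), (20, 1), (21, 2), (21, 3), (22, 3), (22, 4)]
periods = {0, 1, 2}

# Firms observed at least once in the selected periods can be clustered
clustered_firms = {j for j, t in rows if t in periods}

# Placeholder labels: 0 for clustered firms, None for unclustered ones
g = {j: (0 if j in clustered_firms else None) for j, _ in rows}

# Analogue of 'dropna': True, removing every row of an unclustered firm
kept = [(j, t) for j, t in rows if g[j] is not None]
```

Firm 21 keeps its period-3 row because it was observed in period 2, whereas firm 22 appears only in periods 3 and 4 and is dropped entirely.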