Clustering
Import the BipartitePandas package
Make sure to install it first with pip install bipartitepandas.
[1]:
import bipartitepandas as bpd
Get your data ready
For this notebook, we simulate data.
[2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(
    i=df['i'], j=df['j'], y=df['y'], t=df['t']
).clean()
display(bdf)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index
| | i | j | y | t | m |
---|---|---|---|---|---|
0 | 0 | 92 | -0.112951 | 0 | 0 |
1 | 0 | 92 | 0.921270 | 1 | 0 |
2 | 0 | 92 | -0.733555 | 2 | 0 |
3 | 0 | 92 | 1.084979 | 3 | 0 |
4 | 0 | 92 | 1.404698 | 4 | 0 |
... | ... | ... | ... | ... | ... |
49995 | 9999 | 102 | 2.042254 | 0 | 1 |
49996 | 9999 | 175 | 2.445431 | 1 | 2 |
49997 | 9999 | 183 | 1.524885 | 2 | 2 |
49998 | 9999 | 158 | 2.641006 | 3 | 2 |
49999 | 9999 | 159 | 2.113294 | 4 | 1 |
50000 rows × 5 columns
Basic clustering
Clustering in BipartitePandas estimates firm groups.
Clustering is simple: just run .cluster() and notice the new g column, which holds each firm's group!
[3]:
bdf = bdf.cluster()
display(bdf)
| | i | j | y | t | g | m |
---|---|---|---|---|---|---|
0 | 0 | 92 | -0.112951 | 0 | 6 | 0 |
1 | 0 | 92 | 0.921270 | 1 | 6 | 0 |
2 | 0 | 92 | -0.733555 | 2 | 6 | 0 |
3 | 0 | 92 | 1.084979 | 3 | 6 | 0 |
4 | 0 | 92 | 1.404698 | 4 | 6 | 0 |
... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 102 | 2.042254 | 0 | 1 | 1 |
49996 | 9999 | 175 | 2.445431 | 1 | 4 | 2 |
49997 | 9999 | 183 | 1.524885 | 2 | 2 | 2 |
49998 | 9999 | 158 | 2.641006 | 3 | 7 | 2 |
49999 | 9999 | 159 | 2.113294 | 4 | 7 | 1 |
50000 rows × 6 columns
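Because clustering estimates firm groups, every row that shares a firm id j should carry the same g. A minimal sketch of that check, using made-up (j, g) pairs in plain Python rather than the BipartiteDataFrame itself (the helper name clusters_per_firm is hypothetical, not part of BipartitePandas):

```python
from collections import defaultdict

# Toy (firm id j, group g) pairs, one per observation; values are illustrative only.
observations = [(92, 6), (92, 6), (92, 6), (102, 1), (175, 4), (158, 7), (159, 7)]


def clusters_per_firm(pairs):
    """Collect the set of group labels seen for each firm (hypothetical helper)."""
    seen = defaultdict(set)
    for firm, group in pairs:
        seen[firm].add(group)
    return dict(seen)


labels = clusters_per_firm(observations)
# Firm-level clustering implies exactly one label per firm.
is_firm_level = all(len(groups) == 1 for groups in labels.values())
print(is_firm_level)
```

Several firms can share a group (firms 158 and 159 above), but a single firm should never appear with two different groups.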
Advanced clustering
You can investigate all clustering parameters by running bpd.cluster_params().describe_all(). We are going to go through some of the most important options.
Computing measures and selecting how to group on them
We compute measures using the bpd.measures module, and group on the computed measures using the bpd.grouping module.
Let’s use firm-level income CDFs as our measure, and group using KMeans.
[4]:
measures = bpd.measures.CDFs()
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'measures': measures,
            'grouping': grouping
        }
    )
)
display(bdf)
| | i | j | y | t | g | m |
---|---|---|---|---|---|---|
0 | 0 | 92 | -0.112951 | 0 | 2 | 0 |
1 | 0 | 92 | 0.921270 | 1 | 2 | 0 |
2 | 0 | 92 | -0.733555 | 2 | 2 | 0 |
3 | 0 | 92 | 1.084979 | 3 | 2 | 0 |
4 | 0 | 92 | 1.404698 | 4 | 2 | 0 |
... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 102 | 2.042254 | 0 | 7 | 1 |
49996 | 9999 | 175 | 2.445431 | 1 | 1 | 2 |
49997 | 9999 | 183 | 1.524885 | 2 | 4 | 2 |
49998 | 9999 | 158 | 2.641006 | 3 | 5 | 2 |
49999 | 9999 | 159 | 2.113294 | 4 | 5 | 1 |
50000 rows × 6 columns
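To make the idea of grouping firms on income CDFs more concrete, here is a rough, self-contained sketch in plain Python. It is not the library's implementation: the toy incomes, the evaluation grid, and the helper names empirical_cdf and kmeans are all assumptions made for illustration, and bpd.measures.CDFs() and bpd.grouping.KMeans() may differ in how they choose grid points, weight firms, or initialize centers.

```python
def empirical_cdf(values, grid):
    """Share of values at or below each grid point."""
    return [sum(v <= q for v in values) / len(values) for q in grid]


def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm, seeded with the first k points (illustrative only)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up incomes per firm id: firms 0 and 1 pay low, firms 2 and 3 pay high.
firm_incomes = {
    0: [-1.0, -0.5, 0.0],
    1: [-0.8, -0.4, 0.1],
    2: [1.0, 1.5, 2.0],
    3: [0.9, 1.6, 2.2],
}
grid = [-0.5, 0.5, 1.5]  # assumed income thresholds at which each CDF is evaluated

firms = sorted(firm_incomes)
features = [empirical_cdf(firm_incomes[j], grid) for j in firms]
groups = dict(zip(firms, kmeans(features, k=2)))
print(groups)
```

Each firm becomes one point whose coordinates are its CDF values on a common grid, so firms with similar income distributions end up close together and receive the same group.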
We can even group on multiple measures!
[5]:
measures = [
    bpd.measures.CDFs(),
    bpd.measures.Moments(measures=['mean', 'var'])
]
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'measures': measures,
            'grouping': grouping
        }
    )
)
display(bdf)
| | i | j | y | t | g | m |
---|---|---|---|---|---|---|
0 | 0 | 92 | -0.112951 | 0 | 4 | 0 |
1 | 0 | 92 | 0.921270 | 1 | 4 | 0 |
2 | 0 | 92 | -0.733555 | 2 | 4 | 0 |
3 | 0 | 92 | 1.084979 | 3 | 4 | 0 |
4 | 0 | 92 | 1.404698 | 4 | 4 | 0 |
... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 102 | 2.042254 | 0 | 0 | 1 |
49996 | 9999 | 175 | 2.445431 | 1 | 3 | 2 |
49997 | 9999 | 183 | 1.524885 | 2 | 2 | 2 |
49998 | 9999 | 158 | 2.641006 | 3 | 7 | 2 |
49999 | 9999 | 159 | 2.113294 | 4 | 7 | 1 |
50000 rows × 6 columns
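One plausible way to picture grouping on several measures is that each measure contributes columns to a single firm-level feature vector, which is then passed to the grouping step. The sketch below assumes exactly that; the toy incomes, the grid, and the helpers cdf_features and moment_features are hypothetical, and the library may scale, weight, or compute these measures differently.

```python
from statistics import mean, pvariance

# Made-up incomes per firm id.
firm_incomes = {0: [-1.0, -0.5, 0.0], 1: [1.0, 1.5, 2.0]}
grid = [-0.5, 0.5, 1.5]  # assumed thresholds for the CDF measure


def cdf_features(values, grid):
    """Share of values at or below each grid point."""
    return [sum(v <= q for v in values) / len(values) for q in grid]


def moment_features(values):
    """Mean and population variance (the variance convention is an assumption)."""
    return [mean(values), pvariance(values)]


# Concatenate both measures into one feature vector per firm.
features = {
    j: cdf_features(y, grid) + moment_features(y)
    for j, y in firm_incomes.items()
}
print(len(features[0]))
```

With three grid points and two moments, every firm is described by five numbers, and a grouping method such as k-means would operate on those five-dimensional points.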
Clustering on subsets of the data - stayers/movers/stays/moves
What if we want our measures to be computed using only movers or only stayers? We can specify stayers_movers. Note that some firms may not be clustered: these firms will have g=pd.NA (set 'dropna': True if you want to drop firms that don’t get clustered).
[6]:
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            'stayers_movers': 'movers'
        }
    )
)
display(bdf)
| | i | j | y | t | g | m |
---|---|---|---|---|---|---|
0 | 0 | 92 | -0.112951 | 0 | 4 | 0 |
1 | 0 | 92 | 0.921270 | 1 | 4 | 0 |
2 | 0 | 92 | -0.733555 | 2 | 4 | 0 |
3 | 0 | 92 | 1.084979 | 3 | 4 | 0 |
4 | 0 | 92 | 1.404698 | 4 | 4 | 0 |
... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 102 | 2.042254 | 0 | 4 | 1 |
49996 | 9999 | 175 | 2.445431 | 1 | 0 | 2 |
49997 | 9999 | 183 | 1.524885 | 2 | 5 | 2 |
49998 | 9999 | 158 | 2.641006 | 3 | 2 | 2 |
49999 | 9999 | 159 | 2.113294 | 4 | 2 | 1 |
50000 rows × 6 columns
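To see why some firms can end up without a group when only movers are used, consider the following plain-Python sketch. It assumes a mover is a worker observed at more than one firm, and uses made-up (i, j) observations; it does not call BipartitePandas.

```python
from collections import defaultdict

# Made-up (worker id i, firm id j) observations.
observations = [(0, 10), (0, 10), (1, 10), (1, 11), (2, 12), (2, 12)]

# Record every firm each worker is observed at.
firms_by_worker = defaultdict(set)
for worker, firm in observations:
    firms_by_worker[worker].add(firm)

# Assumption: a mover is a worker observed at more than one firm.
movers = {worker for worker, firms in firms_by_worker.items() if len(firms) > 1}

# Only firms with at least one mover observation have data to compute measures on.
firms_with_movers = {firm for worker, firm in observations if worker in movers}
all_firms = {firm for _, firm in observations}
unclustered = all_firms - firms_with_movers
print(sorted(unclustered))
```

Firm 12 employs only a stayer, so a movers-only measure has nothing to compute for it, which is the situation where g would be missing.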
Clustering on subsets of the data - time
On the other hand, what if we want to cluster on particular periods of data? We can specify t. Again, note that some firms may not be clustered: these firms will have g=pd.NA (set 'dropna': True if you want to drop firms that don’t get clustered).
[7]:
bdf = bdf.cluster(
    bpd.cluster_params(
        {
            't': [0, 1, 2]
        }
    )
)
display(bdf)
| | i | j | y | t | g | m |
---|---|---|---|---|---|---|
0 | 0 | 92 | -0.112951 | 0 | 3 | 0 |
1 | 0 | 92 | 0.921270 | 1 | 3 | 0 |
2 | 0 | 92 | -0.733555 | 2 | 3 | 0 |
3 | 0 | 92 | 1.084979 | 3 | 3 | 0 |
4 | 0 | 92 | 1.404698 | 4 | 3 | 0 |
... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 102 | 2.042254 | 0 | 3 | 1 |
49996 | 9999 | 175 | 2.445431 | 1 | 5 | 2 |
49997 | 9999 | 183 | 1.524885 | 2 | 4 | 2 |
49998 | 9999 | 158 | 2.641006 | 3 | 0 | 2 |
49999 | 9999 | 159 | 2.113294 | 4 | 0 | 1 |
50000 rows × 6 columns
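The interaction between a period filter and dropping unclustered firms can be sketched in plain Python as follows. The rows are made up, and the sketch assumes that dropping unclustered firms removes every row belonging to such a firm; it illustrates the idea rather than the library's code path.

```python
# Made-up observations with firm id j and period t.
rows = [
    {"j": 10, "t": 0},
    {"j": 10, "t": 3},
    {"j": 11, "t": 1},
    {"j": 12, "t": 3},
    {"j": 12, "t": 4},
]
periods = {0, 1, 2}  # periods used to compute the measures

# Firms observed in at least one selected period can be clustered.
clustered_firms = {row["j"] for row in rows if row["t"] in periods}
unclustered = {row["j"] for row in rows} - clustered_firms

# Assumed effect of dropping unclustered firms: remove all of their rows.
kept_rows = [row for row in rows if row["j"] in clustered_firms]
print(len(kept_rows), sorted(unclustered))
```

Note that firm 10's observation in period 3 is kept: the firm has a group because it appears in a selected period, and under this sketch the group applies to all of its rows.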