{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Connected sets\n", "\n", "## Import the BipartitePandas package\n", "\n", "Make sure to install it using `pip install bipartitepandas`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import bipartitepandas as bpd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get your data ready\n", "\n", "For this notebook, we simulate data (we set parameters to make the connected sets interesting)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>i</th><th>j</th><th>y</th><th>t</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>164</td><td>-1.255266</td><td>0</td></tr>\n",
"<tr><th>1</th><td>0</td><td>164</td><td>0.395999</td><td>1</td></tr>\n",
"<tr><th>2</th><td>0</td><td>164</td><td>-1.273443</td><td>2</td></tr>\n",
"<tr><th>3</th><td>0</td><td>164</td><td>-1.404883</td><td>3</td></tr>\n",
"<tr><th>4</th><td>0</td><td>164</td><td>-0.783187</td><td>4</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>49995</th><td>9999</td><td>786</td><td>-0.510697</td><td>0</td></tr>\n",
"<tr><th>49996</th><td>9999</td><td>786</td><td>1.994076</td><td>1</td></tr>\n",
"<tr><th>49997</th><td>9999</td><td>786</td><td>2.570977</td><td>2</td></tr>\n",
"<tr><th>49998</th><td>9999</td><td>786</td><td>0.535605</td><td>3</td></tr>\n",
"<tr><th>49999</th><td>9999</td><td>786</td><td>-0.202017</td><td>4</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>50000 rows × 4 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t\n", "0 0 164 -1.255266 0\n", "1 0 164 0.395999 1\n", "2 0 164 -1.273443 2\n", "3 0 164 -1.404883 3\n", "4 0 164 -0.783187 4\n", "... ... ... ... ..\n", "49995 9999 786 -0.510697 0\n", "49996 9999 786 1.994076 1\n", "49997 9999 786 2.570977 2\n", "49998 9999 786 0.535605 3\n", "49999 9999 786 -0.202017 4\n", "\n", "[50000 rows x 4 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = bpd.SimBipartite(\n", " bpd.sim_params(\n", " {\n", " 'firm_size': 10,\n", " 'p_move': 0.05\n", " }\n", " )\n", ").simulate()\n", "bdf = bpd.BipartiteDataFrame(\n", " i=df['i'], j=df['j'], y=df['y'], t=df['t']\n", ")\n", "display(bdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing connected sets\n", "\n", "There are eight connectedness options:\n", "\n", "- None\n", "- Connected\n", "- Strongly connected\n", "- Leave-out-observation\n", "- Leave-out-spell\n", "- Leave-out-match\n", "- Leave-out-worker\n", "- Leave-out-firm\n", "\n", "These are specified in the cleaning parameters dictionary under the key `'connectedness'`. We will demonstrate `'connectedness' = None` and `'connectedness' = 'leave_out_observation'`.\n", "\n", "
\n", "\n", "Note\n", "\n", "Leave-out-spell and leave-out-match are distinguished by workers who leave a firm then return to it.\n", "\n", "
\n", "\n", "Note\n", "\n", "Stayers who have only a single observation after computing the largest connected set can be dropped by specifying `'drop_single_stayers' = True` in your cleaning parameters dictionary.\n", "\n", "\n", "\n", "
\n", "\n", "Warning\n", "\n", "Connectedness is not necessarily maintained between non-collapsed and collapsed formats. Therefore, if you plan to use connected, collapsed data, it is recommended to set the connectedness measure to the level at which you would like to collapse your data (`'leave_out_spell'` or `'leave_out_match'`), and to set `'collapse_at_connectedness_measure' = True` in your cleaning parameters dictionary. An example is given below.\n", "\n", "
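To make the idea of a connected set concrete, here is a minimal, library-independent sketch of the plain 'connected' case only. It is not how BipartitePandas computes connected sets, and it does not cover the leave-out variants. The function name `largest_connected_set` and the choice to measure component size by number of firms are assumptions made for this illustration: two firms are treated as linked when at least one worker is observed at both, and only observations in the largest group of linked firms are kept.

```python
from collections import Counter


def largest_connected_set(observations):
    '''Keep the (worker, firm) pairs whose firm is in the largest connected set.

    Hypothetical helper for illustration; component size is the number of firms.
    '''
    observations = list(observations)
    if not observations:
        return []

    parent = {}  # union-find forest over firm ids

    def find(firm):
        # Register unseen firms as their own root, then walk up with path halving
        parent.setdefault(firm, firm)
        while parent[firm] != firm:
            parent[firm] = parent[parent[firm]]
            firm = parent[firm]
        return firm

    first_firm = {}  # first firm at which each worker was observed
    for worker, firm in observations:
        if worker in first_firm:
            # The worker links this firm to the first firm they were seen at
            root_a, root_b = find(first_firm[worker]), find(firm)
            parent[root_a] = root_b
        else:
            find(firm)
            first_firm[worker] = firm

    # Count firms per component and keep the largest component
    sizes = Counter(find(firm) for firm in list(parent))
    largest_root = sizes.most_common(1)[0][0]
    return [(w, f) for w, f in observations if find(f) == largest_root]


# Firms A, B, C are linked by workers 0 and 1; firms D, E are linked by worker 3
toy = [(0, 'A'), (0, 'B'), (1, 'B'), (1, 'C'), (2, 'D'), (3, 'D'), (3, 'E')]
print(largest_connected_set(toy))
# prints [(0, 'A'), (0, 'B'), (1, 'B'), (1, 'C')]
```

In this toy data, the three-firm component wins over the two-firm component, so the observations of workers 2 and 3 are dropped.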
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 'connectedness' = None" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>i</th><th>j</th><th>y</th><th>t</th><th>m</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>164</td><td>-1.255266</td><td>0</td><td>0</td></tr>\n",
"<tr><th>1</th><td>0</td><td>164</td><td>0.395999</td><td>1</td><td>0</td></tr>\n",
"<tr><th>2</th><td>0</td><td>164</td><td>-1.273443</td><td>2</td><td>0</td></tr>\n",
"<tr><th>3</th><td>0</td><td>164</td><td>-1.404883</td><td>3</td><td>0</td></tr>\n",
"<tr><th>4</th><td>0</td><td>164</td><td>-0.783187</td><td>4</td><td>0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>49995</th><td>9999</td><td>786</td><td>-0.510697</td><td>0</td><td>0</td></tr>\n",
"<tr><th>49996</th><td>9999</td><td>786</td><td>1.994076</td><td>1</td><td>0</td></tr>\n",
"<tr><th>49997</th><td>9999</td><td>786</td><td>2.570977</td><td>2</td><td>0</td></tr>\n",
"<tr><th>49998</th><td>9999</td><td>786</td><td>0.535605</td><td>3</td><td>0</td></tr>\n",
"<tr><th>49999</th><td>9999</td><td>786</td><td>-0.202017</td><td>4</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>50000 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t m\n", "0 0 164 -1.255266 0 0\n", "1 0 164 0.395999 1 0\n", "2 0 164 -1.273443 2 0\n", "3 0 164 -1.404883 3 0\n", "4 0 164 -0.783187 4 0\n", "... ... ... ... .. ..\n", "49995 9999 786 -0.510697 0 0\n", "49996 9999 786 1.994076 1 0\n", "49997 9999 786 2.570977 2 0\n", "49998 9999 786 0.535605 3 0\n", "49999 9999 786 -0.202017 4 0\n", "\n", "[50000 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conn_none = bdf.clean(\n", " bpd.clean_params(\n", " {\n", " 'connectedness': None,\n", " 'verbose': False\n", " }\n", " )\n", ")\n", "display(conn_none)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 'connectedness' = 'leave_out_observation'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>i</th><th>j</th><th>y</th><th>t</th><th>m</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>0</td><td>-1.255266</td><td>0</td><td>0</td></tr>\n",
"<tr><th>1</th><td>0</td><td>0</td><td>0.395999</td><td>1</td><td>0</td></tr>\n",
"<tr><th>2</th><td>0</td><td>0</td><td>-1.273443</td><td>2</td><td>0</td></tr>\n",
"<tr><th>3</th><td>0</td><td>0</td><td>-1.404883</td><td>3</td><td>0</td></tr>\n",
"<tr><th>4</th><td>0</td><td>0</td><td>-0.783187</td><td>4</td><td>0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>44856</th><td>9014</td><td>876</td><td>-0.510697</td><td>0</td><td>0</td></tr>\n",
"<tr><th>44857</th><td>9014</td><td>876</td><td>1.994076</td><td>1</td><td>0</td></tr>\n",
"<tr><th>44858</th><td>9014</td><td>876</td><td>2.570977</td><td>2</td><td>0</td></tr>\n",
"<tr><th>44859</th><td>9014</td><td>876</td><td>0.535605</td><td>3</td><td>0</td></tr>\n",
"<tr><th>44860</th><td>9014</td><td>876</td><td>-0.202017</td><td>4</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>44861 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t m\n", "0 0 0 -1.255266 0 0\n", "1 0 0 0.395999 1 0\n", "2 0 0 -1.273443 2 0\n", "3 0 0 -1.404883 3 0\n", "4 0 0 -0.783187 4 0\n", "... ... ... ... .. ..\n", "44856 9014 876 -0.510697 0 0\n", "44857 9014 876 1.994076 1 0\n", "44858 9014 876 2.570977 2 0\n", "44859 9014 876 0.535605 3 0\n", "44860 9014 876 -0.202017 4 0\n", "\n", "[44861 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conn_loo = bdf.clean(\n", " bpd.clean_params(\n", " {\n", " 'connectedness': 'leave_out_observation',\n", " 'verbose': False\n", " }\n", " )\n", ")\n", "display(conn_loo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connected sets for collapsed data\n", "\n", "As mentioned above, connectedness is not necessarily maintained between non-collapsed and collapsed formats.\n", "\n", "Here we show an example that demonstrates this, then show how setting `'collapse_at_connectedness_measure' = True` in your cleaning parameters dictionary will give the correct results, all in one line." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>i</th><th>j</th><th>y</th><th>t1</th><th>t2</th><th>w</th><th>m</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>0</td><td>-0.864156</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>1</th><td>1</td><td>1</td><td>-2.076316</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>2</th><td>2</td><td>2</td><td>-0.360893</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>3</th><td>3</td><td>3</td><td>-1.256533</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>4</th><td>4</td><td>4</td><td>1.233623</td><td>1</td><td>4</td><td>4</td><td>0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>10891</th><td>9011</td><td>697</td><td>0.423780</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>10892</th><td>9012</td><td>476</td><td>0.385739</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>10893</th><td>9013</td><td>796</td><td>1.546216</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"<tr><th>10894</th><td>9013</td><td>381</td><td>-0.333299</td><td>1</td><td>4</td><td>4</td><td>1</td></tr>\n",
"<tr><th>10895</th><td>9014</td><td>876</td><td>0.877589</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>10896 rows × 7 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t1 t2 w m\n", "0 0 0 -0.864156 0 4 5 0\n", "1 1 1 -2.076316 0 4 5 0\n", "2 2 2 -0.360893 0 4 5 0\n", "3 3 3 -1.256533 0 4 5 0\n", "4 4 4 1.233623 1 4 4 0\n", "... ... ... ... .. .. .. ..\n", "10891 9011 697 0.423780 0 4 5 0\n", "10892 9012 476 0.385739 0 4 5 0\n", "10893 9013 796 1.546216 0 0 1 1\n", "10894 9013 381 -0.333299 1 4 4 1\n", "10895 9014 876 0.877589 0 4 5 0\n", "\n", "[10896 rows x 7 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "coll_conn_loo_wrong = conn_loo.collapse(level='spell')\n", "display(coll_conn_loo_wrong)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>i</th><th>j</th><th>y</th><th>t1</th><th>t2</th><th>w</th><th>m</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>0</td><td>-0.864156</td><td>0</td><td>4</td><td>5.0</td><td>0</td></tr>\n",
"<tr><th>1</th><td>1</td><td>1</td><td>-2.076316</td><td>0</td><td>4</td><td>5.0</td><td>0</td></tr>\n",
"<tr><th>2</th><td>2</td><td>2</td><td>-0.360893</td><td>0</td><td>4</td><td>5.0</td><td>0</td></tr>\n",
"<tr><th>3</th><td>3</td><td>3</td><td>-1.256533</td><td>0</td><td>4</td><td>5.0</td><td>0</td></tr>\n",
"<tr><th>4</th><td>4</td><td>4</td><td>1.233623</td><td>1</td><td>4</td><td>4.0</td><td>0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>10878</th><td>8999</td><td>696</td><td>0.423780</td><td>0</td><td>4</td><td>5.0</td><td>0</td></tr>\n",
"<tr><th>10879</th><td>9000</td><td>475</td><td>0.385739</td><td>0</td><td>4</td><td>5.0</td><td>0</td></tr>\n",
"<tr><th>10880</th><td>9001</td><td>795</td><td>1.546216</td><td>0</td><td>0</td><td>1.0</td><td>1</td></tr>\n",
"<tr><th>10881</th><td>9001</td><td>380</td><td>-0.333299</td><td>1</td><td>4</td><td>4.0</td><td>1</td></tr>\n",
"<tr><th>10882</th><td>9002</td><td>875</td><td>0.877589</td><td>0</td><td>4</td><td>5.0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>10883 rows × 7 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t1 t2 w m\n", "0 0 0 -0.864156 0 4 5.0 0\n", "1 1 1 -2.076316 0 4 5.0 0\n", "2 2 2 -0.360893 0 4 5.0 0\n", "3 3 3 -1.256533 0 4 5.0 0\n", "4 4 4 1.233623 1 4 4.0 0\n", "... ... ... ... .. .. ... ..\n", "10878 8999 696 0.423780 0 4 5.0 0\n", "10879 9000 475 0.385739 0 4 5.0 0\n", "10880 9001 795 1.546216 0 0 1.0 1\n", "10881 9001 380 -0.333299 1 4 4.0 1\n", "10882 9002 875 0.877589 0 4 5.0 0\n", "\n", "[10883 rows x 7 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "coll_conn_loo_right_1 = bdf.clean(\n", " bpd.clean_params(\n", " {\n", " 'connectedness': None,\n", " 'verbose': False\n", " }\n", " )\n", ").collapse(level='spell').clean(\n", " bpd.clean_params(\n", " {\n", " 'connectedness': 'leave_out_observation',\n", " 'verbose': False\n", " }\n", " )\n", ")\n", "display(coll_conn_loo_right_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simpler code\n", "\n", "Instead of cleaning, collapsing, then cleaning again, we can do it all at once by specifying `'connectedness' = 'leave_out_spell'` (or `'leave_out_match'`) and `'collapse_at_connectedness_measure' = True`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>i</th><th>j</th><th>y</th><th>t1</th><th>t2</th><th>w</th><th>m</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>0</td><td>-0.864156</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>1</th><td>1</td><td>1</td><td>-2.076316</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>2</th><td>2</td><td>2</td><td>-0.360893</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>3</th><td>3</td><td>3</td><td>-1.256533</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>4</th><td>4</td><td>4</td><td>1.233623</td><td>1</td><td>4</td><td>4</td><td>0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>10878</th><td>8999</td><td>696</td><td>0.423780</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>10879</th><td>9000</td><td>475</td><td>0.385739</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"<tr><th>10880</th><td>9001</td><td>795</td><td>1.546216</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"<tr><th>10881</th><td>9001</td><td>380</td><td>-0.333299</td><td>1</td><td>4</td><td>4</td><td>1</td></tr>\n",
"<tr><th>10882</th><td>9002</td><td>875</td><td>0.877589</td><td>0</td><td>4</td><td>5</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>10883 rows × 7 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t1 t2 w m\n", "0 0 0 -0.864156 0 4 5 0\n", "1 1 1 -2.076316 0 4 5 0\n", "2 2 2 -0.360893 0 4 5 0\n", "3 3 3 -1.256533 0 4 5 0\n", "4 4 4 1.233623 1 4 4 0\n", "... ... ... ... .. .. .. ..\n", "10878 8999 696 0.423780 0 4 5 0\n", "10879 9000 475 0.385739 0 4 5 0\n", "10880 9001 795 1.546216 0 0 1 1\n", "10881 9001 380 -0.333299 1 4 4 1\n", "10882 9002 875 0.877589 0 4 5 0\n", "\n", "[10883 rows x 7 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "coll_conn_loo_right_2 = bdf.clean(\n", " bpd.clean_params(\n", " {\n", " 'connectedness': 'leave_out_spell',\n", " 'collapse_at_connectedness_measure': True,\n", " 'verbose': False\n", " }\n", " )\n", ")\n", "display(coll_conn_loo_right_2)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.4 ('tw-env')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" }, "vscode": { "interpreter": { "hash": "cc42643cd328f586448453625167b9fc1f3293a5a6342212b4e72707ccae656b" } } }, "nbformat": 4, "nbformat_minor": 4 }