{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced features\n", "\n", "## Import the BipartitePandas package\n", "\n", "Make sure to install it using `pip install bipartitepandas`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import bipartitepandas as bpd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get your data ready\n", "\n", "For this notebook, we simulate data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ijyt
0066-0.8842220
1049-2.2249771
2049-1.7114292
30115-0.0824403
406-1.1905404
...............
4999599991460.5052000
499969999114-0.0952901
499979999114-1.2407152
4999899991140.7013393
499999999114-0.6102464
\n", "

50000 rows × 4 columns

\n", "
" ], "text/plain": [ " i j y t\n", "0 0 66 -0.884222 0\n", "1 0 49 -2.224977 1\n", "2 0 49 -1.711429 2\n", "3 0 115 -0.082440 3\n", "4 0 6 -1.190540 4\n", "... ... ... ... ..\n", "49995 9999 146 0.505200 0\n", "49996 9999 114 -0.095290 1\n", "49997 9999 114 -1.240715 2\n", "49998 9999 114 0.701339 3\n", "49999 9999 114 -0.610246 4\n", "\n", "[50000 rows x 4 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = bpd.SimBipartite().simulate()\n", "bdf = bpd.BipartiteDataFrame(\n", " i=df['i'], j=df['j'], y=df['y'], t=df['t']\n", ")\n", "display(bdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced data cleaning\n", "\n", "
\n", "\n", "Hint\n", "\n", "Want details on all cleaning parameters? Run `bpd.clean_params().describe_all()`, or search through `bpd.clean_params().keys()` for a particular key, and then run `bpd.clean_params().describe(key)`.\n", "\n", "
\n", "\n", "#### Set how to handle worker-year duplicates\n", "\n", "Use the parameter `i_t_how` to customize how worker-year duplicates are handled.\n", "\n", "#### Collapse at the match-level\n", "\n", "If you drop the `t` column, collapsing will automatically collapse at the match level. However, this prevents conversion to event study format (this can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable).\n", "\n", "#### Avoid unnecessary copies\n", "\n", "If you are working with a large dataset, you will want to avoid copies whenever possible. So set `copy=False`.\n", "\n", "#### Avoid unnecessary sorts\n", "\n", "If you know your data is sorted by `i` and `t` (or, if you aren't including a `t` column, just by `i`), then set `is_sorted=True`.\n", "\n", "#### Avoid complicated loops\n", "\n", "Sometimes workers leave a firm, then return to it (we call these workers *returners*). Returners can cause computational difficulties because sometimes intermediate firms get dropped (e.g. a worker goes from firm $A \\to B \\to A$, but firm $B$ gets dropped). This turns returners into stayers. This can change the largest connected set of firms, and if data is in collapsed format, requires the data to be recollapsed.\n", "\n", "Because of these potential complications, if there are returners, many methods require loops that run until convergence.\n", "\n", "These difficulties can be avoided by setting the parameter `drop_returns` (there are multiple ways to handle returners, they can be seen by running `bpd.clean_params().describe('drop_returns')`).\n", "\n", "
\n", "\n", "Alternative\n", "\n", "Another way to handle returners is to drop the `t` column. Then, sorting will automatically sort by `i` and `j`, which eliminates the possibility of returners. However, this prevents conversion to event study format (this can be bypassed with the `.construct_artificial_time()` method, but the data will likely have a meaningless order, rendering the event study uninterpretable).\n", "\n", "
\n", "\n", "## Advanced clustering\n", "\n", "#### Install Intel(R) Extension for Scikit-learn\n", "\n", "Intel(R) Extension for Scikit-learn ([GitHub](https://github.com/intel/scikit-learn-intelex)) can speed up KMeans clustering.\n", "\n", "## Advanced dataframe handling\n", "\n", "#### Disable logging\n", "\n", "Logging can slow down basic operations on BipartitePandas dataframes (e.g. data cleaning). Set the parameter `log=False` when constructing your dataframe to turn off logging.\n", "\n", "#### Use method chaining with in-place operations\n", "\n", "Unlike standard Pandas, BipartitePandas allows method chaining with in-place operations.\n", "\n", "#### Understand the difference between general columns and subcolumns\n", "\n", "Users interact with general columns, while BipartitePandas dataframes display subcolumns. As an example, for event study format, the columns for firm clusters are labeled `g1` and `g2`. These are the subcolumns for general column `g`. If you want to drop firm clusters from the dataframe, rather than dropping `g1` and `g2` separately, you must drop the general column `g`. This paradigm applies throughout BipartitePandas and the documentation will make clear when you should specify general columns.\n", "\n", "#### Bypass restrictions\n", "\n", "Sometimes BipartitePandas imposes restrictions that you may want to bypass. While BipartitePandas does not provide an explicit way to disable these restrictions, you can bypass them by converting your data into a Pandas dataframe, running the code that was formerly restricted, then converting it back into a BipartitePandas dataframe.\n", "\n", "#### Simpler constructor\n", "\n", "If the columns in your Pandas dataframe are already named correctly, you can simply put the dataframe as a parameter into the BipartitePandas dataframe constructor. 
Here is an example:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "checking required columns and datatypes\n", "sorting rows\n", "dropping NaN observations\n", "generating 'm' column\n", "keeping highest paying job for i-t (worker-year) duplicates (how='max')\n", "dropping workers who leave a firm then return to it (how=False)\n", "making 'i' ids contiguous\n", "making 'j' ids contiguous\n", "computing largest connected set (how=None)\n", "sorting columns\n", "resetting index\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ijytmalphaklpsi
0066-0.88422201-0.96742230-0.348756
1049-2.22497711-0.96742220-0.604585
2049-1.71142921-0.96742220-0.604585
30115-0.08244032-0.967422500.114185
406-1.19054041-0.96742200-1.335178
..............................
4999599991460.505200010.000000720.604585
499969999114-0.095290110.000000520.114185
499979999114-1.240715200.000000520.114185
4999899991140.701339300.000000520.114185
499999999114-0.610246400.000000520.114185
\n", "

50000 rows × 9 columns

\n", "
" ], "text/plain": [ " i j y t m alpha k l psi\n", "0 0 66 -0.884222 0 1 -0.967422 3 0 -0.348756\n", "1 0 49 -2.224977 1 1 -0.967422 2 0 -0.604585\n", "2 0 49 -1.711429 2 1 -0.967422 2 0 -0.604585\n", "3 0 115 -0.082440 3 2 -0.967422 5 0 0.114185\n", "4 0 6 -1.190540 4 1 -0.967422 0 0 -1.335178\n", "... ... ... ... .. .. ... .. .. ...\n", "49995 9999 146 0.505200 0 1 0.000000 7 2 0.604585\n", "49996 9999 114 -0.095290 1 1 0.000000 5 2 0.114185\n", "49997 9999 114 -1.240715 2 0 0.000000 5 2 0.114185\n", "49998 9999 114 0.701339 3 0 0.000000 5 2 0.114185\n", "49999 9999 114 -0.610246 4 0 0.000000 5 2 0.114185\n", "\n", "[50000 rows x 9 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bdf = bpd.BipartiteDataFrame(df).clean()\n", "display(bdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Restore original ids\n", "\n", "To restore original ids, we need to make sure the dataframe is tracking ids as they change.\n", "\n", "We make sure the dataframe tracks ids as they change by setting `track_id_changes=True`.\n", "\n", "Notice that in this example we use `j / 2`, so that `j` will be modified during data cleaning.\n", "\n", "The method `.original_ids()` will then return a dataframe that merges in the original ids." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "checking required columns and datatypes\n", "sorting rows\n", "dropping NaN observations\n", "generating 'm' column\n", "keeping highest paying job for i-t (worker-year) duplicates (how='max')\n", "dropping workers who leave a firm then return to it (how=False)\n", "making 'i' ids contiguous\n", "making 'j' ids contiguous\n", "computing largest connected set (how=None)\n", "sorting columns\n", "resetting index\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ijytmoriginal_j
000-0.8842220133.0
101-2.2249771124.5
201-1.7114292124.5
302-0.0824403257.5
403-1.190540413.0
.....................
4999599991090.5052000173.0
49996999928-0.0952901157.0
49997999928-1.2407152057.0
499989999280.7013393057.0
49999999928-0.6102464057.0
\n", "

50000 rows × 6 columns

\n", "
" ], "text/plain": [ "       i    j         y  t  m  original_j\n", "0      0    0 -0.884222  0  1        33.0\n", "1      0    1 -2.224977  1  1        24.5\n", "2      0    1 -1.711429  2  1        24.5\n", "3      0    2 -0.082440  3  2        57.5\n", "4      0    3 -1.190540  4  1         3.0\n", "...  ...  ...       ... .. ..         ...\n", "49995  9999  109  0.505200  0  1        73.0\n", "49996  9999   28 -0.095290  1  1        57.0\n", "49997  9999   28 -1.240715  2  0        57.0\n", "49998  9999   28  0.701339  3  0        57.0\n", "49999  9999   28 -0.610246  4  0        57.0\n", "\n", "[50000 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bdf_original_ids = bpd.BipartiteDataFrame(\n", "    i=df['i'], j=(df['j'] / 2), y=df['y'], t=df['t'],\n", "    track_id_changes=True\n", ").clean()\n", "display(bdf_original_ids.original_ids())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieve and update custom column properties\n", "\n", "Custom column properties can be retrieved with the method `.get_column_properties()` and updated with the method `.set_column_properties()`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'general_column': 'alpha',\n", " 'subcolumns': 'alpha',\n", " 'dtype': 'float',\n", " 'is_categorical': False,\n", " 'how_collapse': 'mean',\n", " 'long_es_split': True}" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "{'general_column': 'alpha',\n", " 'subcolumns': 'alpha',\n", " 'dtype': 'categorical',\n", " 'is_categorical': True,\n", " 'how_collapse': 'first',\n", " 'long_es_split': True}" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(bdf.get_column_properties('alpha'))\n", "bdf = bdf.set_column_properties(\n", "    'alpha', is_categorical=True, dtype='categorical'\n", ")\n", "display(bdf.get_column_properties('alpha'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compare dataframes\n", "\n", "Dataframes can be compared using the utility function `bpd.util.compare_frames()`."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bpd.util.compare_frames(\n", " bdf, bdf.iloc[:len(bdf) // 2],\n", " size_variable='len', operator='geq'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fill in missing periods as unemployed\n", "\n", "The method `.fill_missing_periods()` (for *Long* format) will fill in rows for missing intermediate periods.\n", "\n", "
\n", "\n", "Hint\n", "\n", "Filling in missing periods is a useful way to make sure that `.collapse()` only collapses over worker-firm spells if they are for consecutive periods.\n", "\n", "</div>
\n", "\n", "In this example, we drop periods 1-3, then fill them in, setting `k` (firm type) to become $-1$ and `l` (worker type) to become its previous value:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "checking required columns and datatypes\n", "sorting rows\n", "dropping NaN observations\n", "generating 'm' column\n", "keeping highest paying job for i-t (worker-year) duplicates (how='max')\n", "dropping workers who leave a firm then return to it (how=False)\n", "making 'i' ids contiguous\n", "making 'j' ids contiguous\n", "making 'alpha' ids contiguous\n", "computing largest connected set (how=None)\n", "sorting columns\n", "resetting index\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ijytmalphaklpsi
0066-0.88422201030-0.348756
10-1<NA>1<NA><NA>-10<NA>
20-1<NA>2<NA><NA>-10<NA>
30-1<NA>3<NA><NA>-10<NA>
406-1.1905441000-1.335178
..............................
4999599991460.5052013720.604585
499969999-1<NA>1<NA><NA>-12<NA>
499979999-1<NA>2<NA><NA>-12<NA>
499989999-1<NA>3<NA><NA>-12<NA>
499999999114-0.610246413520.114185
\n", "

50000 rows × 9 columns

\n", "
" ], "text/plain": [ " i j y t m alpha k l psi\n", "0 0 66 -0.884222 0 1 0 3 0 -0.348756\n", "1 0 -1 1 -1 0 \n", "2 0 -1 2 -1 0 \n", "3 0 -1 3 -1 0 \n", "4 0 6 -1.19054 4 1 0 0 0 -1.335178\n", "... ... ... ... .. ... ... .. .. ...\n", "49995 9999 146 0.5052 0 1 3 7 2 0.604585\n", "49996 9999 -1 1 -1 2 \n", "49997 9999 -1 2 -1 2 \n", "49998 9999 -1 3 -1 2 \n", "49999 9999 114 -0.610246 4 1 3 5 2 0.114185\n", "\n", "[50000 rows x 9 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bdf_missing = bdf[\n", " (bdf['t'] == 0) | (bdf['t'] == 4)\n", "].clean()\n", "bdf_fill_missing = bdf_missing.fill_missing_periods(\n", " {\n", " 'k': -1,\n", " 'l': 'prev'\n", " }\n", ")\n", "display(bdf_fill_missing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extended event studies\n", "\n", "BipartitePandas allows you to use *Long* format data to generate event studies with more than 2 periods.\n", "\n", "You can specify:\n", "\n", "- which column signals a transition (e.g. if `j` is used, a transition is when a worker moves firms)\n", "- which column(s) should be treated as the event study outcome\n", "- how many periods before and after the transition should be considered\n", "- whether the pre- and/or post-trends must be stable, and for which column(s)\n", "\n", "We consider an example where `j` is the transition column, `y` is the outcome column, and with pre- and post-trends of length 2 that are required to be at the same firm. Note that `y_f1` is the first observation after the individual moves firms." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ity_l2y_l1y_f1y_f2
0722.6601791.3764631.9658432.006426
11521.0337091.6057452.2229071.576384
22830.8147402.2910872.5117611.557221
33332.0975490.3340500.7277950.578112
4352-2.503758-0.2811620.2018400.086961
.....................
2543998820.062628-0.286134-0.183206-1.367690
254499913-3.1053530.6883290.432237-0.346683
254599942-0.650689-3.249975-2.090252-2.282304
2546999520.597124-0.1599671.3805502.604956
254799962-0.621562-0.356489-2.629698-1.638295
\n", "

2548 rows × 6 columns

\n", "
" ], "text/plain": [ " i t y_l2 y_l1 y_f1 y_f2\n", "0 7 2 2.660179 1.376463 1.965843 2.006426\n", "1 15 2 1.033709 1.605745 2.222907 1.576384\n", "2 28 3 0.814740 2.291087 2.511761 1.557221\n", "3 33 3 2.097549 0.334050 0.727795 0.578112\n", "4 35 2 -2.503758 -0.281162 0.201840 0.086961\n", "... ... .. ... ... ... ...\n", "2543 9988 2 0.062628 -0.286134 -0.183206 -1.367690\n", "2544 9991 3 -3.105353 0.688329 0.432237 -0.346683\n", "2545 9994 2 -0.650689 -3.249975 -2.090252 -2.282304\n", "2546 9995 2 0.597124 -0.159967 1.380550 2.604956\n", "2547 9996 2 -0.621562 -0.356489 -2.629698 -1.638295\n", "\n", "[2548 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "es_extended = bdf.get_extended_eventstudy(\n", " transition_col='j', outcomes='y',\n", " periods_pre=2, periods_post=2,\n", " stable_pre='j', stable_post='j'\n", ")\n", "display(es_extended)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced data simulation\n", "\n", "For details on all simulation parameters, run `bpd.sim_params().describe_all()`, or search through `bpd.sim_params().keys()` for a particular key, and then run `bpd.sim_params().describe(key)`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 }