{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simple example\n", "\n", "## Import the BipartitePandas package\n", "\n", "Install it first with `pip install bipartitepandas`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import bipartitepandas as bpd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get your data ready\n", "\n", "For this notebook, we simulate data, setting `firm_size` and `p_move` so that data cleaning is non-trivial." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>i</th><th>j</th><th>y</th><th>t</th><th>l</th><th>k</th><th>alpha</th><th>psi</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>416</td><td>-0.935824</td><td>0</td><td>2</td><td>4</td><td>0.000000</td><td>-0.114185</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>416</td><td>0.903535</td><td>1</td><td>2</td><td>4</td><td>0.000000</td><td>-0.114185</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>416</td><td>-0.466674</td><td>2</td><td>2</td><td>4</td><td>0.000000</td><td>-0.114185</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>416</td><td>0.163563</td><td>3</td><td>2</td><td>4</td><td>0.000000</td><td>-0.114185</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>416</td><td>0.602699</td><td>4</td><td>2</td><td>4</td><td>0.000000</td><td>-0.114185</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>49995</th><td>9999</td><td>344</td><td>-0.781453</td><td>0</td><td>1</td><td>3</td><td>-0.430727</td><td>-0.348756</td></tr>\n",
"    <tr><th>49996</th><td>9999</td><td>344</td><td>0.008461</td><td>1</td><td>1</td><td>3</td><td>-0.430727</td><td>-0.348756</td></tr>\n",
"    <tr><th>49997</th><td>9999</td><td>344</td><td>-0.959677</td><td>2</td><td>1</td><td>3</td><td>-0.430727</td><td>-0.348756</td></tr>\n",
"    <tr><th>49998</th><td>9999</td><td>344</td><td>0.068173</td><td>3</td><td>1</td><td>3</td><td>-0.430727</td><td>-0.348756</td></tr>\n",
"    <tr><th>49999</th><td>9999</td><td>344</td><td>-0.733225</td><td>4</td><td>1</td><td>3</td><td>-0.430727</td><td>-0.348756</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>50000 rows × 8 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t l k alpha psi\n", "0 0 416 -0.935824 0 2 4 0.000000 -0.114185\n", "1 0 416 0.903535 1 2 4 0.000000 -0.114185\n", "2 0 416 -0.466674 2 2 4 0.000000 -0.114185\n", "3 0 416 0.163563 3 2 4 0.000000 -0.114185\n", "4 0 416 0.602699 4 2 4 0.000000 -0.114185\n", "... ... ... ... .. .. .. ... ...\n", "49995 9999 344 -0.781453 0 1 3 -0.430727 -0.348756\n", "49996 9999 344 0.008461 1 1 3 -0.430727 -0.348756\n", "49997 9999 344 -0.959677 2 1 3 -0.430727 -0.348756\n", "49998 9999 344 0.068173 3 1 3 -0.430727 -0.348756\n", "49999 9999 344 -0.733225 4 1 3 -0.430727 -0.348756\n", "\n", "[50000 rows x 8 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = bpd.SimBipartite(\n", " bpd.sim_params(\n", " {\n", " 'firm_size': 10,\n", " 'p_move': 0.05\n", " }\n", " )\n", ").simulate()\n", "display(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Columns\n", "\n", "BipartitePandas includes seven pre-defined general columns:\n", "\n", "#### Required\n", "- `i`: worker id (any type)\n", "- `j`: firm id (any type)\n", "- `y`: income (float or int)\n", "\n", "#### Optional\n", "- `t`: time (int)\n", "- `g`: firm type (any type)\n", "- `w`: weight (float or int)\n", "- `m`: move indicator (int)\n", "\n", "## Constructing DataFrames\n", "\n", "How do we construct a dataframe? Just use the required columns (plus any optional columns you want to include)!" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>i</th><th>j</th><th>y</th><th>t</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>416</td><td>-0.935824</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>416</td><td>0.903535</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>416</td><td>-0.466674</td><td>2</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>416</td><td>0.163563</td><td>3</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>416</td><td>0.602699</td><td>4</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>49995</th><td>9999</td><td>344</td><td>-0.781453</td><td>0</td></tr>\n",
"    <tr><th>49996</th><td>9999</td><td>344</td><td>0.008461</td><td>1</td></tr>\n",
"    <tr><th>49997</th><td>9999</td><td>344</td><td>-0.959677</td><td>2</td></tr>\n",
"    <tr><th>49998</th><td>9999</td><td>344</td><td>0.068173</td><td>3</td></tr>\n",
"    <tr><th>49999</th><td>9999</td><td>344</td><td>-0.733225</td><td>4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>50000 rows × 4 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t\n", "0 0 416 -0.935824 0\n", "1 0 416 0.903535 1\n", "2 0 416 -0.466674 2\n", "3 0 416 0.163563 3\n", "4 0 416 0.602699 4\n", "... ... ... ... ..\n", "49995 9999 344 -0.781453 0\n", "49996 9999 344 0.008461 1\n", "49997 9999 344 -0.959677 2\n", "49998 9999 344 0.068173 3\n", "49999 9999 344 -0.733225 4\n", "\n", "[50000 rows x 4 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bdf = bpd.BipartiteDataFrame(\n", " i=df['i'], j=df['j'], y=df['y'], t=df['t']\n", ")\n", "display(bdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now that we have our dataframe, let's check out some summary statistics" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "format: 'BipartiteLong'\n", "number of workers: 10000\n", "number of firms: 999\n", "number of observations: 50000\n", "mean wage: 0.008291372114450858\n", "median wage: 0.010194526935672465\n", "min wage: -5.6066321450713374\n", "max wage: 6.086789605058163\n", "var(wage): 2.64936416420792\n", "no NaN values: False\n", "no duplicates: False\n", "i-t (worker-year) observations unique (None if t column(s) not included): False\n", "no returns (None if not yet computed): None\n", "contiguous 'i' ids (None if not included): False\n", "contiguous 'j' ids (None if not included): False\n", "contiguous 'g' ids (None if not included): None\n", "connectedness (None if ignoring connectedness): None\n" ] } ], "source": [ "bdf.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's clean our data - and make sure the result is leave-one-observation-out connected\n", "\n", "
\n", "**Hint:** Want details on all cleaning parameters? Run `bpd.clean_params().describe_all()`. To look up a single parameter, find its key in `bpd.clean_params().keys()` and then run `bpd.clean_params().describe(key)`.\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "checking required columns and datatypes\n", "sorting rows\n", "dropping NaN observations\n", "generating 'm' column\n", "keeping highest paying job for i-t (worker-year) duplicates (how='max')\n", "dropping workers who leave a firm then return to it (how=False)\n", "making 'i' ids contiguous\n", "making 'j' ids contiguous\n", "computing largest connected set (how='leave_out_observation')\n", "making 'i' ids contiguous\n", "making 'j' ids contiguous\n", "sorting columns\n", "resetting index\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>i</th><th>j</th><th>y</th><th>t</th><th>m</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>0</td><td>1.306939</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>0</td><td>-0.005591</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>0</td><td>-0.192813</td><td>2</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>0</td><td>2.537212</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>0</td><td>1.756664</td><td>4</td><td>0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>44576</th><td>8962</td><td>797</td><td>-0.781453</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>44577</th><td>8962</td><td>797</td><td>0.008461</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>44578</th><td>8962</td><td>797</td><td>-0.959677</td><td>2</td><td>0</td></tr>\n",
"    <tr><th>44579</th><td>8962</td><td>797</td><td>0.068173</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>44580</th><td>8962</td><td>797</td><td>-0.733225</td><td>4</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>44581 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " i j y t m\n", "0 0 0 1.306939 0 0\n", "1 0 0 -0.005591 1 0\n", "2 0 0 -0.192813 2 0\n", "3 0 0 2.537212 3 0\n", "4 0 0 1.756664 4 0\n", "... ... ... ... .. ..\n", "44576 8962 797 -0.781453 0 0\n", "44577 8962 797 0.008461 1 0\n", "44578 8962 797 -0.959677 2 0\n", "44579 8962 797 0.068173 3 0\n", "44580 8962 797 -0.733225 4 0\n", "\n", "[44581 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bdf = bdf.clean(\n", " bpd.clean_params(\n", " {\n", " 'connectedness': 'leave_out_observation'\n", " }\n", " )\n", ")\n", "display(bdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check how the summary statistics changed:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "format: 'BipartiteLong'\n", "number of workers: 8963\n", "number of firms: 874\n", "number of observations: 44581\n", "mean wage: 0.0043933052888950955\n", "median wage: 0.005410443676386101\n", "min wage: -5.6066321450713374\n", "max wage: 6.086789605058163\n", "var(wage): 2.658089932617942\n", "no NaN values: True\n", "no duplicates: True\n", "i-t (worker-year) observations unique (None if t column(s) not included): True\n", "no returns (None if not yet computed): True\n", "contiguous 'i' ids (None if not included): True\n", "contiguous 'j' ids (None if not included): True\n", "contiguous 'g' ids (None if not included): None\n", "connectedness (None if ignoring connectedness): 'leave_out_observation'\n" ] } ], "source": [ "bdf.summary()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 }