{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CRE example" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Add PyTwoWay to system path (do not run this)\n", "# import sys\n", "# sys.path.append('../../..')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import the PyTwoWay package\n", "\n", "Make sure to install it using `pip install pytwoway`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-01-15T23:38:19.123052Z", "start_time": "2021-01-15T23:38:18.565950Z" } }, "outputs": [], "source": [ "import pytwoway as tw\n", "import bipartitepandas as bpd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First, check out parameter options\n", "\n", "Do this by running:\n", "\n", "- CRE - `tw.cre_params().describe_all()`\n", "\n", "- Clustering - `bpd.cluster_params().describe_all()`\n", "\n", "- Cleaning - `bpd.clean_params().describe_all()`\n", "\n", "- Simulating - `bpd.sim_params().describe_all()`\n", "\n", "Alternatively, run `x_params().keys()` to view all the keys for a parameter dictionary, then `x_params().describe(key)` to get a description for a single key." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Second, set parameter choices\n", "\n", "
\n", "\n", "Note\n", "\n", "We set `copy=False` in `clean_params` to avoid unnecessary copies. Be aware that this may modify the original dataframe.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# CRE\n", "cre_params = tw.cre_params()\n", "## Clustering ##\n", "# Use firm-level cdfs of income as our measure\n", "measures = bpd.measures.CDFs()\n", "# Group using k-means\n", "grouping = bpd.grouping.KMeans()\n", "# General clustering\n", "cluster_params = bpd.cluster_params(\n", " {\n", " 'measures': measures,\n", " 'grouping': grouping\n", " }\n", ")\n", "# Cleaning\n", "clean_params = bpd.clean_params(\n", " {\n", " 'connectedness': 'leave_out_spell',\n", " 'collapse_at_connectedness_measure': True,\n", " 'drop_single_stayers': True,\n", " 'drop_returns': 'returners',\n", " 'copy': False\n", " }\n", ")\n", "# Simulating\n", "sim_params = bpd.sim_params(\n", " {\n", " 'n_workers': 1000,\n", " 'firm_size': 5,\n", " 'alpha_sig': 2, 'w_sig': 2,\n", " 'c_sort': 1.5, 'c_netw': 1.5,\n", " 'p_move': 0.1\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Third, extract data (we simulate for the example)\n", "\n", "`BipartitePandas` contains the class `SimBipartite` which we use here to simulate a bipartite network. If you have your own data, you can import it during this step. Load it as a `Pandas DataFrame` and then convert it into a `BipartitePandas DataFrame` in the next step." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sim_data = bpd.SimBipartite(sim_params).simulate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fourth, prepare data\n", "\n", "This is exactly how you should prepare real data prior to running the CRE estimator.\n", "\n", "- First, we convert the data into a `BipartitePandas DataFrame`\n", "\n", "- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). 
This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set `collapse_at_connectedness_measure=True`.\n", "\n", "- Third, we cluster firms by their wage distributions, to generate firm classes (columns `g1` and `g2`). Alternatively, manually set the columns `g1` and `g2` to pre-estimated clusters (but make sure to [add them correctly!](https://tlamadon.github.io/bipartitepandas/notebooks/custom_columns.html#Adding-custom-columns-to-an-instantiated-DataFrame)).\n", "\n", "- Fourth, we convert the data into cross-section format\n", "\n", "Further details on `BipartitePandas` can be found in the package documentation, available [here](https://tlamadon.github.io/bipartitepandas/).\n", "\n", "
\n", "\n", "Note\n", "\n", "Leave-one-out connectedness is not preserved when data is collapsed at the spell/match level. If you set `collapse_at_connectedness_measure=False`, you must therefore clean the data WITHOUT taking the leave-one-out set, then collapse it at the spell/match level, and only then compute the largest leave-one-out connected set.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "checking required columns and datatypes\n", "sorting rows\n", "dropping NaN observations\n", "generating 'm' column\n", "keeping highest paying job for i-t (worker-year) duplicates (how='max')\n", "dropping workers who leave a firm then return to it (how='returners')\n", "making 'i' ids contiguous\n", "making 'j' ids contiguous\n", "computing largest connected set (how=None)\n", "sorting columns\n", "resetting index\n", "checking required columns and datatypes\n", "sorting rows\n", "generating 'm' column\n", "computing largest connected set (how='leave_out_observation')\n", "making 'i' ids contiguous\n", "making 'j' ids contiguous\n", "sorting columns\n", "resetting index\n" ] } ], "source": [ "# Convert into BipartitePandas DataFrame\n", "bdf = bpd.BipartiteDataFrame(sim_data)\n", "# Clean and collapse\n", "bdf = bdf.clean(clean_params)\n", "# Cluster\n", "bdf = bdf.cluster(cluster_params)\n", "# Convert to cross-section format\n", "bdf_cs = bdf.to_eventstudy(is_sorted=True, copy=False).get_cs(copy=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fifth, initialize and run the estimator" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Initialize CRE estimator\n", "cre_estimator = tw.CREEstimator(bdf_cs, cre_params)\n", "# Fit CRE estimator\n", "cre_estimator.fit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finally, investigate the results\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-12-22T21:42:51.498849Z", "start_time": "2020-12-22T21:42:51.489723Z" }, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'var_y': 6.700106866872871,\n", " 'var_bw': 1.5696696168532582,\n", " 'cov_bw': 1.1955111878642923,\n", " 'var_tot': 1.442983307938817,\n", " 'cov_tot': 
1.1647823062712872}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cre_estimator.summary" ] } ], "metadata": { "celltoolbar": "Tags", "hide_input": false, "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "nbsphinx-toctree": { "hidden": true, "maxdepth": 1, "titlesonly": true }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }