{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CRE example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add PyTwoWay to system path (do not run this)\n",
    "# import sys\n",
    "# sys.path.append('../../..')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import the PyTwoWay package\n",
    "\n",
    "Make sure to install it using `pip install pytwoway`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T23:38:19.123052Z",
     "start_time": "2021-01-15T23:38:18.565950Z"
    }
   },
   "outputs": [],
   "source": [
    "import pytwoway as tw\n",
    "import bipartitepandas as bpd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First, check out parameter options\n",
    "\n",
    "Do this by running:\n",
    "\n",
    "- CRE - `tw.cre_params().describe_all()`\n",
    "\n",
    "- Clustering - `bpd.cluster_params().describe_all()`\n",
    "\n",
    "- Cleaning - `bpd.clean_params().describe_all()`\n",
    "\n",
    "- Simulating - `bpd.sim_params().describe_all()`\n",
    "\n",
    "Alternatively, run `x_params().keys()` to view all the keys for a parameter dictionary, then `x_params().describe(key)` to get a description for a single key."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Second, set parameter choices\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "Note\n",
    "\n",
    "We set `copy=False` in `clean_params` to avoid unnecessary copies (although this may modify the original dataframe).\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CRE\n",
    "cre_params = tw.cre_params()\n",
    "## Clustering ##\n",
    "# Use firm-level cdfs of income as our measure\n",
    "measures = bpd.measures.CDFs()\n",
    "# Group using k-means\n",
    "grouping = bpd.grouping.KMeans()\n",
    "# General clustering\n",
    "cluster_params = bpd.cluster_params(\n",
    "    {\n",
    "        'measures': measures,\n",
    "        'grouping': grouping\n",
    "    }\n",
    ")\n",
    "# Cleaning\n",
    "clean_params = bpd.clean_params(\n",
    "    {\n",
    "        'connectedness': 'leave_out_spell',\n",
    "        'collapse_at_connectedness_measure': True,\n",
    "        'drop_single_stayers': True,\n",
    "        'drop_returns': 'returners',\n",
    "        'copy': False\n",
    "    }\n",
    ")\n",
    "# Simulating\n",
    "sim_params = bpd.sim_params(\n",
    "    {\n",
    "        'n_workers': 1000,\n",
    "        'firm_size': 5,\n",
    "        'alpha_sig': 2, 'w_sig': 2,\n",
    "        'c_sort': 1.5, 'c_netw': 1.5,\n",
    "        'p_move': 0.1\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Third, extract data (we simulate for the example)\n",
    "\n",
    "`BipartitePandas` contains the class `SimBipartite` which we use here to simulate a bipartite network. If you have your own data, you can import it during this step. Load it as a `Pandas DataFrame` and then convert it into a `BipartitePandas DataFrame` in the next step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "sim_data = bpd.SimBipartite(sim_params).simulate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fourth, prepare data\n",
    "\n",
    "This is exactly how you should prepare real data prior to running the CRE estimator.\n",
    "\n",
    "- First, we convert the data into a `BipartitePandas DataFrame`\n",
    "\n",
    "- Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, construct the leave-one-out connected set, etc.). This also collapses the data at the worker-firm spell level (taking mean wage over the spell), because we set `collapse_at_connectedness_measure=True`.\n",
    "\n",
    "- Third, we cluster firms by their wage distributions, to generate firm classes (columns `g1` and `g2`). Alternatively, manually set the columns `g1` and `g2` to pre-estimated clusters (but make sure to [add them correctly!](https://tlamadon.github.io/bipartitepandas/notebooks/custom_columns.html#Adding-custom-columns-to-an-instantiated-DataFrame)).\n",
    "\n",
    "- Fourth, we convert the data into cross-section format\n",
    "\n",
    "Further details on `BipartitePandas` can be found in the package documentation, available [here](https://tlamadon.github.io/bipartitepandas/).\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "Note\n",
    "\n",
    "Since leave-one-out connectedness is not maintained after data is collapsed at the spell/match level, if you set `collapse_at_connectedness_measure=False`, then data must be cleaned WITHOUT taking the leave-one-out set, collapsed at the spell/match level, and then finally the largest leave-one-out connected set can be computed.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "checking required columns and datatypes\n",
      "sorting rows\n",
      "dropping NaN observations\n",
      "generating 'm' column\n",
      "keeping highest paying job for i-t (worker-year) duplicates (how='max')\n",
      "dropping workers who leave a firm then return to it (how='returners')\n",
      "making 'i' ids contiguous\n",
      "making 'j' ids contiguous\n",
      "computing largest connected set (how=None)\n",
      "sorting columns\n",
      "resetting index\n",
      "checking required columns and datatypes\n",
      "sorting rows\n",
      "generating 'm' column\n",
      "computing largest connected set (how='leave_out_observation')\n",
      "making 'i' ids contiguous\n",
      "making 'j' ids contiguous\n",
      "sorting columns\n",
      "resetting index\n"
     ]
    }
   ],
   "source": [
    "# Convert into BipartitePandas DataFrame\n",
    "bdf = bpd.BipartiteDataFrame(sim_data)\n",
    "# Clean and collapse\n",
    "bdf = bdf.clean(clean_params)\n",
    "# Cluster\n",
    "bdf = bdf.cluster(cluster_params)\n",
    "# Convert to cross-section format\n",
    "bdf_cs = bdf.to_eventstudy(is_sorted=True, copy=False).get_cs(copy=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fifth, initialize and run the estimator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize CRE estimator\n",
    "cre_estimator = tw.CREEstimator(bdf_cs, cre_params)\n",
    "# Fit CRE estimator\n",
    "cre_estimator.fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finally, investigate the results\n",
    "\n",
    "<!-- Results correspond to:\n",
    "\n",
    "- `var_y`: variance of `y` (income) column\n",
    "- `var_bw`: CRE estimate of variance of `psi` between groups\n",
    "- `cov_bw`: CRE estimate of covariance of `psi` and `alpha` between groups\n",
    "- `var_tot`: CRE estimate of total variance of `psi`\n",
    "- `cov_to`: CRE estimate of total covariance of `psi` and `alpha`\n",
    "\n",
    "where `psi` gives firm effects and `alpha` gives worker effects. -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-12-22T21:42:51.498849Z",
     "start_time": "2020-12-22T21:42:51.489723Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'var_y': 6.700106866872871,\n",
       " 'var_bw': 1.5696696168532582,\n",
       " 'cov_bw': 1.1955111878642923,\n",
       " 'var_tot': 1.442983307938817,\n",
       " 'cov_tot': 1.1647823062712872}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cre_estimator.summary"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3.8.9 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.9"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "nbsphinx-toctree": {
   "hidden": true,
   "maxdepth": 1,
   "titlesonly": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}