BLM example

[1]:
# Add PyTwoWay to system path (do not run this)
# import sys
# sys.path.append('../../..')

Import the PyTwoWay package

Make sure to install it using pip install pytwoway.

[2]:
import numpy as np
import bipartitepandas as bpd
import pytwoway as tw
from pytwoway import constraints as cons
from matplotlib import pyplot as plt

First, check out parameter options

Do this by running:

  • BLM - tw.blm_params().describe_all()

  • Clustering - bpd.cluster_params().describe_all()

  • Cleaning - bpd.clean_params().describe_all()

  • Simulating - bpd.sim_params().describe_all()

Alternatively, run x_params().keys() to view all the keys for a parameter dictionary, then x_params().describe(key) to get a description for a single key.

Second, set parameter choices

[3]:
nl = 3 # Number of worker types
nk = 4 # Number of firm types
blm_params = tw.blm_params({
    'nl': nl, 'nk': nk,
    'a1_mu': 0, 'a1_sig': 1.5, 'a2_mu': 0, 'a2_sig': 1.5,
    's1_low': 0.5, 's1_high': 1.5, 's2_low': 0.5, 's2_high': 1.5
})
cluster_params = bpd.cluster_params({
    'measures': bpd.measures.CDFs(),
    'grouping': bpd.grouping.KMeans(n_clusters=nk),
    'is_sorted': True,
    'copy': False
})
clean_params = bpd.clean_params({
    'drop_returns': 'returners',
    'copy': False
})
sim_params = bpd.sim_params({
    'nl': nl, 'nk': nk,
    'alpha_sig': 1, 'psi_sig': 1, 'w_sig': 0.6,
    'c_sort': 0, 'c_netw': 0, 'c_sig': 1
})

Extract data (we simulate for the example)

BipartitePandas contains the class SimBipartite which we use here to simulate a bipartite network. If you have your own data, you can import it during this step. Load it as a Pandas DataFrame and then convert it into a BipartitePandas DataFrame in the next step.

Note that l gives the true worker type and k gives the true firm type, while alpha gives the true worker effect and psi gives the true firm effect. We will save these columns separately.

The BLM estimator uses the firm types computed via clustering, which are saved in columns g1 and g2.

[4]:
sim_data = bpd.SimBipartite(sim_params).simulate()

Prepare data

This is exactly how you should prepare real data prior to running the BLM estimator.

  • First, we convert the data into a BipartitePandas DataFrame

  • Second, we clean the data (e.g. drop NaN observations, make sure firm and worker ids are contiguous, etc.)

  • Third, we cluster firms by their wage distributions, to generate firm classes (columns g1 and g2). Alternatively, manually set the columns g1 and g2 to pre-estimated clusters (but make sure to add them correctly!).

  • Fourth, we collapse the data at the worker-firm spell level (taking mean wage over the spell)

  • Fifth, we convert the data into event study format

Further details on BipartitePandas can be found in the package documentation, available here.

[5]:
# Convert into BipartitePandas DataFrame
sim_data = bpd.BipartiteDataFrame(sim_data)
# Clean and collapse
sim_data = sim_data.clean(clean_params).collapse(is_sorted=True, copy=False)
# Cluster
sim_data = sim_data.cluster(cluster_params)
# Convert to event study format
sim_data = sim_data.to_eventstudy(is_sorted=True, copy=False)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index

Separate observed and unobserved data

Some of the simulated data is not observed, so we separate out that data during estimation.

[6]:
sim_data_observed = sim_data.loc[:, ['i', 'j1', 'j2', 'y1', 'y2', 't11', 't12', 't21', 't22', 'g1', 'g2', 'w1', 'w2', 'm']]
sim_data_unobserved = sim_data.loc[:, ['alpha1', 'alpha2', 'k1', 'k2', 'l1', 'l2', 'psi1', 'psi2']]

Separate movers and stayers data

We need to distinguish movers and stayers for the estimator.

[7]:
movers = sim_data_observed.get_worker_m(is_sorted=True)
jdata = sim_data_observed.loc[movers, :]
sdata = sim_data_observed.loc[~movers, :]

Check the data

Let’s check the cleaned data.

[8]:
print('Movers data')
display(jdata)
print('Stayers data')
display(sdata)
Movers data
i j1 j2 y1 y2 t11 t12 t21 t22 g1 g2 w1 w2 m
0 0 189 141 0.658481 -0.302737 0 0 1 3 0 3 1 3 1
1 0 141 179 -0.302737 -0.070679 1 3 4 4 3 0 3 1 1
2 1 3 178 -0.977635 -0.138164 0 0 1 4 1 0 1 4 1
3 2 179 9 0.273751 -2.872666 0 2 3 3 0 1 3 1 1
4 2 9 113 -2.872666 -1.395046 3 3 4 4 1 3 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20331 9923 5 18 -2.439280 -2.086203 0 0 1 1 1 1 1 1 1
20332 9923 18 186 -2.086203 0.518828 1 1 2 2 1 0 1 1 1
20333 9923 186 92 0.518828 0.440968 2 2 3 3 0 2 1 1 1
20334 9923 92 14 0.440968 -1.070705 3 3 4 4 2 1 1 1 1
20335 9924 167 42 1.730241 -0.256907 0 0 1 4 0 1 1 4 1

19695 rows × 14 columns

Stayers data
i j1 j2 y1 y2 t11 t12 t21 t22 g1 g2 w1 w2 m
16 9 183 183 1.775003 1.775003 0 4 0 4 0 0 5 5 0
101 55 36 36 -0.154698 -0.154698 0 4 0 4 1 1 5 5 0
127 65 192 192 0.476397 0.476397 0 4 0 4 0 0 5 5 0
131 67 123 123 -0.268389 -0.268389 0 4 0 4 3 3 5 5 0
152 78 15 15 -0.318783 -0.318783 0 4 0 4 1 1 5 5 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20111 9820 177 177 0.865980 0.865980 0 4 0 4 0 0 5 5 0
20155 9840 111 111 1.149045 1.149045 0 4 0 4 3 3 5 5 0
20213 9866 187 187 1.390672 1.390672 0 4 0 4 0 0 5 5 0
20259 9885 17 17 -1.266577 -1.266577 0 4 0 4 1 1 5 5 0
20272 9892 47 47 0.795123 0.795123 0 4 0 4 2 2 5 5 0

641 rows × 14 columns

Initialize and run BLMEstimator

Note

The BLMEstimator class requires data to be formatted as a BipartitePandas DataFrame.

[ ]:
# Initialize BLM estimator
blm_fit = tw.BLMEstimator(blm_params)
# Fit BLM estimator
blm_fit.fit(jdata=jdata, sdata=sdata, n_init=40, n_best=5, ncore=8)
# Store best model
blm_model = blm_fit.model

Check that log-likelihoods are monotonic

[10]:
liks1 = blm_model.liks1

print('Log-likelihoods monotonic (movers):', np.min(np.diff(liks1)) >= 0)

x_axis = range(1, len(liks1))
plt.plot(x_axis, np.diff(liks1), '.', label='liks1', color='red')
plt.plot(x_axis, np.zeros(len(liks1) - 1))
plt.xlabel('Iteration')
plt.ylabel('Change in log-likelihood')
Log-likelihoods monotonic (movers): True
[10]:
Text(0, 0.5, 'Change in log-likelihood')
../_images/notebooks_blm_example_20_2.png
[11]:
liks0 = blm_model.liks0

print('Log-likelihoods monotonic (stayers):', np.min(np.diff(liks0)) >= 0)

x_axis = range(1, len(liks0))
plt.plot(x_axis, np.diff(liks0), '.', label='liks0', color='red')
plt.plot(x_axis, np.zeros(len(liks0) - 1))
plt.xlabel('Iteration')
plt.ylabel('Change in log-likelihood')
Log-likelihoods monotonic (stayers): True
[11]:
Text(0, 0.5, 'Change in log-likelihood')
../_images/notebooks_blm_example_21_2.png

Now we can investigate the results

[12]:
# Plot likelihood vs. connectedness
blm_fit.plot_liks_connectedness()
../_images/notebooks_blm_example_23_0.png
[13]:
blm_fit.plot_log_earnings()
../_images/notebooks_blm_example_24_0.png
[14]:
blm_fit.plot_type_proportions()
../_images/notebooks_blm_example_25_0.png

Finally, we can compare estimates to the truth

[15]:
# Compute true parameters
true_A1 = np.expand_dims(sim_data_unobserved.groupby('l1')['alpha1'].mean().to_numpy(), 1) + np.tile(sim_data_unobserved.groupby('k1')['psi1'].mean().to_numpy(), (nl, 1))
true_A2 = np.expand_dims(sim_data_unobserved.groupby('l2')['alpha2'].mean().to_numpy(), 1) + np.tile(sim_data_unobserved.groupby('k2')['psi2'].mean().to_numpy(), (nl, 1))

# Sort estimated parameters (this is because the firm type order generated by clustering is random - this is automatically handled in the built-in plotting functions)
A1, A2 = blm_model._sort_parameters(blm_model.A1, blm_model.A2, sort_firm_types=True)
[16]:
plt.plot(true_A1.flatten(), A1.flatten(), '.', label='A1', color='red')
plt.plot(true_A2.flatten(), A2.flatten(), '.', label='A2', color='green')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()
../_images/notebooks_blm_example_28_0.png

Bootstrapped errors

The package contains the class BLMBootstrap, which allows for the construction of bootstrapped error bars.

In this example, we use the same simulated data and estimation parameters as in the previous example.

[ ]:
# Initialize BLM bootstrap estimator
blm_fit = tw.BLMBootstrap(blm_params)
# Fit BLM estimator
blm_fit.fit(
    jdata=jdata, sdata=sdata,
    blm_model=blm_model,
    n_samples=20,
    cluster_params=cluster_params,
    ncore=8,
    verbose=False
)

Now we can investigate the results

[18]:
blm_fit.plot_log_earnings()
../_images/notebooks_blm_example_32_0.png
[19]:
blm_fit.plot_type_proportions()
../_images/notebooks_blm_example_33_0.png

Control variables

Hint

Check out how to add custom columns to your BipartitePandas DataFrame here! If you don’t add custom columns properly, they may not be handled during data cleaning and estimation how you want and/or expect!

Simulating data with controls

The package contains functions to simulate data from the BLM dgp. We use this here to see how to use control variables.

In this example, we simulate a categorical, non-stationary, non-worker-type-interaction control variable. We use a low variance to ensure stability of the estimator.

Check out simulation parameter options

Do this by running:

  • Simulating Categorical Controls - tw.sim_categorical_control_params().describe_all()

  • Simulating Continuous Controls - tw.sim_continuous_control_params().describe_all()

Alternatively, run x_params().keys() to view all the keys for a parameter dictionary, then x_params().describe(key) to get a description for a single key.

Set simulation parameter choices

[20]:
n_control = 2 # Number of types for control variable

sim_cat_params = tw.sim_categorical_control_params({
    'n': n_control,
    'stationary_A': False, 'stationary_S': False,
    'worker_type_interaction': False,
    'a1_mu': -0.5, 'a1_sig': 0.25, 'a2_mu': 0.5, 'a2_sig': 0.25,
    's1_low': 0, 's1_high': 0.01, 's2_low': 0, 's2_high': 0.01
})
sim_cts_params = tw.sim_continuous_control_params({
    'stationary_A': True, 'stationary_S': True,
    'worker_type_interaction': True,
    'a1_mu': -0.15, 'a1_sig': 0.05, 'a2_mu': 0.15, 'a2_sig': 0.05,
    's1_low': 0, 's1_high': 0.01, 's2_low': 0, 's2_high': 0.01
})
sim_blm_params = tw.sim_blm_params({
    'nl': nl, 'nk': nk,
    'a1_mu': -2, 'a1_sig': 0.25, 'a2_mu': 2, 'a2_sig': 0.25,
    's1_low': 0, 's1_high': 0.01, 's2_low': 0, 's2_high': 0.01,
    'categorical_controls': {'cat_control': sim_cat_params},
    'continuous_controls': {'cts_control': sim_cts_params}
})

Estimating control variables

PyTwoWay allows for categorical and continuous control variables.

Categorical

To define a categorical control variable, construct an instance of the parameter dictionary tw.categorical_control_params().

Then, inside an instance of blm_params(), link the key 'categorical_controls' to a dictionary linking column names to their associated parameter dictionaries.

Continuous

To define a continuous control variable, construct an instance of the parameter dictionary tw.continuous_control_params().

Then, inside an instance of blm_params(), link the key 'continuous_controls' to a dictionary linking column names to their associated parameter dictionaries.

Constraints

Constraints can be specified for control variables. They are set through the cons_a and cons_s keys for a given control variable’s parameter dictionary, except for NoWorkerTypeInteraction(). Constraint options are: - Linear() - for a fixed firm type, worker types effects must change linearly - Monotonic() - for a fixed firm type, worker types effects must increase (or decrease) monotonically - NoWorkerTypeInteraction() - for a fixed firm type, worker types effects must all be the same - Stationary() - worker-firm pair effects are the same in all periods - StationaryFirmTypeVariation() - firm type induced variation of worker-firm pair effects is the same in all periods. In particular, this is equivalent to setting A2 = (np.mean(A2, axis=1) + A1.T - np.mean(A1, axis=1)).T. - BoundedBelow() - worker-firm pair effects are bounded below - BoundedAbove() - worker-firm pair effects are bounded above

NoWorkerTypeInteraction() is not specified through the cons_a and cons_s keys. Instead, if you want a control variable to interact with the unobserved worker types, this can be specified by setting worker_type_interaction=True in the control variable’s parameter dictionary.

Warning

Be careful when adding constraints to categorical control variables (note that not interacting with worker types requires constraints). Because categorical control variables require normalization, the choice of how to normalize can alter parameter estimates. This may raise a warning if the normalization gets stuck in a cycle of changing how to normalize every loop and the estimator is not converging. The estimator will work if you set force_min_firm_type=True in your BLM parameter dictionary - the warning will also provide a reminder of this option.

Check out estimation parameter options

Do this by running:

  • Estimating Categorical Controls - tw.categorical_control_params().describe_all()

  • Estimating Continuous Controls - tw.continuous_control_params().describe_all()

Alternatively, run x_params().keys() to view all the keys for a parameter dictionary, then x_params().describe(key) to get a description for a single key.

Set estimation parameter choices

[21]:
cat_params = tw.categorical_control_params({
    'n': n_control,
    'worker_type_interaction': False,
    'cons_a': None, 'cons_s': None,
    'a1_mu': -0.5, 'a1_sig': 0.25, 'a2_mu': 0.5, 'a2_sig': 0.25,
    's1_low': 0, 's1_high': 0.01, 's2_low': 0, 's2_high': 0.01
})
cts_params = tw.continuous_control_params({
    'worker_type_interaction': True,
    'cons_a': cons.Stationary(), 'cons_s': cons.Stationary(),
    'a1_mu': -0.15, 'a1_sig': 0.05, 'a2_mu': 0.15, 'a2_sig': 0.05,
    's1_low': 0, 's1_high': 0.01, 's2_low': 0, 's2_high': 0.01
})
blm_params = tw.blm_params({
    'nl': nl, 'nk': nk,
    'a1_mu': -2, 'a1_sig': 0.25, 'a2_mu': 2, 'a2_sig': 0.25,
    's1_low': 0, 's1_high': 0.01, 's2_low': 0, 's2_high': 0.01,
    'categorical_controls': {'cat_control': cat_params},
    'continuous_controls': {'cts_control': cts_params},
    'force_min_firm_type': True
})

Simulate data

sim_data gives a dictionary where the key 'jdata' gives simulated mover data and the key 'sdata' gives simulated stayer data.

sim_params gives a dictionary that links each type of control variable to the simulated parameter values for that type.

[22]:
blm_true = tw.SimBLM(sim_blm_params)
sim_data, sim_params = blm_true.simulate(return_parameters=True)
jdata, sdata = sim_data['jdata'], sim_data['sdata']
[23]:
print('Movers data')
display(jdata)
print('Stayers data')
display(sdata)
print('Simulated parameter values')
display(sim_params)
Movers data
i j1 j2 y1 y2 g1 g2 m cat_control1 cat_control2 cts_control1 cts_control2 l
0 0 0 1 -1.405170 2.884550 0 0 1 0 1 -1.597127 -0.863375 1
1 1 0 1 -2.944559 2.699658 0 0 1 1 1 0.418682 0.439516 1
2 2 1 0 -1.283777 2.810403 0 0 1 0 1 -2.397280 -0.507434 1
3 3 0 1 -2.674142 2.781658 0 0 1 1 1 -1.445132 -0.133028 1
4 4 0 1 -2.810683 2.110046 0 0 1 1 0 -0.464561 1.254681 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
155 155 7 6 -1.076676 1.609433 3 3 1 0 0 -0.801424 1.277516 2
156 156 6 7 -2.718684 1.932711 3 3 1 1 0 0.094787 -0.568200 0
157 157 6 7 -2.753058 1.655190 3 3 1 1 0 0.292111 1.155452 0
158 158 6 7 -2.889319 2.357755 3 3 1 1 1 1.194183 -0.255207 0
159 159 7 6 -1.292902 2.245363 3 3 1 0 0 0.035616 1.193493 1

160 rows × 13 columns

Stayers data
i j1 j2 y1 y2 g1 g2 m cat_control1 cat_control2 cts_control1 cts_control2 l
0 160 0 0 -3.065299 2.939253 0 0 0 1 1 1.280102 -1.276638 1
1 161 1 1 -3.308012 1.882332 0 0 0 1 0 0.673731 0.978745 2
2 162 1 1 -2.048957 2.500295 0 0 0 0 1 0.188291 -0.354548 0
3 163 1 1 -1.647712 3.035064 0 0 0 0 1 0.224408 -1.773956 1
4 164 0 0 -1.753908 2.028478 0 0 0 0 0 -0.307819 0.595909 2
5 165 1 1 -1.838391 3.205352 0 0 0 0 1 -0.034964 -1.680496 2
6 166 0 0 -1.825758 2.394916 0 0 0 0 1 -0.045699 0.869568 2
7 167 0 0 -3.162265 3.055979 0 0 0 1 1 0.174373 -1.324212 2
8 168 0 0 -2.035432 2.368199 0 0 0 0 1 0.227607 0.600801 0
9 169 0 0 -2.933966 2.556052 0 0 0 1 1 0.343777 1.315978 1
10 170 2 2 -1.441650 2.437222 1 1 0 0 1 -0.352137 -0.029504 1
11 171 3 3 -1.681946 2.603666 1 1 0 0 1 0.797169 -0.100233 0
12 172 3 3 -1.446450 2.115750 1 1 0 0 0 -0.856424 -0.023039 0
13 173 3 3 -3.140376 2.507038 1 1 0 1 1 2.209182 0.660209 0
14 174 2 2 -2.743617 2.046139 1 1 0 1 0 0.038795 -0.846242 1
15 175 3 3 -1.253390 1.997722 1 1 0 0 0 -1.620871 -0.369944 1
16 176 3 3 -2.042038 1.745926 1 1 0 1 0 -1.510300 0.903805 2
17 177 3 3 -1.678088 2.148858 1 1 0 0 0 0.876064 -0.305209 0
18 178 2 2 -1.539851 2.806150 1 1 0 0 1 -0.193820 -1.547240 0
19 179 2 2 -1.484411 2.540243 1 1 0 0 1 0.029680 -0.694832 1
20 180 5 5 -1.145073 1.938436 2 2 0 0 1 0.448419 1.062856 2
21 181 5 5 -1.976578 2.167116 2 2 0 0 0 0.051553 0.623657 0
22 182 5 5 -2.248826 2.782804 2 2 0 1 1 -0.069256 -1.654039 2
23 183 4 4 -3.381418 2.342920 2 2 0 1 0 0.835689 -0.528370 0
24 184 5 5 -2.414065 1.897629 2 2 0 1 0 -0.123666 0.956612 1
25 185 4 4 -3.131777 2.884738 2 2 0 1 1 -0.767185 -0.904343 0
26 186 4 4 -3.342341 2.693796 2 2 0 1 1 0.726720 0.368079 0
27 187 5 5 -2.140324 2.530651 2 2 0 0 0 1.183305 -1.663274 0
28 188 4 4 -3.219936 2.662715 2 2 0 1 1 -0.267946 0.635036 0
29 189 4 4 -2.813351 2.204771 2 2 0 1 0 1.858780 -1.423794 2
30 190 6 6 -2.627046 2.341005 3 3 0 1 1 -0.732658 -0.174954 0
31 191 7 7 -1.475745 1.854227 3 3 0 0 0 0.085931 -0.259650 0
32 192 7 7 -1.372760 1.618090 3 3 0 0 0 0.247395 1.038758 2
33 193 6 6 -1.396256 1.879039 3 3 0 0 0 -0.298457 -0.415220 0
34 194 6 6 -1.349778 2.194202 3 3 0 0 1 -0.641476 0.840362 0
35 195 7 7 -1.435867 1.505834 3 3 0 0 0 0.455100 1.541480 2
36 196 6 6 -2.786497 1.745557 3 3 0 1 0 0.631748 0.769990 0
37 197 7 7 -1.450763 1.851792 3 3 0 0 0 -0.047915 -0.048053 0
38 198 6 6 -2.567547 1.749243 3 3 0 1 0 -0.018666 0.676466 2
39 199 7 7 -2.342840 2.728105 3 3 0 1 1 -1.548878 1.056820 1
Simulated parameter values
{'A1': array([[-2.40434291, -1.93230882, -2.3641309 , -1.81931006],
        [-1.99649895, -1.8539811 , -1.54271365, -1.67049293],
        [-2.21882556, -1.59916272, -1.37328134, -1.69013257]]),
 'A2': array([[1.85490342, 1.97296398, 2.15268324, 1.7112059 ],
        [2.15476764, 1.81388297, 1.90973225, 2.26752212],
        [2.0604013 , 1.90970787, 1.65698835, 1.84848259]]),
 'S1': array([[6.24733520e-03, 3.25013934e-04, 9.95248435e-03, 7.88841644e-03],
        [5.06607130e-04, 5.60070089e-05, 5.12086700e-03, 8.94274499e-03],
        [1.15689335e-03, 3.52890499e-03, 3.17502080e-03, 9.56078356e-03]]),
 'S2': array([[0.00106304, 0.00627278, 0.0096796 , 0.00626775],
        [0.00905057, 0.00542163, 0.00412828, 0.00967574],
        [0.00684023, 0.00786226, 0.00925993, 0.00867167]]),
 'pk1': array([[0.22439344, 0.64001715, 0.13558941],
        [0.30791439, 0.16101403, 0.53107158],
        [0.13470115, 0.14758551, 0.71771334],
        [0.15657995, 0.31968694, 0.5237331 ],
        [0.08722402, 0.82393394, 0.08884205],
        [0.81811766, 0.03811832, 0.14376401],
        [0.13887305, 0.03062248, 0.83050447],
        [0.44799947, 0.31514629, 0.23685424],
        [0.42634457, 0.25005179, 0.32360364],
        [0.41872874, 0.15005459, 0.43121667],
        [0.36605   , 0.26067997, 0.37327004],
        [0.38861717, 0.46752657, 0.14385626],
        [0.81362141, 0.10810991, 0.07826868],
        [0.66945685, 0.10447555, 0.2260676 ],
        [0.29082898, 0.26732746, 0.44184356],
        [0.28325089, 0.55597618, 0.16077293]]),
 'pk0': array([[0.14716973, 0.33066942, 0.52216085],
        [0.6236124 , 0.33471458, 0.04167301],
        [0.34365961, 0.23123918, 0.42510121],
        [0.71334724, 0.06543533, 0.22121742]]),
 'A1_cat': {'cat_control': array([ 0.37500106, -0.88596277])},
 'A2_cat': {'cat_control': array([0.12246111, 0.60852562])},
 'S1_cat': {'cat_control': array([0.00785542, 0.00410917])},
 'S2_cat': {'cat_control': array([0.00749964, 0.00787043])},
 'A1_cts': {'cts_control': array([-0.14213511, -0.14125649, -0.30366185])},
 'A2_cts': {'cts_control': array([-0.14213511, -0.14125649, -0.30366185])},
 'S1_cts': {'cts_control': array([0.00774112, 0.00157575, 0.00477038])},
 'S2_cts': {'cts_control': array([0.00774112, 0.00157575, 0.00477038])}}

Initialize and run BLMEstimator

[ ]:
# Initialize BLM estimator
blm_fit = tw.BLMEstimator(blm_params)
# Fit BLM estimator
blm_fit.fit(jdata=jdata, sdata=sdata, n_init=80, n_best=5, ncore=8)
# Store best model
blm_model = blm_fit.model

Check that log-likelihoods are monotonic

[25]:
liks1 = blm_model.liks1[1:]

print('Log-likelihoods monotonic (movers):', np.min(np.diff(liks1)) >= 0)

x_axis = range(1, len(liks1))
plt.plot(x_axis, np.diff(liks1), '.', label='liks1', color='red')
plt.plot(x_axis, np.zeros(len(liks1) - 1))
plt.xlabel('Iteration')
plt.ylabel('Change in log-likelihood')
Log-likelihoods monotonic (movers): True
[25]:
Text(0, 0.5, 'Change in log-likelihood')
../_images/notebooks_blm_example_46_2.png
[26]:
liks0 = blm_model.liks0

print('Log-likelihoods monotonic (stayers):', np.min(np.diff(liks0)) >= 0)

x_axis = range(1, len(liks0))
plt.plot(x_axis, np.diff(liks0), '.', label='liks0', color='red')
plt.plot(x_axis, np.zeros(len(liks0) - 1))
plt.xlabel('Iteration')
plt.ylabel('Change in log-likelihood')
Log-likelihoods monotonic (stayers): True
[26]:
Text(0, 0.5, 'Change in log-likelihood')
../_images/notebooks_blm_example_47_2.png

Now we can investigate the results

[27]:
# Plot likelihood vs. connectedness
blm_fit.plot_liks_connectedness()
../_images/notebooks_blm_example_49_0.png
[28]:
blm_fit.plot_log_earnings()
../_images/notebooks_blm_example_50_0.png
[29]:
blm_fit.plot_type_proportions()
../_images/notebooks_blm_example_51_0.png

Finally, we can compare estimates to the truth

[30]:
plt.plot(sim_params['A1'].flatten(), blm_model.A1.flatten(), '.', label='A1', color='red')
plt.plot(sim_params['A2'].flatten(), blm_model.A2.flatten(), '.', label='A2', color='green')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(sim_params['S1'].flatten(), blm_model.S1.flatten(), '.', label='S1', color='red')
plt.plot(sim_params['S2'].flatten(), blm_model.S2.flatten(), '.', label='S2', color='green')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(sim_params['A1_cat']['cat_control'].flatten(), blm_model.A1_cat['cat_control'].flatten(), '.', label='A1_cat', color='red')
plt.plot(sim_params['A2_cat']['cat_control'].flatten(), blm_model.A2_cat['cat_control'].flatten(), '.', label='A2_cat', color='green')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(sim_params['S1_cat']['cat_control'].flatten(), blm_model.S1_cat['cat_control'].flatten(), '.', label='S1_cat', color='red')
plt.plot(sim_params['S2_cat']['cat_control'].flatten(), blm_model.S2_cat['cat_control'].flatten(), '.', label='S2_cat', color='green')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(sim_params['A1_cts']['cts_control'].flatten(), blm_model.A1_cts['cts_control'].flatten(), '.', label='A1_cts', color='red')
plt.plot(sim_params['A2_cts']['cts_control'].flatten(), blm_model.A2_cts['cts_control'].flatten(), '.', label='A2_cts', color='green')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(sim_params['S1_cts']['cts_control'].flatten(), blm_model.S1_cts['cts_control'].flatten(), '.', label='S1_cts', color='red')
plt.plot(sim_params['S2_cts']['cts_control'].flatten(), blm_model.S2_cts['cts_control'].flatten(), '.', label='S2_cts', color='green')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(sim_params['pk1'].flatten(), blm_model.pk1.flatten(), '.', label='pk1', color='red')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(sim_params['pk0'].flatten(), blm_model.pk0.flatten(), '.', label='pk0', color='red')
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()
../_images/notebooks_blm_example_53_0.png
../_images/notebooks_blm_example_53_1.png
../_images/notebooks_blm_example_53_2.png
../_images/notebooks_blm_example_53_3.png
../_images/notebooks_blm_example_53_4.png
../_images/notebooks_blm_example_53_5.png
../_images/notebooks_blm_example_53_6.png
../_images/notebooks_blm_example_53_7.png

Warning

The parameters are identified only up to a constant intercept.

Because of this, to maintain consistency between estimations, PyTwoWay automatically normalizes worker-firm types to have effect 0, depending on the control variables used (normalization only occurs with categorical control variables - for a stationary control variable, only the first period is normalized (if it is non-stationary, both periods are normalized); for a control variable that does not interact with worker types, only the lowest worker type is normalized (if it does interact with worker types, all worker types are normalized)).

If we take the sum over the estimators we see the model performs well.

[31]:
## First period ##
plt.plot(
    (sim_params['A1'] + sim_params['A1_cat']['cat_control'][0]).flatten(),
    (blm_model.A1 + blm_model.A1_cat['cat_control'][0]).flatten(),
    '.', label='A1 + A1_cat[0]', color='red'
)
plt.plot(
    (sim_params['A1'] + sim_params['A1_cat']['cat_control'][1]).flatten(),
    (blm_model.A1 + blm_model.A1_cat['cat_control'][1]).flatten(),
    '.', label='A1 + A1_cat[1]', color='green'
)
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(
    np.sqrt((sim_params['S1'] ** 2 + sim_params['S1_cat']['cat_control'][0] ** 2).flatten()),
    np.sqrt((blm_model.S1 ** 2 + blm_model.S1_cat['cat_control'][0] ** 2).flatten()),
    '.', label='sqrt(S1^2 + S1_cat[0]^2)', color='red'
)
plt.plot(
    np.sqrt((sim_params['S1'] ** 2 + sim_params['S1_cat']['cat_control'][1] ** 2).flatten()),
    np.sqrt((blm_model.S1 ** 2 + blm_model.S1_cat['cat_control'][1] ** 2).flatten()),
    '.', label='sqrt(S1^2 + S1_cat[1]^2)', color='green'
)
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

## Second period ##
plt.plot(
    (sim_params['A2'] + sim_params['A2_cat']['cat_control'][0]).flatten(),
    (blm_model.A2 + blm_model.A2_cat['cat_control'][0]).flatten(),
    '.', label='A2 + A2_cat[0]', color='red'
)
plt.plot(
    (sim_params['A2'] + sim_params['A2_cat']['cat_control'][1]).flatten(),
    (blm_model.A2 + blm_model.A2_cat['cat_control'][1]).flatten(),
    '.', label='A2 + A2_cat[1]', color='green'
)
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()

plt.plot(
    np.sqrt((sim_params['S2'] ** 2 + sim_params['S2_cat']['cat_control'][0] ** 2).flatten()),
    np.sqrt((blm_model.S2 ** 2 + blm_model.S2_cat['cat_control'][0] ** 2).flatten()),
    '.', label='sqrt(S2^2 + S2_cat[0]^2)', color='red'
)
plt.plot(
    np.sqrt((sim_params['S2'] ** 2 + sim_params['S2_cat']['cat_control'][1] ** 2).flatten()),
    np.sqrt((blm_model.S2 ** 2 + blm_model.S2_cat['cat_control'][1] ** 2).flatten()),
    '.', label='sqrt(S2^2 + S2_cat[1]^2)', color='green'
)
plt.xlabel('Truth')
plt.ylabel('Estimate')
plt.legend()
plt.grid()
plt.show()
../_images/notebooks_blm_example_55_0.png
../_images/notebooks_blm_example_55_1.png
../_images/notebooks_blm_example_55_2.png
../_images/notebooks_blm_example_55_3.png

Variance decomposition

The package contains the class BLMVarianceDecomposition, which allows for the estimation of the decomposition of the variance components of the BLM estimator.

In this example, we use the same simulated data and estimation parameters as in the previous example.

[ ]:
# Initialize BLM variance decomposition estimator
blm_fit = tw.BLMVarianceDecomposition(blm_params)
# Fit BLM estimator
blm_fit.fit(
    jdata=jdata, sdata=sdata,
    blm_model=blm_model,
    n_samples=20,
    Q_var=[
        tw.Q.VarCovariate('psi'),
        tw.Q.VarCovariate('cat_control'),
        tw.Q.VarCovariate('cts_control')
    ],
    ncore=1
)

Investigate the results

Results correspond to:

  • y: income (outcome) column

  • eps: residual

  • psi: firm effects

  • alpha: worker effects

Note

Results from all estimations are stored in the class attribute dictionary .res. We take the mean, but storing all results gives the option to estimate different sample statistics.

Note

The particular variance that is estimated is controlled through the parameter 'Q_var' and the covariance that is estimated is controlled through the parameter 'Q_cov'.

By default, the variance is var(psi) and the covariance is cov(psi, alpha). The default estimates don’t include var(alpha), but if you don’t include controls, var(alpha) can be computed as the residual from var(y) = var(psi) + var(alpha) + 2 * cov(psi, alpha) + var(eps).

[33]:
for k, v in blm_fit.res.items():
    print(f'{k!r}: {v.mean()}')
'var(y)': 4.1784778224072054
'var(eps)': 3.862767039874967
'var(cat_control)': 0.17504851379038694
'var(cts_control)': 0.03529768004099764
'var(psi)': 0.052474105201016495
'cov(psi, alpha)': -0.0014679055738114212