BipartiteBase class

class bipartitepandas.bipartitebase.BipartiteBase(*args, columns_req=None, columns_opt=None, columns_contig=None, col_reference_dict=None, col_dtype_dict=None, col_collapse_dict=None, col_long_es_dict=None, track_id_changes=False, log=False, **kwargs)

Bases: DataFrame

Base class for BipartitePandas, where BipartitePandas gives a bipartite network of firms and workers. Contains generalized methods. Inherits from DataFrame.

Parameters
  • *args – arguments for Pandas DataFrame

  • columns_req (list or None) – required columns (only put general column names for joint columns, e.g. put ‘j’ instead of ‘j1’, ‘j2’; then put the joint columns in col_reference_dict); None is equivalent to []

  • columns_opt (list or None) – optional columns (only put general column names for joint columns, e.g. put ‘g’ instead of ‘g1’, ‘g2’; then put the joint columns in col_reference_dict); None is equivalent to []

  • columns_contig (dict or None) – columns of categorical ids linked to boolean of whether those ids are contiguous, or None if column(s) not included, e.g. {‘i’: False, ‘j’: False, ‘g’: None} (only put general column names for joint columns); None is equivalent to {}

  • col_reference_dict (dict or None) – clarify which joint columns are associated with a general column name, e.g. {‘i’: ‘i’, ‘j’: [‘j1’, ‘j2’]}; None is equivalent to {}

  • col_dtype_dict (dict or None) – link column to datatype, e.g. {‘m’: ‘int’}; None is equivalent to {}

  • col_collapse_dict (dict or None) – how to collapse column (None indicates the column should be dropped), e.g. {‘y’: ‘mean’}; None is equivalent to {}

  • col_long_es_dict (dict or None) – whether each column should split into two when converting from long to event study (None indicates the column should be dropped), e.g. {‘y’: True, ‘m’: None}; None is equivalent to {}

  • track_id_changes (bool) – if True, create dictionary of Pandas dataframes linking original categorical id values to updated contiguous id values

  • log (bool) – if True, will create log file(s)

  • **kwargs – keyword arguments for Pandas DataFrame
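The relationship between general column names and joint columns can be sketched in plain Python. This is only an illustration of the convention described above, not the library's implementation; the helper name expand_columns is hypothetical.

```python
# Hypothetical helper, not part of bipartitepandas: expand general column
# names into the concrete columns a dataframe is expected to contain.
def expand_columns(general_names, col_reference_dict):
    expanded = []
    for name in general_names:
        # Fall back to the general name itself if no reference is given
        subcols = col_reference_dict.get(name, name)
        if isinstance(subcols, str):
            subcols = [subcols]
        expanded.extend(subcols)
    return expanded

columns_req = ['i', 'j', 'y']
col_reference_dict = {'i': 'i', 'j': ['j1', 'j2'], 'y': ['y1', 'y2']}

required = expand_columns(columns_req, col_reference_dict)
print(required)  # ['i', 'j1', 'j2', 'y1', 'y2']
```

Listing only ‘j’ in columns_req keeps the requirement independent of how many joint columns a given format uses.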

add_column(col_name, col_data=None, col_reference=None, is_categorical=False, dtype='any', how_collapse='first', long_es_split=True, copy=True)

Safe method for adding custom columns. Columns added with this method will be compatible with conversions between long, collapsed long, event study, and collapsed event study formats.

Parameters
  • col_name (str) – general column name

  • col_data (NumPy Array or Pandas Series or list of (NumPy Array or Pandas Series) or None) – data for column, or list of data for columns; set to None if columns already added to dataframe via column assignment

  • col_reference (str or list of str or None) – if the column has multiple subcolumns (e.g. firm ids are associated with the columns [‘j1’, ‘j2’]), this must be specified; if None, defaults to the column name, plus a column number if more than one column is listed (e.g. firm ids are associated with the column ‘j’ if one column is included, or [‘j1’, ‘j2’] if two columns are included)

  • is_categorical (bool) – if True, column is categorical

  • dtype (str) – column datatype, must be one of ‘int’, ‘float’, ‘any’, or ‘categorical’

  • how_collapse (function or str or None) – how to collapse data at the worker-firm spell level, must be a valid input for Pandas groupby; if None, column will be dropped during collapse/uncollapse

  • long_es_split (bool or None) – if True, column should split into two when converting from long to event study; if None, column will be dropped when converting between (collapsed) long and (collapsed) event study

  • copy (bool) – if False, avoid copy

Returns

dataframe with new column(s)

Return type

(BipartiteBase)
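To make the how_collapse option more concrete, the following is a minimal sketch of collapsing a column at the worker-firm spell level. It assumes a spell is a run of consecutive observations for the same worker at the same firm, and it uses plain tuples rather than a BipartiteBase dataframe; collapse_spells and collapse_fns are hypothetical names.

```python
from itertools import groupby
from statistics import mean

# Long-format rows (i, j, y), assumed sorted by worker and time
rows = [
    (0, 'A', 1.0),
    (0, 'A', 3.0),
    (0, 'B', 5.0),
    (1, 'B', 2.0),
]

# Hypothetical aggregators standing in for how_collapse values
collapse_fns = {'first': lambda ys: ys[0], 'mean': mean}

def collapse_spells(rows, how_collapse):
    agg = collapse_fns[how_collapse]
    collapsed = []
    # Consecutive rows with the same (worker, firm) form one spell
    for (i, j), spell in groupby(rows, key=lambda r: (r[0], r[1])):
        ys = [r[2] for r in spell]
        collapsed.append((i, j, agg(ys)))
    return collapsed

print(collapse_spells(rows, 'mean'))   # [(0, 'A', 2.0), (0, 'B', 5.0), (1, 'B', 2.0)]
print(collapse_spells(rows, 'first'))  # [(0, 'A', 1.0), (0, 'B', 5.0), (1, 'B', 2.0)]
```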

cluster(params=None, rng=None)

Cluster data and assign a new column giving the cluster for each firm.

Parameters
  • params (ParamsDict or None) – dictionary of parameters for clustering. Run bpd.cluster_params().describe_all() for descriptions of all valid parameters. None is equivalent to bpd.cluster_params().

  • rng (np.random.Generator or None) – NumPy random number generator; None is equivalent to np.random.default_rng(None)

Returns

if silhouette=False, return dataframe with clusters; if silhouette=True, return tuple where first element is dataframe with clusters and second element is NumPy Array of each firm’s silhouette score

Return type

(BipartiteBase or tuple of (BipartiteBase, NumPy Array))

copy(deep=True)

Return copy of self.

Parameters

deep (bool) – make a deep copy, including a copy of the data and the indices. If False, neither the indices nor the data are copied.

Returns

copy of dataframe

Return type

(BipartiteBase)

diagnostic()

Run diagnostic and print diagnostic report.

drop(labels, axis=0, inplace=False, allow_optional=False, allow_required=False, **kwargs)

Drop labels along axis.

Parameters
  • labels (int or str, optionally as a list) – row(s) or column(s) to drop. For columns, use general column names for joint columns, e.g. put ‘g’ instead of ‘g1’, ‘g2’. Only user-added columns may be dropped, unless allow_optional or allow_required is set to True.

  • axis (int or str) – whether to drop labels from the ‘index’ (0) or ‘columns’ (1)

  • inplace (bool) – if True, modify in-place

  • allow_optional (bool) – if True, allow optional columns to be dropped

  • allow_required (bool) – if True, allow required columns to be dropped

  • **kwargs – keyword arguments for Pandas drop

Returns

dataframe without dropped labels

Return type

(BipartiteBase)

drop_rows(rows, drop_returns_to_stays=False, is_sorted=False, reset_index=True, copy=True)

Drop particular rows.

Parameters
  • rows (list) – rows to drop

  • drop_returns_to_stays (bool) – If True, when recollapsing collapsed data, drop observations that need to be recollapsed instead of collapsing (this is for computational efficiency when re-collapsing data for leave-one-out connected components, where intermediate observations can be dropped, causing a worker who returns to a firm to become a stayer)

  • is_sorted (bool) – if False, dataframe will be sorted by i (and t, if included). Returned dataframe is not guaranteed to be sorted if original dataframe is not sorted for long and collapsed long formats, but is guaranteed to be sorted for event study and collapsed event study formats. Sorting may alter original dataframe if copy is set to False. Set is_sorted to True if dataframe is already sorted.

  • reset_index (bool) – if True, reset index at end

  • copy (bool) – if False, avoid copy

Returns

dataframe with given rows dropped

Return type

(BipartiteBase)

get_column_properties(col_name)

Return dictionary linking properties to their value for a particular column.

Parameters

col_name (str) – general column name whose properties will be returned

Returns

dictionary linking properties to their value for a particular column (‘general_column’: general column name; ‘subcolumns’: subcolumns linked to general column; ‘dtype’: column datatype; ‘is_categorical’: column is categorical; ‘how_collapse’: how to collapse at the worker-firm spell level (None if dropped during collapse); ‘long_es_split’: whether column should split into two columns when converting between long and event study formats (None if dropped during conversion))

Return type

(dict)
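As an illustration of the shape of the returned dictionary, the sketch below builds one by hand. The keys follow the description above; the values are made up for a firm id column and are not output from the library.

```python
# Illustrative values only; the keys follow the description above
properties = {
    'general_column': 'j',
    'subcolumns': ['j1', 'j2'],
    'dtype': 'categorical',
    'is_categorical': True,
    'how_collapse': 'first',
    'long_es_split': True,
}

for key, value in properties.items():
    print(f'{key}: {value}')
```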

log(message, level='info')

Log a message at the specified level.

Parameters
  • message (str) – message to log

  • level (str) – logger level. Options, in increasing severity, are ‘debug’, ‘info’, ‘warning’, ‘error’, and ‘critical’.
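
The level names match the standard Python logging levels. A minimal sketch of dispatching on the level string, assuming a standard-library logger (the logger name and the helper below are hypothetical, not the library's code):

```python
import logging

logger = logging.getLogger('bipartite_sketch')  # hypothetical logger name

VALID_LEVELS = ('debug', 'info', 'warning', 'error', 'critical')

def log(message, level='info'):
    if level not in VALID_LEVELS:
        raise ValueError(f'invalid level: {level!r}')
    # Each level name matches a method on logging.Logger
    getattr(logger, level)(message)

# Numeric severities from the logging module, in the order listed above
severities = [getattr(logging, level.upper()) for level in VALID_LEVELS]
print(severities)  # [10, 20, 30, 40, 50]

log('example message', level='debug')
```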

log_on(on=True)

Toggle logger on or off.

Parameters

on (bool) – if True, turn logger on; if False, turn logger off

merge(*args, **kwargs)

Merge two BipartiteBase objects.

Parameters
  • *args – arguments for Pandas merge

  • **kwargs – keyword arguments for Pandas merge

Returns

merged dataframe

Return type

(BipartiteBase)

min_movers_firms(threshold=15, is_sorted=False, copy=True)

List firms with at least threshold movers.

Parameters
  • threshold (int) – minimum number of movers required to keep a firm

  • is_sorted (bool) – used for event study format. If False, dataframe will be sorted by i (and t, if included). Sorting may alter original dataframe if copy is set to False. Set is_sorted to True if dataframe is already sorted.

  • copy (bool) – used for event study format. If False, avoid copy.

Returns

firms with sufficiently many movers

Return type

(NumPy Array)
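The idea behind the threshold can be sketched in plain Python. This sketch assumes that a mover is a worker observed at more than one firm and that each mover counts once per firm; the library's exact definition may differ, and firms_with_min_movers is a hypothetical name.

```python
from collections import defaultdict

# Long-format observations as (worker id, firm id)
observations = [
    (0, 'A'), (0, 'B'),
    (1, 'A'), (1, 'A'),
    (2, 'B'), (2, 'C'),
    (3, 'A'), (3, 'B'),
]

def firms_with_min_movers(observations, threshold):
    firms_by_worker = defaultdict(set)
    for i, j in observations:
        firms_by_worker[i].add(j)
    # Assumption: a mover is a worker observed at more than one firm
    movers_by_firm = defaultdict(set)
    for i, firms in firms_by_worker.items():
        if len(firms) > 1:
            for j in firms:
                movers_by_firm[j].add(i)
    return sorted(j for j, movers in movers_by_firm.items() if len(movers) >= threshold)

print(firms_with_min_movers(observations, threshold=2))  # ['A', 'B']
print(firms_with_min_movers(observations, threshold=3))  # ['B']
```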

n_clusters()

Get the number of unique clusters.

Returns

number of unique clusters if cluster column included; None otherwise

Return type

(int or None)

n_firms()

Get the number of unique firms.

Returns

number of unique firms

Return type

(int)

n_unique_ids(id_col)

Number of unique ids in column.

Parameters

id_col (str) – column to check ids (‘i’, ‘j’, or ‘g’). Use general column names for joint columns, e.g. put ‘j’ instead of ‘j1’, ‘j2’.

Returns

number of unique ids if column included; None otherwise

Return type

(int or None)
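When a general column name refers to joint columns, the count covers ids appearing in any of them. A minimal sketch under that assumption, using plain dictionaries in place of a dataframe (n_unique_ids_sketch is a hypothetical helper):

```python
# Event-study-style rows with joint firm columns j1 and j2
rows = [
    {'i': 0, 'j1': 'A', 'j2': 'B'},
    {'i': 1, 'j1': 'B', 'j2': 'B'},
    {'i': 2, 'j1': 'C', 'j2': 'A'},
]
col_reference_dict = {'i': 'i', 'j': ['j1', 'j2']}

def n_unique_ids_sketch(rows, id_col, col_reference_dict):
    if id_col not in col_reference_dict:
        return None  # column not included
    subcols = col_reference_dict[id_col]
    if isinstance(subcols, str):
        subcols = [subcols]
    # Pool ids across all subcolumns before counting
    return len({row[col] for row in rows for col in subcols})

print(n_unique_ids_sketch(rows, 'j', col_reference_dict))  # 3
print(n_unique_ids_sketch(rows, 'g', col_reference_dict))  # None
```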

n_workers()

Get the number of unique workers.

Returns

number of unique workers

Return type

(int)

original_ids(copy=True)

Return self merged with original column ids.

Parameters

copy (bool) – if False, avoid copy

Returns

copy of dataframe merged with original column ids, or None if id_reference_dict is empty

Return type

(BipartiteBase or None)
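The link between original and contiguous ids can be illustrated with a small sketch. It assumes contiguous ids are assigned as 0, 1, 2, ... in order of first appearance, which may not match the library's ordering, and it uses a plain dictionary where the library tracks Pandas dataframes.

```python
# Original firm ids, which need not be contiguous integers
firm_ids = ['f30', 'f7', 'f30', 'f12', 'f7']

# Assign contiguous ids in order of first appearance
# (assumption: the library's actual ordering may differ)
id_map = {}
for original in firm_ids:
    id_map.setdefault(original, len(id_map))

contiguous = [id_map[original] for original in firm_ids]
print(contiguous)  # [0, 1, 0, 2, 1]

# Inverting the tracked mapping recovers the original ids
inverse = {new: old for old, new in id_map.items()}
recovered = [inverse[new] for new in contiguous]
print(recovered == firm_ids)  # True
```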

print_column_properties(col_name)

Print properties associated with a particular column.

Parameters

col_name (str) – general column name whose properties will be printed

rename(rename_dict, axis=0, inplace=False, allow_optional=False, allow_required=False, **kwargs)

Rename a column.

Parameters
  • rename_dict (dict) – key is current label, value is new label. When renaming columns, use general column names for joint columns, e.g. put ‘g’ instead of ‘g1’, ‘g2’.

  • axis (int or str) – whether to rename labels from the ‘index’ (0) or ‘columns’ (1)

  • inplace (bool) – if True, modify in-place

  • allow_optional (bool) – if True, allow optional columns to be renamed

  • allow_required (bool) – if True, allow required columns to be renamed

  • **kwargs – keyword arguments for Pandas rename

Returns

dataframe with renamed labels

Return type

(BipartiteBase)

set_column_properties(col_name, is_categorical=False, dtype='any', how_collapse='first', long_es_split=True, copy=True)

Safe method for setting the properties of pre-existing custom columns.

Parameters
  • col_name (str) – general column name

  • is_categorical (bool) – if True, column is categorical

  • dtype (str) – column datatype, must be one of ‘int’, ‘float’, ‘any’, or ‘categorical’

  • how_collapse (function or str or None) – how to collapse data at the worker-firm spell level, must be a valid input for Pandas groupby; if None, column will be dropped during collapse/uncollapse

  • long_es_split (bool or None) – if True, column should split into two when converting from long to event study; if None, column will be dropped when converting between (collapsed) long and (collapsed) event study

  • copy (bool) – if False, avoid copy

Returns

dataframe with updated column properties

Return type

(BipartiteBase)
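The effect of long_es_split can be sketched as follows. The sketch rests on two assumptions that are not stated above: that each event study row pairs two consecutive observations of a worker, and that an unsplit column keeps the first observation's value. The function to_event_study is hypothetical and works on plain dictionaries.

```python
# Long-format rows for one worker, assumed sorted by time
long_rows = [
    {'i': 0, 'j': 'A', 'y': 1.0, 'note': 'x'},
    {'i': 0, 'j': 'B', 'y': 2.0, 'note': 'y'},
    {'i': 0, 'j': 'C', 'y': 4.0, 'note': 'z'},
]

# True: split into two columns; False: keep one column; None: drop
long_es_split = {'j': True, 'y': True, 'i': False, 'note': None}

def to_event_study(rows, split):
    out = []
    # Assumption: each event study row pairs two consecutive observations
    for before, after in zip(rows, rows[1:]):
        es_row = {}
        for col, how in split.items():
            if how is None:
                continue  # column is dropped during conversion
            if how:
                es_row[col + '1'] = before[col]
                es_row[col + '2'] = after[col]
            else:
                # Assumption: an unsplit column keeps the first value
                es_row[col] = before[col]
        out.append(es_row)
    return out

es_rows = to_event_study(long_rows, long_es_split)
for row in es_rows:
    print(row)
```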

sort_cols(copy=True)

Sort frame columns (not in-place).

Parameters

copy (bool) – if False, avoid copy

Returns

dataframe with sorted columns

Return type

(BipartiteBase)

sort_rows(j_if_no_t=True, is_sorted=False, copy=True)

Sort rows by i and t.

Parameters
  • j_if_no_t (bool) – if no time column, sort on i and j columns instead

  • is_sorted (bool) – if False, dataframe will be sorted by i (and t, if included). Returned dataframe will be sorted. Sorting may alter original dataframe if copy is set to False. Set is_sorted to True if dataframe is already sorted.

  • copy (bool) – if False, avoid copy

Returns

dataframe with rows sorted

Return type

(BipartiteBase)
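The choice of sort keys can be sketched with plain dictionaries. This is an illustration of the rule described above (sort by i and t, or by i and j when there is no time column and j_if_no_t is True), not the library's code; sort_rows_sketch is a hypothetical name.

```python
rows = [
    {'i': 1, 'j': 'B', 't': 2},
    {'i': 0, 'j': 'A', 't': 1},
    {'i': 1, 'j': 'A', 't': 1},
    {'i': 0, 'j': 'C', 't': 0},
]

def sort_rows_sketch(rows, j_if_no_t=True):
    has_t = all('t' in row for row in rows)
    if has_t:
        key_cols = ['i', 't']
    elif j_if_no_t:
        key_cols = ['i', 'j']
    else:
        key_cols = ['i']
    # sorted() returns a new list, leaving the input untouched
    return sorted(rows, key=lambda row: tuple(row[c] for c in key_cols))

by_time = sort_rows_sketch(rows)
print([(r['i'], r['t']) for r in by_time])  # [(0, 0), (0, 1), (1, 1), (1, 2)]

# Without a time column, the firm id breaks ties within a worker
rows_no_t = [{'i': r['i'], 'j': r['j']} for r in rows]
by_firm = sort_rows_sketch(rows_no_t)
print([(r['i'], r['j']) for r in by_firm])  # [(0, 'A'), (0, 'C'), (1, 'A'), (1, 'B')]
```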

summary()

Print summary statistics. This uses class attributes. To run a diagnostic to verify these values, run .diagnostic().

unique_ids(id_col)

Unique ids in column.

Parameters

id_col (str) – column to check ids (‘i’, ‘j’, or ‘g’). Use general column names for joint columns, e.g. put ‘j’ instead of ‘j1’, ‘j2’.

Returns

unique ids if column included; None otherwise

Return type

(NumPy Array or None)