PSID Data Analysis Code Camp Tutorial

Welcome to the Economics PhD Code Camp! In this tutorial, we'll build a complete data analysis workflow using PSID (Panel Study of Income Dynamics) data as our example. By the end, you'll have a reproducible research environment with proper version control, testing, and automation.

Prerequisites

  • A computer with administrator privileges
  • Basic familiarity with using a computer (file navigation, installing software)
  • No prior programming experience required!

Step 1: Setting Up Your Development Environment

Installing the Essential Tools

We'll start by installing three fundamental tools that form the backbone of modern development:

  1. Visual Studio Code (VSCode): Download and install from https://code.visualstudio.com/
  2. Docker Desktop: Download and install from https://www.docker.com/products/docker-desktop/
  3. Dev Containers Extension: Open VSCode, go to Extensions (Ctrl+Shift+X), search for "Dev Containers" by Microsoft, and install it.

What is Docker? Docker is a containerization platform that packages your code and all its dependencies into a portable "container." This ensures your code runs the same way on any machine, eliminating the "it works on my computer" problem that plagues collaborative research.

After installation, restart your computer to ensure Docker is properly initialized.


Step 2: Creating Your Development Container

Understanding Dev Containers

A development container (devcontainer) is a Docker container specifically configured for development work. It includes all the tools, libraries, and dependencies you need for your project, creating a consistent development environment that can be shared across your team.

Setting Up Your Project Structure

  1. Create a new folder on your desktop called psid-analysis
  2. Open this folder in VSCode (File → Open Folder)
  3. Create the following folder structure:
    psid-analysis/
    ├── .devcontainer/
    │   └── devcontainer.json
    ├── notebooks/
    ├── src/
    └── tests/

Creating Your Devcontainer Configuration

Create a file called devcontainer.json inside the .devcontainer folder with this template:

{
    "name": "Economics Research Environment",
    "image": "mcr.microsoft.com/devcontainers/python:1-3.11-bullseye",
    "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python",
                "ms-toolsai.jupyter"
            ]
        }
    },
    "postCreateCommand": "pip install -r requirements.txt",
    "remoteUser": "vscode"
}

Your Task: Look at this configuration and answer these questions before proceeding:

  • What Python version are we using?
  • What extensions will be automatically installed?
  • What command runs after the container is created?

Opening Your Project in the Container

  1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac) to open the command palette
  2. Type "Dev Containers: Reopen in Container" and select it
  3. Wait for the container to build (this may take a few minutes the first time)

You'll know it worked when you see "Dev Container: Economics Research Environment" in the bottom-left corner of VSCode.


Step 3: Your First Python Code

Understanding Jupyter Notebooks

A Jupyter notebook is an interactive computing environment that combines code, visualizations, and narrative text. It's perfect for data analysis because you can see results immediately and document your thought process alongside your code.

Creating Your First Notebook

  1. In the notebooks folder, create a new file called exploration.ipynb
  2. VSCode should recognize this as a Jupyter notebook
  3. Create a new code cell and type:
print("Hello, Economics Code Camp!")
print("I'm running Python in a container!")

# Let's do some basic math
gdp_2020 = 21_427_700  # US GDP in millions of dollars
gdp_2021 = 23_315_080
growth_rate = (gdp_2021 - gdp_2020) / gdp_2020 * 100

print(f"US GDP growth rate 2020-2021: {growth_rate:.2f}%")

Your Task: Before running this cell, predict what the output will be. Then run it using Shift+Enter.


Step 4: Navigating the Terminal

Understanding the Terminal

The terminal (also called command line or shell) is a text-based interface to your computer. While it might seem intimidating at first, it's incredibly powerful for data analysis tasks and is essential for version control and automation.

Basic Terminal Operations

In VSCode, open the terminal with Ctrl+` (backtick) or Terminal → New Terminal.

Try these commands one by one and observe what happens:

pwd                    # Print working directory
ls                     # List files and folders
ls -la                 # List with details
cd notebooks          # Change to notebooks directory
cd ..                 # Go back to parent directory
mkdir test_folder     # Create a new folder
rmdir test_folder     # Remove the folder

Your Task: Use the terminal to:

  1. Navigate to your src folder
  2. Create a file called hello.py using touch hello.py
  3. List the contents to confirm it was created
  4. Navigate back to the root of your project


Step 5: Running Python Scripts

Creating Your First Script

  1. In the src folder, create a file called data_summary.py
  2. Add this template code:
# data_summary.py
"""
A simple script to demonstrate basic data operations
"""

def main():
    # Sample earnings data (in thousands)
    earnings_data = [45, 52, 38, 67, 43, 55, 49, 61, 39, 58]

    # Your task: Calculate these statistics without using any libraries
    # Think about how you would compute each of these manually

    mean_earnings = None    # TODO: Calculate mean
    median_earnings = None  # TODO: Calculate median (hint: sort first)
    max_earnings = None     # TODO: Find maximum
    min_earnings = None     # TODO: Find minimum

    print("Earnings Analysis:")
    print(f"Mean: ${mean_earnings:.2f}k")
    print(f"Median: ${median_earnings:.2f}k")
    print(f"Range: ${min_earnings:.2f}k - ${max_earnings:.2f}k")

if __name__ == "__main__":
    main()

Your Task: Complete the TODO sections. Think about the algorithms:

  • Mean: sum all values, divide by count
  • Median: sort the list, take the middle value (or average of the two middle values)
  • Max/Min: iterate through to find extremes
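To make the median step concrete, here is one way it can be computed by hand, shown on a different toy list so the exercise data isn't spoiled:

```python
# One possible median computation, done by hand (no libraries).
# Toy list, deliberately different from the exercise data.
values = [3, 1, 4, 1, 5, 9]

s = sorted(values)  # median needs the data in order; sorted() returns a copy
n = len(s)
if n % 2 == 1:
    median = s[n // 2]  # odd count: the single middle element
else:
    median = (s[n // 2 - 1] + s[n // 2]) / 2  # even count: average the two middle elements

print(median)  # 3.5 for this list
```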

Running Your Script

From the terminal, run:

python src/data_summary.py


Step 6: Managing Dependencies

Creating a Requirements File

Create a requirements.txt file in your project root:

pandas>=1.5.0
numpy>=1.20.0
matplotlib>=3.5.0
seaborn>=0.11.0
pyreadstat>=1.1.0
jupyter>=1.0.0

Your Task: Research what each of these packages does:

  • pandas: ?
  • numpy: ?
  • pyreadstat: ? (Hint: this is specifically for economists!)

Installing Dependencies

Rebuild your container to install these packages:

  1. Press Ctrl+Shift+P
  2. Select "Dev Containers: Rebuild Container"


Step 7: Version Control with Git and GitHub

Understanding Git and GitHub

Git is a version control system that tracks changes to your files over time, like a sophisticated "undo" system for your entire project. GitHub is a cloud-based platform that stores your Git repositories and enables collaboration, acting like "Google Drive for code" but with powerful project management features.

Essential Git Commands

You'll use these commands constantly:

  • git status - Shows what files have changed
  • git add <file> - Stages files for commit
  • git commit -m "message" - Saves changes with a description
  • git push - Uploads changes to GitHub
  • git pull - Downloads changes from GitHub

Setting Up GitHub

  1. Create a GitHub account at github.com
  2. Create a new private repository called psid-analysis
  3. Don't initialize with README, .gitignore, or license (we'll do this locally)

Configuring Git in Your Container

In the terminal, run these commands (replace with your information):

git config --global user.name "Your Name"
git config --global user.email "your.email@university.edu"

Initializing Your Repository

git init
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/psid-analysis.git

Your Task: Replace YOUR_USERNAME with your actual GitHub username.

Your First Commit

Create a .gitignore file in your project root:

__pycache__/
*.pyc
.pytest_cache/
.vscode/settings.json
*.log
data/raw/
.env

Your Task: Research what each line in .gitignore does. Why might we want to ignore these files?

Now commit your work:

git add .
git status                    # Check what will be committed
git commit -m "Initial project setup with devcontainer"
git push -u origin main

Step 8: Creating Your Analysis Module

Understanding Code Factorization

Code factorization means organizing your code into reusable functions and modules instead of writing everything in one long script. This approach offers several benefits: it makes your code easier to test and debug, promotes reusability across different projects, and makes collaboration much smoother since team members can work on different modules independently.
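As a tiny illustration, the inline growth calculation from Step 3 can be factored into a function that any script or notebook can reuse:

```python
# Factored version of the Step 3 growth calculation: one reusable function
# instead of an inline formula repeated wherever it's needed.
def growth_rate(old: float, new: float) -> float:
    """Percent change from old to new."""
    return (new - old) / old * 100

print(growth_rate(21_427_700, 23_315_080))  # the same GDP figures as Step 3
print(growth_rate(100, 110))                # 10.0
```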

Setting Up Your Module Structure

Create this structure in your src folder:

src/
├── camp/
│   ├── __init__.py
│   └── analysis.py
└── data_summary.py

Understanding Python Imports

When you create a folder with an __init__.py file, Python treats it as a package. The __init__.py file controls what gets imported when someone uses import camp. It can be empty (making all modules available) or can explicitly define what should be accessible.

Creating Your Analysis Module

In src/camp/analysis.py, create this template:

"""
Statistical analysis functions for economics research
"""
import pandas as pd
import numpy as np
from typing import Dict, Union

def compute_moments(data: pd.Series, variable_name: str) -> Dict[str, float]:
    """
    Compute the first four moments of a statistical distribution.

    Your task: Implement this function to calculate:
    1. Mean (1st moment)
    2. Variance (2nd central moment) 
    3. Skewness (3rd standardized moment)
    4. Kurtosis (4th standardized moment)

    Args:
        data: A pandas Series containing the data
        variable_name: Name of the variable for reporting

    Returns:
        Dictionary with moment statistics

    Think about:
    - What do each of these moments tell us about the data distribution?
    - How would you interpret high skewness in earnings data?
    - What does kurtosis reveal about tail behavior?
    """

    # Remove any missing values
    clean_data = data.dropna()

    # TODO: Calculate each moment
    # Hint: pandas Series have methods like .mean(), .var(), .skew(), .kurtosis()
    # But try to understand what these actually compute!

    moments = {
        'count': len(clean_data),
        'mean': None,      # TODO
        'variance': None,  # TODO
        'skewness': None,  # TODO
        'kurtosis': None   # TODO
    }

    return moments

def analyze_variable(df: pd.DataFrame, column: str) -> None:
    """
    Print a formatted analysis of a variable.

    Your task: Complete this function to nicely display the moments
    """
    # TODO: Use compute_moments and format the output nicely
    pass
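If you want to sanity-check the pandas hints in the docstring above, here is what those methods return on a toy Series where the answers are easy to verify by hand. Note that .var() and .skew() use the sample (bias-corrected) formulas, and .kurtosis() reports excess kurtosis, so a normal distribution gives 0:

```python
import pandas as pd

# Toy Series where the moments are easy to check by hand.
s = pd.Series([1, 2, 3, 4, 5])

print(s.mean())      # 3.0
print(s.var())       # 2.5 (sample variance, ddof=1)
print(s.skew())      # 0.0 (the data are symmetric around the mean)
print(s.kurtosis())  # -1.2 (excess kurtosis; flat distributions are negative)
```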

Setting Up Your Module for Development

Add this line to your requirements.txt:

-e ./src

What does -e do? The -e flag installs your package in "editable" mode, meaning changes to your code are immediately available without reinstalling. This is essential for development work. Note that pip can only perform an editable install if src contains a package definition (a setup.py or pyproject.toml); without one, the install will fail.
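Editable installs need package metadata inside src. Here is a minimal sketch using the classic setup.py form (a pyproject.toml would work equally well; the name and version simply mirror what we put in __init__.py):

```python
# src/setup.py -- minimal package definition so `pip install -e ./src` works.
# (A sketch; modern projects often use pyproject.toml instead.)
from setuptools import setup, find_packages

setup(
    name="camp",
    version="0.1.0",
    packages=find_packages(),  # discovers the camp/ package next to this file
)
```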

Making Your Module Importable

In src/camp/__init__.py:

"""
CAMP: Code Analysis Module for Economists
"""
from .analysis import compute_moments, analyze_variable

__version__ = "0.1.0"

Your Task: After completing the functions, test your module in a notebook:

import camp
import pandas as pd

# Create some sample data
sample_earnings = pd.Series([45000, 52000, 38000, 67000, 43000, 55000])
result = camp.compute_moments(sample_earnings, "earnings")
print(result)

Adding Autoreload Magic

Understanding Jupyter Magic Commands: Magic commands are special Jupyter features that start with % (line magic) or %% (cell magic). They provide shortcuts for common tasks like timing code, loading files, or in this case, automatically reloading modules when they change.

Add this to the top of your notebook:

%load_ext autoreload
%autoreload 2

Your Task: Add these magic commands to your notebook and explain in a markdown cell what they do and why they're useful for development.


Step 9: Adding Type Hints

Understanding Type Hints

Type hints are annotations that specify what types of data your functions expect and return. They make your code self-documenting, help catch errors early, and improve IDE support with better autocomplete and error detection.
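As a small illustration (the function name summarize is made up for this sketch), hints are just annotations: Python does not enforce them at runtime, but mypy and the VSCode Python extension use them to flag type mismatches before you run anything.

```python
from typing import Dict, List

def summarize(values: List[float]) -> Dict[str, float]:
    """Return basic summary statistics for a list of numbers."""
    return {"mean": sum(values) / len(values)}

print(summarize([1.0, 2.0, 3.0]))  # {'mean': 2.0}

# A type checker would flag this call (str is not float), even though
# plain Python would only fail at runtime:
# summarize(["a", "b"])
```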

Your Task: Add Type Hints

Look at your compute_moments function. It already has some type hints, but examine them carefully:

def compute_moments(data: pd.Series, variable_name: str) -> Dict[str, float]:

Questions to consider:

  • What does pd.Series tell us about the expected input?
  • What does Dict[str, float] tell us about the return value?
  • Are there any issues with this return type hint given our current implementation?

Improving Your Type Hints

Consider this enhanced version:

from typing import Dict, Union, Optional

def compute_moments(
    data: pd.Series, 
    variable_name: str
) -> Dict[str, Union[int, float]]:

Your Task: Explain why Union[int, float] might be better than just float for our return type.


Step 10: Testing Your Code

Understanding Testing

Testing is the practice of writing code to automatically verify that your functions work correctly. Tests serve as a safety net that catches bugs when you modify code, and they also serve as documentation showing how your functions should be used.

Creating Your First Test

Create tests/test_analysis.py:

"""
Tests for the camp.analysis module
"""
import pytest
import pandas as pd
import numpy as np
from camp.analysis import compute_moments

def test_compute_moments_basic():
    """Test basic functionality of compute_moments"""

    # Arrange: Set up test data
    # Create a simple dataset where we know the expected results
    test_data = pd.Series([1, 2, 3, 4, 5])

    # Act: Run the function
    result = compute_moments(test_data, "test_variable")

    # Assert: Check the results
    # Your task: Fill in the expected values
    # Hint: For data [1,2,3,4,5], mean = 3, variance = 2.5

    assert result['count'] == 5
    expected_mean = None      # TODO: fill in the expected mean
    expected_variance = None  # TODO: fill in the expected variance
    assert result['mean'] == expected_mean
    assert abs(result['variance'] - expected_variance) < 0.001  # abs() handles floating point comparison

def test_compute_moments_with_missing_data():
    """Test that function handles missing data correctly"""

    # Your task: Create test data with NaN values
    # Test that the function excludes them properly

    test_data = pd.Series([1, 2, np.nan, 4, 5])
    result = compute_moments(test_data, "test_variable")

    # What should the count be?
    expected_count = None  # TODO
    assert result['count'] == expected_count

def test_compute_moments_empty_series():
    """Test edge case: what happens with empty data?"""

    # Your task: Think about what should happen with empty data
    # Should it raise an error? Return None? Return zeros?
    # Implement your chosen behavior in the function and test it here

    pass  # Replace with your test

# Your task: Add more test cases
# Consider: What other edge cases should you test?
# - All identical values?
# - Only one value?  
# - Very large numbers?

Installing pytest

Add to your requirements.txt:

pytest>=7.0.0

Rebuild your container.

Running Your Tests

pytest tests/ -v

Your Task: Make sure all your tests pass. If they fail, debug both your function and your tests.


Step 11: Continuous Integration with GitHub Actions

Understanding GitHub Actions

GitHub Actions is an automation platform that runs code (like tests) automatically when certain events happen in your repository, such as pushing new code. This ensures that all code in your repository meets quality standards and helps catch bugs before they reach production.

Creating Your First Action

Create .github/workflows/tests.yml:

name: Run Tests

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Run tests
      run: |
        pytest tests/ -v

Your Task: Before committing this file, answer these questions:

  • When will this action run?
  • What Python version will it use?
  • What happens if a test fails?

Testing Your Automation

  1. Commit and push your changes:
git add .
git commit -m "Add testing and GitHub Actions"
git push
  2. Go to your GitHub repository and click on the "Actions" tab
  3. Watch your tests run automatically!

Making a Test Fail

Your Task:

  1. Temporarily break one of your tests (change an expected value)
  2. Commit and push
  3. Watch the action fail
  4. Fix the test and push again
  5. Confirm the action passes

This demonstrates how GitHub Actions catches problems automatically.


Step 12: Putting It All Together

Final Project Structure

Your project should now look like this:

psid-analysis/
├── .devcontainer/
│   └── devcontainer.json
├── .github/
│   └── workflows/
│       └── tests.yml
├── notebooks/
│   └── exploration.ipynb
├── src/
│   ├── camp/
│   │   ├── __init__.py
│   │   └── analysis.py
│   └── data_summary.py
├── tests/
│   └── test_analysis.py
├── .gitignore
└── requirements.txt

Creating a Comprehensive Analysis

In your exploration.ipynb, create a comprehensive analysis that:

  1. Imports your camp module
  2. Creates or loads sample economic data
  3. Uses your compute_moments function
  4. Interprets the results in economic terms

Your Task: Write a markdown cell explaining what each moment tells us about economic data. For example:

  • What does high skewness in income data typically indicate?
  • How might kurtosis be relevant for risk analysis?

Final Commit

git add .
git commit -m "Complete analysis workflow with testing and CI"
git push

Congratulations! 🎉

You've built a complete, professional data analysis workflow including:

  • Reproducible environment with Docker containers
  • Version control with Git and GitHub
  • Modular code with proper Python packaging
  • Type safety with type hints
  • Quality assurance with automated testing
  • Continuous integration with GitHub Actions

Next Steps

This foundation prepares you for real economic research:

  1. Data Integration: Replace sample data with actual PSID files using pyreadstat
  2. Advanced Analysis: Add econometric functions (regression, panel data methods)
  3. Visualization: Create publication-ready plots with matplotlib/seaborn
  4. Documentation: Add docstrings and generate documentation with Sphinx
  5. Collaboration: Use branches and pull requests for team research

Key Takeaways

  • Automation saves time: Setting up proper tooling takes effort upfront but saves hours later
  • Testing prevents bugs: Small bugs in economic analysis can lead to wrong policy conclusions
  • Version control enables collaboration: Essential for reproducible research
  • Modular code is reusable: Functions you write for one project can be used in others

Remember: these tools aren't just for software developers; they're essential for any researcher working with data in the modern era.