PSID Data Analysis Code Camp Tutorial
Welcome to the Economics PhD Code Camp! In this tutorial, we'll build a complete data analysis workflow using PSID (Panel Study of Income Dynamics) data as our example. By the end, you'll have a reproducible research environment with proper version control, testing, and automation.
Prerequisites
- A computer with administrator privileges
- Basic familiarity with using a computer (file navigation, installing software)
- No prior programming experience required!
Step 1: Setting Up Your Development Environment
Installing the Essential Tools
We'll start by installing three fundamental tools that form the backbone of modern development:
- Visual Studio Code (VSCode): Download and install from https://code.visualstudio.com/
- Docker Desktop: Download and install from https://www.docker.com/products/docker-desktop/
- Dev Containers Extension: Open VSCode, go to Extensions (Ctrl+Shift+X), search for "Dev Containers" by Microsoft, and install it.
What is Docker? Docker is a containerization platform that packages your code and all its dependencies into a portable "container." This ensures your code runs the same way on any machine, eliminating the "it works on my computer" problem that plagues collaborative research.
After installation, restart your computer to ensure Docker is properly initialized.
Step 2: Creating Your Development Container
Understanding Dev Containers
A development container (devcontainer) is a Docker container specifically configured for development work. It includes all the tools, libraries, and dependencies you need for your project, creating a consistent development environment that can be shared across your team.
Setting Up Your Project Structure
- Create a new folder on your desktop called `psid-analysis`
- Open this folder in VSCode (File → Open Folder)
- Create the following folder structure: `.devcontainer/`, `notebooks/`, `src/`, and `tests/`
Creating Your Devcontainer Configuration
Create a file called `devcontainer.json` inside the `.devcontainer` folder with this template:

```json
{
  "name": "Economics Research Environment",
  "image": "mcr.microsoft.com/devcontainers/python:1-3.11-bullseye",
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter"
      ]
    }
  },
  "postCreateCommand": "pip install -r requirements.txt",
  "remoteUser": "vscode"
}
```
Your Task: Look at this configuration and answer these questions before proceeding:
- What Python version are we using?
- What extensions will be automatically installed?
- What command runs after the container is created?
Opening Your Project in the Container
- Press `Ctrl+Shift+P` (or `Cmd+Shift+P` on Mac) to open the command palette
- Type "Dev Containers: Reopen in Container" and select it
- Wait for the container to build (this may take a few minutes the first time)
You'll know it worked when you see "Dev Container: Economics Research Environment" in the bottom-left corner of VSCode.
Step 3: Your First Python Code
Understanding Jupyter Notebooks
A Jupyter notebook is an interactive computing environment that combines code, visualizations, and narrative text. It's perfect for data analysis because you can see results immediately and document your thought process alongside your code.
Creating Your First Notebook
- In the `notebooks` folder, create a new file called `exploration.ipynb`
- VSCode should recognize this as a Jupyter notebook
- Create a new code cell and type:
```python
print("Hello, Economics Code Camp!")
print("I'm running Python in a container!")

# Let's do some basic math
gdp_2020 = 21_427_700  # US GDP in millions of dollars
gdp_2021 = 23_315_080
growth_rate = (gdp_2021 - gdp_2020) / gdp_2020 * 100
print(f"US GDP growth rate 2020-2021: {growth_rate:.2f}%")
```
Your Task: Before running this cell, predict what the output will be. Then run it using Shift+Enter.
Step 4: Navigating the Terminal
Understanding the Terminal
The terminal (also called command line or shell) is a text-based interface to your computer. While it might seem intimidating at first, it's incredibly powerful for data analysis tasks and is essential for version control and automation.
Basic Terminal Operations
In VSCode, open the terminal with Ctrl+` (backtick) or Terminal → New Terminal.
Try these commands one by one and observe what happens:
```bash
pwd                # Print working directory
ls                 # List files and folders
ls -la             # List with details
cd notebooks       # Change to the notebooks directory
cd ..              # Go back to the parent directory
mkdir test_folder  # Create a new folder
rmdir test_folder  # Remove the folder
```
Your Task: Use the terminal to:
1. Navigate to your `src` folder
2. Create a file called `hello.py` using `touch hello.py`
3. List the contents to confirm it was created
4. Navigate back to the root of your project
Step 5: Running Python Scripts
Creating Your First Script
- In the `src` folder, create a file called `data_summary.py`
- Add this template code:
```python
# data_summary.py
"""
A simple script to demonstrate basic data operations.
"""

def main():
    # Sample earnings data (in thousands)
    earnings_data = [45, 52, 38, 67, 43, 55, 49, 61, 39, 58]

    # Your task: calculate these statistics without using any libraries.
    # Think about how you would compute each of these manually.
    mean_earnings = None    # TODO: Calculate mean
    median_earnings = None  # TODO: Calculate median (hint: sort first)
    max_earnings = None     # TODO: Find maximum
    min_earnings = None     # TODO: Find minimum

    print("Earnings Analysis:")
    print(f"Mean: ${mean_earnings:.2f}k")
    print(f"Median: ${median_earnings:.2f}k")
    print(f"Range: ${min_earnings:.2f}k - ${max_earnings:.2f}k")

if __name__ == "__main__":
    main()
```
Your Task: Complete the TODO sections. Think about the algorithms:
- Mean: sum all values, then divide by the count
- Median: sort the list, then take the middle value (or the average of the two middle values)
- Max/Min: iterate through the list to find the extremes
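If you get stuck, here is one possible completion, sketched as a standalone function rather than the full script (your variable names and structure may differ):

```python
def summarize(values):
    """Compute mean, median, max, and min without any libraries."""
    n = len(values)
    mean = sum(values) / n

    # Median: sort first, then take the middle element
    # (or average the two middle elements when n is even).
    ordered = sorted(values)
    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2

    # Max/min: iterate through the list, tracking the extremes.
    highest = lowest = values[0]
    for v in values[1:]:
        if v > highest:
            highest = v
        if v < lowest:
            lowest = v

    return mean, median, highest, lowest

print(summarize([45, 52, 38, 67, 43, 55, 49, 61, 39, 58]))
```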
Running Your Script
From the terminal, run `python src/data_summary.py`.
Step 6: Managing Dependencies
Creating a Requirements File
Create a `requirements.txt` file in your project root with one package name per line: `pandas`, `numpy`, and `pyreadstat`.
Your Task: Research what each of these packages does:
- pandas: ?
- numpy: ?
- pyreadstat: ? (Hint: this is specifically for economists!)
Installing Dependencies
Rebuild your container to install these packages:
1. Press Ctrl+Shift+P
2. Select "Dev Containers: Rebuild Container"
Step 7: Version Control with Git and GitHub
Understanding Git and GitHub
Git is a version control system that tracks changes to your files over time, like a sophisticated "undo" system for your entire project. GitHub is a cloud-based platform that stores your Git repositories and enables collaboration, acting like "Google Drive for code" but with powerful project management features.
Essential Git Commands
You'll use these commands constantly:
- `git status` - Shows what files have changed
- `git add <file>` - Stages files for commit
- `git commit -m "message"` - Saves changes with a description
- `git push` - Uploads changes to GitHub
- `git pull` - Downloads changes from GitHub
Setting Up GitHub
- Create a GitHub account at github.com
- Create a new private repository called `psid-analysis`
- Don't initialize it with a README, .gitignore, or license (we'll do this locally)
Configuring Git in Your Container
In the terminal, run these commands (replace with your information):
```bash
git config --global user.name "Your Name"
git config --global user.email "your.email@university.edu"
```
Initializing Your Repository
```bash
git init
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/psid-analysis.git
```
Your Task: Replace YOUR_USERNAME with your actual GitHub username.
Your First Commit
Create a `.gitignore` file in your project root (typical entries for a Python project like this include `__pycache__/`, `.ipynb_checkpoints/`, and any raw data files you shouldn't commit):
Your Task: Research what each line in .gitignore does. Why might we want to ignore these files?
Now commit your work:
```bash
git add .
git status  # Check what will be committed
git commit -m "Initial project setup with devcontainer"
git push -u origin main
```
Step 8: Creating Your Analysis Module
Understanding Code Factorization
Code factorization means organizing your code into reusable functions and modules instead of writing everything in one long script. This approach offers several benefits: it makes your code easier to test and debug, promotes reusability across different projects, and makes collaboration much smoother since team members can work on different modules independently.
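As a minimal illustration of the idea (with made-up example data), compare an unfactored snippet with the same logic factored into a reusable function:

```python
# Unfactored: the logic is welded to one dataset, so it is hard to reuse or test.
earnings = [45, 52, 38, 67]
print(sum(earnings) / len(earnings))

# Factored: the same logic as a named, reusable, testable function.
def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

print(mean(earnings))         # works on the original data
print(mean([100, 200, 300]))  # ...and on any other dataset
```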
Setting Up Your Module Structure
Create this structure in your `src` folder: a `camp/` subfolder containing `__init__.py` and `analysis.py`.
Understanding Python Imports
When you create a folder with an __init__.py file, Python treats it as a package. The __init__.py file controls what gets imported when someone uses import camp. It can be empty (making all modules available) or can explicitly define what should be accessible.
Creating Your Analysis Module
In `src/camp/analysis.py`, create this template:

```python
"""
Statistical analysis functions for economics research.
"""
import pandas as pd
import numpy as np
from typing import Dict, Union


def compute_moments(data: pd.Series, variable_name: str) -> Dict[str, float]:
    """
    Compute the first four moments of a statistical distribution.

    Your task: Implement this function to calculate:
    1. Mean (1st moment)
    2. Variance (2nd central moment)
    3. Skewness (3rd standardized moment)
    4. Kurtosis (4th standardized moment)

    Args:
        data: A pandas Series containing the data
        variable_name: Name of the variable for reporting

    Returns:
        Dictionary with moment statistics

    Think about:
    - What does each of these moments tell us about the data distribution?
    - How would you interpret high skewness in earnings data?
    - What does kurtosis reveal about tail behavior?
    """
    # Remove any missing values
    clean_data = data.dropna()

    # TODO: Calculate each moment.
    # Hint: pandas Series have methods like .mean(), .var(), .skew(), and
    # .kurtosis() -- but try to understand what these actually compute!
    moments = {
        'count': len(clean_data),
        'mean': None,      # TODO
        'variance': None,  # TODO
        'skewness': None,  # TODO
        'kurtosis': None,  # TODO
    }
    return moments


def analyze_variable(df: pd.DataFrame, column: str) -> None:
    """
    Print a formatted analysis of a variable.

    Your task: Complete this function to nicely display the moments.
    """
    # TODO: Use compute_moments and format the output nicely
    pass
```
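To build intuition for what these moments measure (this is not the assignment solution), here is a sketch that computes the population versions by hand with NumPy. Note that pandas' `.skew()` and `.kurtosis()` apply small-sample bias corrections, so their values will differ slightly, and `.kurtosis()` reports excess kurtosis (kurtosis minus 3):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean = x.mean()                      # 1st moment: center of the distribution
deviations = x - mean
variance = (deviations ** 2).mean()  # 2nd central moment: spread (population version)
std = variance ** 0.5

# 3rd standardized moment: asymmetry (0 for a symmetric sample like this one)
skewness = (deviations ** 3).mean() / std ** 3

# 4th standardized moment minus 3: tail weight relative to a normal distribution
excess_kurtosis = (deviations ** 4).mean() / std ** 4 - 3

print(mean, variance, skewness, excess_kurtosis)
```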
Setting Up Your Module for Development
Add an editable-install line for your package to your `requirements.txt` (for this layout, something like `-e ./src`, assuming your package configuration lives there).
What does -e do? The -e flag installs your package in "editable" mode, meaning changes to your code are immediately available without reinstalling. This is essential for development work.
Making Your Module Importable
In `src/camp/__init__.py`:

```python
"""
CAMP: Code Analysis Module for Economists
"""
from .analysis import compute_moments, analyze_variable

__version__ = "0.1.0"
```
Your Task: After completing the functions, test your module in a notebook:
```python
import camp
import pandas as pd

# Create some sample data
sample_earnings = pd.Series([45000, 52000, 38000, 67000, 43000, 55000])
result = camp.compute_moments(sample_earnings, "earnings")
print(result)
```
Adding Autoreload Magic
Understanding Jupyter Magic Commands: Magic commands are special Jupyter features that start with % (line magic) or %% (cell magic). They provide shortcuts for common tasks like timing code, loading files, or in this case, automatically reloading modules when they change.
Add these two lines to the top of your notebook: `%load_ext autoreload` followed by `%autoreload 2`. The first loads the autoreload extension; the second tells it to reload all modules before each cell runs.
Your Task: Add these magic commands to your notebook and explain in a markdown cell what they do and why they're useful for development.
Step 9: Adding Type Hints
Understanding Type Hints
Type hints are annotations that specify what types of data your functions expect and return. They make your code self-documenting, help catch errors early, and improve IDE support with better autocomplete and error detection.
Your Task: Add Type Hints
Look at your compute_moments function. It already has some type hints, but examine them carefully:
Questions to consider:
- What does pd.Series tell us about the expected input?
- What does Dict[str, float] tell us about the return value?
- Are there any issues with this return type hint given our current implementation?
Improving Your Type Hints
Consider this enhanced version:
```python
from typing import Dict, Union, Optional

def compute_moments(
    data: pd.Series,
    variable_name: str
) -> Dict[str, Union[int, float]]:
    ...
```
Your Task: Explain why Union[int, float] might be better than just float for our return type.
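As a concrete hint, here's a toy stand-in (not the real `camp` function) showing why the returned dictionary mixes types: the count is an `int` while the other statistics are `float`s, so annotating every value as `float` would misdescribe one of them:

```python
from typing import Dict, Union

def moment_summary(values: list) -> Dict[str, Union[int, float]]:
    """Toy stand-in returning a mix of int and float statistics."""
    count = len(values)         # int
    mean = sum(values) / count  # float (true division always produces float)
    return {"count": count, "mean": mean}

summary = moment_summary([1, 2, 3, 4])
print(type(summary["count"]), type(summary["mean"]))
```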
Step 10: Testing Your Code
Understanding Testing
Testing is the practice of writing code to automatically verify that your functions work correctly. Tests serve as a safety net that catches bugs when you modify code, and they also serve as documentation showing how your functions should be used.
Creating Your First Test
Create `tests/test_analysis.py`:

```python
"""
Tests for the camp.analysis module.
"""
import pytest
import pandas as pd
import numpy as np

from camp.analysis import compute_moments


def test_compute_moments_basic():
    """Test basic functionality of compute_moments."""
    # Arrange: set up a simple dataset where we know the expected results
    test_data = pd.Series([1, 2, 3, 4, 5])

    # Act: run the function
    result = compute_moments(test_data, "test_variable")

    # Assert: check the results.
    # Your task: replace each ... with the expected value.
    # Hint: for [1, 2, 3, 4, 5], mean = 3 and sample variance = 2.5.
    assert result['count'] == 5
    assert result['mean'] == ...  # TODO: expected mean
    assert abs(result['variance'] - ...) < 0.001  # TODO: expected variance (abs() for floating-point comparison)


def test_compute_moments_with_missing_data():
    """Test that the function handles missing data correctly."""
    # Your task: create test data with NaN values and check that
    # the function excludes them properly.
    test_data = pd.Series([1, 2, np.nan, 4, 5])
    result = compute_moments(test_data, "test_variable")

    # What should the count be?
    assert result['count'] == ...  # TODO


def test_compute_moments_empty_series():
    """Test edge case: what happens with empty data?"""
    # Your task: decide what should happen with empty data.
    # Should it raise an error? Return None? Return zeros?
    # Implement your chosen behavior in the function and test it here.
    pass  # Replace with your test


# Your task: add more test cases. What other edge cases should you test?
# - All identical values?
# - Only one value?
# - Very large numbers?
```
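As one possible extra case, sketched with a hypothetical minimal stand-in for `compute_moments` so the example runs on its own: a series of identical values should report zero variance and a mean equal to the common value.

```python
import pandas as pd

def compute_moments(data, variable_name):
    """Hypothetical stand-in for camp.analysis.compute_moments."""
    clean = data.dropna()
    return {
        "count": len(clean),
        "mean": clean.mean(),
        "variance": clean.var(),  # pandas default: sample variance (ddof=1)
    }

def test_compute_moments_identical_values():
    """All identical values: no spread, mean equals the common value."""
    result = compute_moments(pd.Series([7.0, 7.0, 7.0, 7.0]), "constant")
    assert result["count"] == 4
    assert result["mean"] == 7.0
    assert result["variance"] == 0.0

test_compute_moments_identical_values()
```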
Installing pytest
Add `pytest` to your `requirements.txt`.
Rebuild your container.
Running Your Tests
From the terminal, run `pytest tests/ -v`. Your Task: Make sure all your tests pass. If they fail, debug both your function and your tests.
Step 11: Continuous Integration with GitHub Actions
Understanding GitHub Actions
GitHub Actions is an automation platform that runs code (like tests) automatically when certain events happen in your repository, such as pushing new code. This ensures that all code in your repository meets quality standards and helps catch bugs before they reach production.
Creating Your First Action
Create `.github/workflows/tests.yml`:

```yaml
name: Run Tests

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: |
          pytest tests/ -v
```
Your Task: Before committing this file, answer these questions:
- When will this action run?
- What Python version will it use?
- What happens if a test fails?
Testing Your Automation
- Commit and push your changes (e.g. `git add .`, then `git commit -m "Add test workflow"`, then `git push`)
- Go to your GitHub repository and click on the "Actions" tab
- Watch your tests run automatically!
Making a Test Fail
Your Task:
1. Temporarily break one of your tests (change an expected value)
2. Commit and push
3. Watch the action fail
4. Fix the test and push again
5. Confirm the action passes
This demonstrates how GitHub Actions catches problems automatically.
Step 12: Putting It All Together
Final Project Structure
Your project should now look like this:
```
psid-analysis/
├── .devcontainer/
│   └── devcontainer.json
├── .github/
│   └── workflows/
│       └── tests.yml
├── notebooks/
│   └── exploration.ipynb
├── src/
│   └── camp/
│       ├── __init__.py
│       └── analysis.py
├── tests/
│   └── test_analysis.py
├── .gitignore
└── requirements.txt
```
Creating a Comprehensive Analysis
In your `exploration.ipynb`, create a comprehensive analysis that:
- Imports your `camp` module
- Creates or loads sample economic data
- Uses your `compute_moments` function
- Interprets the results in economic terms
Your Task: Write a markdown cell explaining what each moment tells us about economic data. For example:
- What does high skewness in income data typically indicate?
- How might kurtosis be relevant for risk analysis?
Final Commit
Commit and push your finished project (e.g. `git add .`, `git commit -m "Complete analysis workflow"`, `git push`).
Congratulations! 🎉
You've built a complete, professional data analysis workflow including:
- Reproducible environment with Docker containers
- Version control with Git and GitHub
- Modular code with proper Python packaging
- Type safety with type hints
- Quality assurance with automated testing
- Continuous integration with GitHub Actions
Next Steps
This foundation prepares you for real economic research:
- Data Integration: Replace sample data with actual PSID files using `pyreadstat`
- Advanced Analysis: Add econometric functions (regression, panel data methods)
- Visualization: Create publication-ready plots with matplotlib/seaborn
- Documentation: Add docstrings and generate documentation with Sphinx
- Collaboration: Use branches and pull requests for team research
Key Takeaways
- Automation saves time: Setting up proper tooling takes effort upfront but saves hours later
- Testing prevents bugs: Small bugs in economic analysis can lead to wrong policy conclusions
- Version control enables collaboration: Essential for reproducible research
- Modular code is reusable: Functions you write for one project can be used in others
Remember: these tools aren't just for software developersβthey're essential for any researcher working with data in the modern era.