Measure observable changes in open source software development patterns before and after widespread AI-assisted coding adoption (Oct 2022 - present), using GitHub API data from major repositories.
This skill helps you analyze GitHub repositories to detect changes in development patterns after AI coding tools became widespread. It collects commits, PRs, issues, and releases from multiple repositories, performs statistical analysis, and generates interactive visualizations.
1. **Install dependencies**
```bash
uv sync --all-extras
```
2. **Configure pre-commit hooks**
```bash
uv run pre-commit install
uv run pre-commit install --hook-type commit-msg
```
3. **Set up environment variables**
- Create `.env.local` file
- Add `GITHUB_TOKEN=your_token_here`
- Optional overrides: `TEST_REPO`, `DATE_RANGE_START`, `DATE_RANGE_END`, `COLLECT_REPOS`
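Put together, a minimal `.env.local` might look like the following. All values are placeholders, and the value formats shown for the optional overrides (ISO dates, a comma-separated repo list) are assumptions; check `src/config.py` for what the project actually expects:

```shell
# Required: a GitHub personal access token for API collection
GITHUB_TOKEN=ghp_your_token_here

# Optional overrides (formats are illustrative assumptions)
TEST_REPO=owner/repo
DATE_RANGE_START=2020-01-01
DATE_RANGE_END=2024-12-31
COLLECT_REPOS=owner/repo-a,owner/repo-b
```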
The project uses a **two-sweep collection** strategy:
**Sweep 1 - Git (Fast, No API Limits)**
```bash
uv run python scripts/collect_all.py --sweep git
```
**Sweep 2 - API (GraphQL, Rate-Limited)**
```bash
uv run python scripts/collect_all.py --sweep api
```
**Or run both sweeps together:**
```bash
uv run python scripts/collect_all.py --sweep all
```
Useful flags:
```bash
uv run python scripts/collect_all.py --dry-run
uv run python scripts/collect_all.py --force-refresh
uv run python scripts/collect_all.py --repo owner/repo
uv run python scripts/collect_all.py --stale-days 30
```
1. Edit `src/config.py` and add to `CLUSTER_MAP`:
```python
CLUSTER_MAP = {
"owner/repo": "cluster-name",
}
```
2. Run collection for the new repo:
```bash
uv run python scripts/collect_all.py --repo owner/repo
```
3. Validate data quality:
```bash
uv run python scripts/validate_data.py
```
Run comprehensive data quality checks:
```bash
uv run python scripts/validate_data.py
uv run python scripts/validate_data.py --verbose
```
Create machine-readable findings:
```bash
uv run python scripts/generate_summary.py
```
This generates `output/findings_summary.json` with statistical results.
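The summary can then be consumed from Python. A minimal sketch (the JSON schema is whatever `generate_summary.py` emits; inspect the file before relying on specific keys):

```python
import json
from pathlib import Path


def load_findings(path: str = "output/findings_summary.json") -> dict:
    """Load the machine-readable findings produced by generate_summary.py."""
    return json.loads(Path(path).read_text())
```

Calling `sorted(load_findings())` is a quick way to see which top-level keys are available.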
Launch the Streamlit dashboard for visual exploration:
```bash
uv run streamlit run dashboard/app.py
```
Dashboard pages:
For interactive exploration:
```bash
uv run jupyter notebook notebooks/01_explore_single_repo.ipynb
```
Available notebooks:
Lint, format, and run the test suite:
```bash
uv run ruff check .
uv run ruff format .
uv run pytest
```
```
src/
config.py # Settings, constants, repo clusters
models.py # Pydantic data models
storage.py # SQLite operations
collector.py # Collection orchestration
git_collector.py # Local git operations
graphql_collector.py # GitHub GraphQL API
charts.py # Plotly visualizations
metrics.py # Statistical analysis (35+ functions)
analytics/ # Dashboard analytics layer
data_service.py # Data loading, caching
repo_analytics.py # Single-repo analysis
cohort_analytics.py # Multi-repo aggregations
comparison.py # Repo vs repo comparisons
derived_metrics.py # Computed metrics
caching.py # LRU cache utilities
dashboard/
app.py # Streamlit entry point
pages/ # Dashboard pages
components/ # Reusable UI components
data/ # SQLite database (gitignored)
repos/ # Git clones (gitignored)
output/charts/ # Exported visualizations
```
Core tables:
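The tables can be inspected directly with Python's `sqlite3`. A sketch; the database filename under `data/` is not shown here because it is defined by the project, so check `src/storage.py` for the real path:

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
        return [name for (name,) in rows]
    finally:
        conn.close()
```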
Collection status flow:
```
pending → git_in_progress → git_complete → api_in_progress → completed
```
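The flow above can be sketched as an ordered sequence. This is an illustration only; the actual state handling lives in the collection code and may differ:

```python
# Status values taken from the flow above; the transition logic is illustrative.
STATUS_FLOW = (
    "pending",
    "git_in_progress",
    "git_complete",
    "api_in_progress",
    "completed",
)


def next_status(current: str) -> str:
    """Advance to the next collection status; 'completed' is terminal."""
    idx = STATUS_FLOW.index(current)
    return STATUS_FLOW[min(idx + 1, len(STATUS_FLOW) - 1)]
```

For example, `next_status("git_complete")` returns `"api_in_progress"`, while `next_status("completed")` stays at `"completed"`.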
Estimated time for full cohort (21 repos):
The `src/analytics/` module provides UI-agnostic functions:
```python
from src.analytics import load_repo_data, get_repo_summary, compare_repos
repo_data = load_repo_data(repo_id)
summary = get_repo_summary(repo_data)
comparison = compare_repos(repo_a, repo_b, metrics=("commits", "prs_merged"))
```
Analytics results are cached in two layers for performance.
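The LRU layer can be sketched with the standard library's `functools.lru_cache`. This is an illustration only; `src/analytics/caching.py` may implement its utilities differently:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def expensive_summary(repo_id: str) -> dict:
    """Stand-in for an analytics computation worth caching per repo."""
    # In the real project this would load from SQLite and aggregate metrics;
    # the return value here is a placeholder.
    return {"repo_id": repo_id, "commits": 0}
```

A second call with the same `repo_id` is served from the cache, which is visible via `expensive_summary.cache_info().hits`.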
Clear all analytics caches:
```python
from src.analytics.caching import clear_all_analytics_caches
clear_all_analytics_caches()
```