End-to-end local pipeline for downloading YouTube audio, transcribing with Whisper, indexing in vector DB, and enabling question-answering with citations and timestamps.
This skill enables GitHub Copilot to assist with **shadowchat**, a YouTube Audio QA system implementing the pipeline described above. The system runs entirely locally with no external API costs, except for the optional OpenAI LLM features.
When working on the shadowchat repository, Copilot will understand:
**Type**: Python CLI application
**Languages**: Python 3.12+
**Key Technologies**: yt-dlp, faster-whisper, ChromaDB, sentence-transformers, OpenAI API (optional), Rich CLI
```
shadowchat/
├── main.py # CLI entry point with subcommands
├── yt_audio_ai/ # Core package
│ ├── downloader.py # YouTube audio downloading (yt-dlp)
│ ├── transcriber.py # Audio → text transcription (faster-whisper)
│ ├── indexer.py # Text → vector database (ChromaDB)
│ ├── qa.py # Question answering and LLM analysis
│ ├── llm.py # OpenAI API interface
│ └── utils.py # Shared utilities
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template
└── data/ # Generated at runtime
├── audio/ # Downloaded MP3 files + metadata
├── transcripts/ # Whisper transcription JSON files
└── index/ # ChromaDB persistent vector database
```
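`main.py` dispatches to the package modules via subcommands. A minimal sketch of that dispatch pattern with `argparse` (the handler wiring and some flag defaults here are illustrative, not copied from the repository):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Top-level parser with one sub-parser per pipeline stage,
    # mirroring the `python main.py <subcommand>` usage shown below.
    parser = argparse.ArgumentParser(prog="main.py", description="YouTube Audio QA pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    download = sub.add_parser("download", help="YouTube URLs -> MP3 + metadata")
    download.add_argument("--url")
    download.add_argument("--urls-file")
    download.add_argument("--playlist-url")

    transcribe = sub.add_parser("transcribe", help="MP3 -> JSON transcripts")
    transcribe.add_argument("--model-size", default="small.en")
    transcribe.add_argument("--device", default="auto")

    ask = sub.add_parser("ask", help="Query the vector database")
    ask.add_argument("--question", required=True)
    ask.add_argument("--top-k", type=int, default=5)

    sub.add_parser("stats", help="Show database statistics")
    return parser

args = build_parser().parse_args(["ask", "--question", "What is being said?", "--top-k", "10"])
print(args.command, args.top_k)  # ask 10
```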
**ALWAYS follow this exact sequence:**
```bash
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows (PowerShell)
source .venv/bin/activate      # macOS/Linux
pip install -r requirements.txt
# If the install times out on a slow network, retry with a longer timeout:
pip install --timeout 1000 -r requirements.txt
cp .env.example .env
```
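MP3 extraction with yt-dlp relies on an `ffmpeg` binary on PATH, and the project targets Python 3.12+. A small preflight check can catch both before the first download (this helper is illustrative, not part of the repository):

```python
import shutil
import sys

def preflight() -> list[str]:
    # Collect human-readable problems instead of failing on the first one.
    problems = []
    if sys.version_info < (3, 12):
        problems.append(f"Python 3.12+ required, found {sys.version.split()[0]}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (yt-dlp needs it for MP3 extraction)")
    return problems

for problem in preflight():
    print("WARNING:", problem)
```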
**Known Installation Issues**: common problems (network timeouts, CUDA errors, missing `.env`) are covered in the troubleshooting notes further down.

Test that the CLI is working:
```bash
python main.py --help
python main.py stats
```
Test the full pipeline with a short video:
```bash
python main.py download --url "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
python main.py transcribe --model-size small.en
python main.py index
python main.py ask --question "What is being said?"
python main.py stats
```
**Expected Timing**: first runs are slower than subsequent ones because Whisper and embedding models download on demand; transcription time grows with audio length and model size.
All commands are accessed via `python main.py <subcommand>`:
**download**: YouTube URLs → MP3 + metadata
```bash
python main.py download --url "https://youtube.com/watch?v=..."
python main.py download --urls-file urls.txt
python main.py download --playlist-url "https://youtube.com/playlist?list=..."
```
**transcribe**: MP3 → JSON transcripts
```bash
python main.py transcribe --model-size small.en # Fast, English-only
python main.py transcribe --model-size medium # Balanced
python main.py transcribe --model-size large-v3 --device cpu # Best quality
```
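The transcript JSON files carry the per-segment timestamps that later become citations. The exact schema isn't documented here; assuming segments with `start`/`end` seconds and `text`, a citation line could be rendered like this (sketch only, with a hypothetical transcript shape):

```python
def fmt_ts(seconds: float) -> str:
    # Render seconds as H:MM:SS for human-readable citations.
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"

# Hypothetical transcript shape -- the real files in data/transcripts/ may differ.
transcript = {
    "video_id": "dQw4w9WgXcQ",
    "segments": [
        {"start": 0.0, "end": 4.2, "text": "Never gonna give you up"},
        {"start": 4.2, "end": 8.0, "text": "Never gonna let you down"},
    ],
}

for seg in transcript["segments"]:
    print(f"[{fmt_ts(seg['start'])}-{fmt_ts(seg['end'])}] {seg['text']}")
# [0:00:00-0:00:04] Never gonna give you up
# [0:00:04-0:00:08] Never gonna let you down
```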
**index**: Transcripts → vector database
```bash
python main.py index
python main.py index --collection custom_name
python main.py index --embedding-model "sentence-transformers/all-MiniLM-L6-v2"
```
**ask**: Query vector database
```bash
python main.py ask --question "What did they say about X?"
python main.py ask --question "Summary please" --top-k 10
```
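`--top-k` controls how many transcript chunks the semantic search returns for the question. Conceptually it is a similarity ranking; a toy sketch with made-up two-dimensional vectors (not the ChromaDB call shadowchat actually makes):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, chunks, k):
    # chunks: list of (embedding, text); rank by similarity, keep the best k.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]

chunks = [
    ([1.0, 0.0], "chunk about topic A"),
    ([0.9, 0.1], "another chunk about topic A"),
    ([0.0, 1.0], "chunk about topic B"),
]
print(top_k([1.0, 0.0], chunks, k=2))
# ['chunk about topic A', 'another chunk about topic A']
```

A larger `--top-k` (as in the "Summary please" example) gives the answer step more context at the cost of more noise.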
**chat**: Interactive Q&A mode
```bash
python main.py chat
```
**stats**: Show database statistics
```bash
python main.py stats
```
The system processes data through these stages:
1. **Input**: YouTube URLs or playlists
2. **Download**: yt-dlp extracts audio → `data/audio/{video_id}.mp3` + `{video_id}.info.json`
3. **Transcribe**: faster-whisper → `data/transcripts/{video_id}.json` (with timestamps)
4. **Index**: sentence-transformers embeddings → `data/index/` (ChromaDB persistent store)
5. **Query**: Natural language → semantic search with timestamp citations
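Between stages 3 and 4, segment text is typically grouped into larger chunks so each embedding covers enough context while keeping a start timestamp for citations. How shadowchat chunks is not specified above; one plausible approach, sketched:

```python
def chunk_segments(segments, max_chars=200):
    # Greedily merge consecutive segments until max_chars is reached,
    # remembering the first segment's start time for the citation.
    chunks, buf, start = [], [], None
    for seg in segments:
        if start is None:
            start = seg["start"]
        buf.append(seg["text"])
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"start": start, "text": " ".join(buf)})
            buf, start = [], None
    if buf:
        chunks.append({"start": start, "text": " ".join(buf)})
    return chunks

# Six synthetic 55-character segments, one per second of audio.
segments = [{"start": float(i), "text": f"sentence {i} " * 5} for i in range(6)]
print([c["start"] for c in chunk_segments(segments, max_chars=120)])
# [0.0, 3.0]
```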
**Troubleshooting**:
1. **Network timeouts during pip install**: Retry with `pip install --timeout 1000 -r requirements.txt`
2. **CUDA memory errors**: Use `--device cpu` flag for transcription
3. **Permission errors on data/**: Ensure write permissions to repository directory
4. **Missing .env**: Copy `.env.example` to `.env` (LLM features won't work without API key)
5. **Model download failures**: Whisper models auto-download on first use - ensure internet connection
**Common Pitfalls**:
1. **Memory issues**: Large Whisper models may OOM on smaller systems - use smaller model sizes
2. **Path issues**: Windows path separators, permission errors on data/ directory
3. **API key missing**: OpenAI features fail silently without proper .env setup
4. **Model mismatch**: Changing embedding models requires full re-indexing
5. **Disk space**: Audio files, models, and vector databases consume significant storage
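To avoid pitfall 3 (OpenAI features failing silently), the key check can be made loud. A minimal stdlib sketch of reading a `.env` file and raising on a missing key (the real project may handle this differently, e.g. via python-dotenv):

```python
import os

def load_env(path=".env"):
    # Parse simple KEY=VALUE lines into os.environ without overriding existing vars.
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        print(f"WARNING: {path} not found -- copy .env.example to .env")

def require_api_key():
    # Fail loudly instead of letting LLM features degrade silently.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM features are disabled")
    return key
```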
**Example: single video, end to end**
```bash
python main.py download --url "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
python main.py transcribe --model-size small.en
python main.py index
python main.py ask --question "What is the main topic?"
```
**Example: playlist into a named collection**
```bash
python main.py download --playlist-url "https://www.youtube.com/playlist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf"
python main.py transcribe --model-size medium
python main.py index --collection my_playlist
python main.py chat
```
**Example: batch of URLs from a file**
```bash
echo "https://www.youtube.com/watch?v=..." > urls.txt
echo "https://www.youtube.com/watch?v=..." >> urls.txt
python main.py download --urls-file urls.txt
python main.py transcribe --model-size small.en
python main.py index
python main.py ask --question "Compare the topics across videos" --top-k 20
```
When making changes to the codebase:
1. **Set up environment** (venv + pip install)
2. **Test with single short video** end-to-end
3. **Verify each command** works independently
4. **Use `stats` command** to validate database state
5. **For code changes**: Test full cycle (download → transcribe → index → query)
6. **Check data/ directory** structure matches expectations
7. **Monitor console output** - Rich CLI provides detailed progress information