End-to-end local pipeline for downloading YouTube audio, transcribing with Whisper, indexing in vector DB, and enabling question-answering with citations and timestamps.
This skill enables GitHub Copilot to assist with **shadowchat**, a YouTube Audio QA system implementing the pipeline described above. The system runs entirely locally with no external API costs, except for the optional OpenAI LLM features.
When working on the shadowchat repository, Copilot will understand:
**Type**: Python CLI application
**Languages**: Python 3.12+
**Key Technologies**: yt-dlp, faster-whisper, ChromaDB, sentence-transformers, OpenAI API (optional), Rich CLI
```
shadowchat/
├── main.py # CLI entry point with subcommands
├── yt_audio_ai/ # Core package
│ ├── downloader.py # YouTube audio downloading (yt-dlp)
│ ├── transcriber.py # Audio → text transcription (faster-whisper)
│ ├── indexer.py # Text → vector database (ChromaDB)
│ ├── qa.py # Question answering and LLM analysis
│ ├── llm.py # OpenAI API interface
│ └── utils.py # Shared utilities
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template
└── data/ # Generated at runtime
├── audio/ # Downloaded MP3 files + metadata
├── transcripts/ # Whisper transcription JSON files
└── index/ # ChromaDB persistent vector database
```
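`main.py` dispatches to the package modules via subcommands. A minimal sketch of that dispatch pattern with `argparse` (the handler wiring and some flag defaults here are illustrative, not copied from the repository):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Top-level parser with one sub-parser per pipeline stage,
    # mirroring the `python main.py <subcommand>` usage shown below.
    parser = argparse.ArgumentParser(prog="main.py", description="YouTube Audio QA pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    download = sub.add_parser("download", help="YouTube URLs -> MP3 + metadata")
    download.add_argument("--url")
    download.add_argument("--urls-file")
    download.add_argument("--playlist-url")

    transcribe = sub.add_parser("transcribe", help="MP3 -> JSON transcripts")
    transcribe.add_argument("--model-size", default="small.en")
    transcribe.add_argument("--device", default="auto")

    ask = sub.add_parser("ask", help="Query the vector database")
    ask.add_argument("--question", required=True)
    ask.add_argument("--top-k", type=int, default=5)

    sub.add_parser("stats", help="Show database statistics")
    return parser

args = build_parser().parse_args(["ask", "--question", "What is being said?", "--top-k", "10"])
print(args.command, args.top_k)  # ask 10
```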
**ALWAYS follow this exact sequence:**
```bash
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows (PowerShell)
source .venv/bin/activate      # macOS/Linux
pip install -r requirements.txt
# If the install times out on a slow network, retry with a longer timeout:
pip install --timeout 1000 -r requirements.txt
cp .env.example .env
```
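MP3 extraction with yt-dlp relies on an `ffmpeg` binary on PATH, and the project targets Python 3.12+. A small preflight check can catch both before the first download (this helper is illustrative, not part of the repository):

```python
import shutil
import sys

def preflight() -> list[str]:
    # Collect human-readable problems instead of failing on the first one.
    problems = []
    if sys.version_info < (3, 12):
        problems.append(f"Python 3.12+ required, found {sys.version.split()[0]}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (yt-dlp needs it for MP3 extraction)")
    return problems

for problem in preflight():
    print("WARNING:", problem)
```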
**Known Installation Issues**: common problems (network timeouts, CUDA errors, missing `.env`) are covered in the troubleshooting notes further down.

Test that the CLI is working:
```bash
python main.py --help
python main.py stats
```
Test the full pipeline with a short video:
```bash
python main.py download --url "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
python main.py transcribe --model-size small.en
python main.py index
python main.py ask --question "What is being said?"
python main.py stats
```
**Expected Timing**: first runs are slower than subsequent ones because Whisper and embedding models download on demand; transcription time grows with audio length and model size.
All commands are accessed via `python main.py <subcommand>`:
**download**: YouTube URLs → MP3 + metadata
```bash
python main.py download --url "https://youtube.com/watch?v=..."
python main.py download --urls-file urls.txt
python main.py download --playlist-url "https://youtube.com/playlist?list=..."
```
**transcribe**: MP3 → JSON transcripts
```bash
python main.py transcribe --model-size small.en # Fast, English-only
python main.py transcribe --model-size medium # Balanced
python main.py transcribe --model-size large-v3 --device cpu # Best quality
```
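The transcript JSON files carry the per-segment timestamps that later become citations. The exact schema isn't documented here; assuming segments with `start`/`end` seconds and `text`, a citation line could be rendered like this (sketch only, with a hypothetical transcript shape):

```python
def fmt_ts(seconds: float) -> str:
    # Render seconds as H:MM:SS for human-readable citations.
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"

# Hypothetical transcript shape -- the real files in data/transcripts/ may differ.
transcript = {
    "video_id": "dQw4w9WgXcQ",
    "segments": [
        {"start": 0.0, "end": 4.2, "text": "Never gonna give you up"},
        {"start": 4.2, "end": 8.0, "text": "Never gonna let you down"},
    ],
}

for seg in transcript["segments"]:
    print(f"[{fmt_ts(seg['start'])}-{fmt_ts(seg['end'])}] {seg['text']}")
# [0:00:00-0:00:04] Never gonna give you up
# [0:00:04-0:00:08] Never gonna let you down
```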
**index**: Transcripts → vector database
```bash
python main.py index
python main.py index --collection custom_name
python main.py index --embedding-model "sentence-transformers/all-MiniLM-L6-v2"
```
**ask**: Query vector database
```bash
python main.py ask --question "What did they say about X?"
python main.py ask --question "Summary please" --top-k 10
```
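`--top-k` controls how many transcript chunks the semantic search returns for the question. Conceptually it is a similarity ranking; a toy sketch with made-up two-dimensional vectors (not the ChromaDB call shadowchat actually makes):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, chunks, k):
    # chunks: list of (embedding, text); rank by similarity, keep the best k.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]

chunks = [
    ([1.0, 0.0], "chunk about topic A"),
    ([0.9, 0.1], "another chunk about topic A"),
    ([0.0, 1.0], "chunk about topic B"),
]
print(top_k([1.0, 0.0], chunks, k=2))
# ['chunk about topic A', 'another chunk about topic A']
```

A larger `--top-k` (as in the "Summary please" example) gives the answer step more context at the cost of more noise.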
**chat**: Interactive Q&A mode
```bash
python main.py chat
```
**stats**: Show database statistics
```bash
python main.py stats
```
The system processes data through these stages:
1. **Input**: YouTube URLs or playlists
2. **Download**: yt-dlp extracts audio → `data/audio/{video_id}.mp3` + `{video_id}.info.json`
3. **Transcribe**: faster-whisper → `data/transcripts/{video_id}.json` (with timestamps)
4. **Index**: sentence-transformers embeddings → `data/index/` (ChromaDB persistent store)
5. **Query**: Natural language → semantic search with timestamp citations
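Between stages 3 and 4, segment text is typically grouped into larger chunks so each embedding covers enough context while keeping a start timestamp for citations. How shadowchat chunks is not specified above; one plausible approach, sketched:

```python
def chunk_segments(segments, max_chars=200):
    # Greedily merge consecutive segments until max_chars is reached,
    # remembering the first segment's start time for the citation.
    chunks, buf, start = [], [], None
    for seg in segments:
        if start is None:
            start = seg["start"]
        buf.append(seg["text"])
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"start": start, "text": " ".join(buf)})
            buf, start = [], None
    if buf:
        chunks.append({"start": start, "text": " ".join(buf)})
    return chunks

# Six synthetic 55-character segments, one per second of audio.
segments = [{"start": float(i), "text": f"sentence {i} " * 5} for i in range(6)]
print([c["start"] for c in chunk_segments(segments, max_chars=120)])
# [0.0, 3.0]
```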
**Troubleshooting**:
1. **Network timeouts during pip install**: Retry with `pip install --timeout 1000 -r requirements.txt`
2. **CUDA memory errors**: Use `--device cpu` flag for transcription
3. **Permission errors on data/**: Ensure write permissions to repository directory
4. **Missing .env**: Copy `.env.example` to `.env` (LLM features won't work without API key)
5. **Model download failures**: Whisper models auto-download on first use - ensure internet connection
**Common Pitfalls**:
1. **Memory issues**: Large Whisper models may OOM on smaller systems - use smaller model sizes
2. **Path issues**: Windows path separators, permission errors on data/ directory
3. **API key missing**: OpenAI features fail silently without proper .env setup
4. **Model mismatch**: Changing embedding models requires full re-indexing
5. **Disk space**: Audio files, models, and vector databases consume significant storage
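To avoid pitfall 3 (OpenAI features failing silently), the key check can be made loud. A minimal stdlib sketch of reading a `.env` file and raising on a missing key (the real project may handle this differently, e.g. via python-dotenv):

```python
import os

def load_env(path=".env"):
    # Parse simple KEY=VALUE lines into os.environ without overriding existing vars.
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        print(f"WARNING: {path} not found -- copy .env.example to .env")

def require_api_key():
    # Fail loudly instead of letting LLM features degrade silently.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM features are disabled")
    return key
```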
**Example: single video, end to end**
```bash
python main.py download --url "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
python main.py transcribe --model-size small.en
python main.py index
python main.py ask --question "What is the main topic?"
```
**Example: playlist into a named collection**
```bash
python main.py download --playlist-url "https://www.youtube.com/playlist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf"
python main.py transcribe --model-size medium
python main.py index --collection my_playlist
python main.py chat
```
**Example: batch of URLs from a file**
```bash
echo "https://www.youtube.com/watch?v=..." > urls.txt
echo "https://www.youtube.com/watch?v=..." >> urls.txt
python main.py download --urls-file urls.txt
python main.py transcribe --model-size small.en
python main.py index
python main.py ask --question "Compare the topics across videos" --top-k 20
```
When making changes to the codebase:
1. **Set up environment** (venv + pip install)
2. **Test with single short video** end-to-end
3. **Verify each command** works independently
4. **Use `stats` command** to validate database state
5. **For code changes**: Test full cycle (download → transcribe → index → query)
6. **Check data/ directory** structure matches expectations
7. **Monitor console output** - Rich CLI provides detailed progress information