# Docker Whisper Transcription Service
A fully self-contained, Docker-based transcription service that automatically converts audio files to text using OpenAI's Whisper AI model. This skill helps you set up, configure, and operate a local speech-to-text pipeline with no external API dependencies.
## What This Skill Does
This skill provides guidance for building and operating a containerized transcription service that:
- Monitors a directory for incoming audio files (MP3, WAV, M4A, etc.)
- Automatically transcribes audio using Whisper AI
- Saves results in both plain text and JSON formats
- Moves processed files to an archive directory
- Runs entirely offline after initial setup

## Project Structure
The service uses a simple directory-based architecture:
```
/docker-transcriptions/
├── docker-compose.yml        # Container orchestration
├── app/
│   └── transcribe.py         # Transcription logic
├── data/
│   ├── uploads/              # Drop audio files here
│   │   └── processed/        # Processed files archive
│   └── transcriptions/       # Output text/JSON files
└── docs/
    └── PROJECT_MAP.md        # Architecture documentation
```
## Setup Instructions

### 1. Prerequisites
Ensure the following are installed:
- Docker
- Docker Compose
- Sufficient disk space (Whisper models + audio files)
- At least 4GB of RAM recommended

### 2. Initial Configuration
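As a hedged starting point, a `docker-compose.yml` for the directory layout above might look like this (the service name, build context, and environment variable name are assumptions, not fixed by this skill):

```yaml
services:
  transcriber:
    build: ./app                # image built from a Dockerfile alongside transcribe.py
    restart: unless-stopped
    environment:
      - WHISPER_MODEL=base      # tiny / base / small / medium / large
    volumes:
      - ./data/uploads:/data/uploads
      - ./data/transcriptions:/data/transcriptions
    command: python transcribe.py
```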
The `docker-compose.yml` should define:
- A Python-based container with Whisper dependencies
- Volume mounts for `data/uploads` and `data/transcriptions`
- Environment variables for model selection (tiny/base/small/medium/large)

### 3. Transcription Script (`app/transcribe.py`)
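A hedged sketch of the core of `app/transcribe.py`, using the `openai-whisper` package's `load_model`/`transcribe` API. Paths follow the layout in this document; helper names are illustrative and error handling is trimmed for brevity:

```python
import json
import shutil
import time
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}

def save_outputs(result: dict, stem: str, out_dir: Path) -> None:
    """Write the transcript as both plain text and structured JSON."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{stem}.txt").write_text(result["text"].strip(), encoding="utf-8")
    (out_dir / f"{stem}.json").write_text(json.dumps(result, indent=2), encoding="utf-8")

def process_file(audio: Path, out_dir: Path, processed: Path, model) -> None:
    """Transcribe one file, save outputs, then archive the source file."""
    result = model.transcribe(str(audio))  # Whisper returns text, segments, language
    save_outputs(result, audio.stem, out_dir)
    processed.mkdir(parents=True, exist_ok=True)
    shutil.move(str(audio), str(processed / audio.name))

def main() -> None:
    import whisper  # imported lazily so the helpers above can be tested without it
    uploads = Path("/data/uploads")
    model = whisper.load_model("base")  # loaded once on startup
    while True:  # simple polling loop; the watchdog library is an alternative
        for f in sorted(uploads.iterdir()):
            if f.is_file() and f.suffix.lower() in AUDIO_EXTS:
                process_file(f, Path("/data/transcriptions"), uploads / "processed", model)
        time.sleep(5)
```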
The core script should:
- Watch the `uploads/` directory for new files
- Load the Whisper model on startup
- Process each audio file and generate transcriptions
- Save output in both `.txt` and `.json` formats
- Move source files to `uploads/processed/` after completion
- Handle errors gracefully and log processing status

## Usage Workflow
1. **Start the service:**
```bash
docker-compose up -d
```
2. **Add audio files:**
Copy or move audio files to `data/uploads/`
3. **Monitor progress:**
```bash
docker-compose logs -f
```
4. **Retrieve results:**
Check `data/transcriptions/` for output files
5. **Stop the service:**
```bash
docker-compose down
```
## Key Implementation Details

### File Monitoring
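One requirement worth showing concretely is the file-stability check: treat a file as complete only once its size stops changing between two polls. A minimal sketch (the interval is arbitrary):

```python
import time
from pathlib import Path

def is_stable(path: Path, interval: float = 1.0) -> bool:
    """Return True once the file's size stops changing between two polls,
    which suggests the copy or upload has finished."""
    try:
        size_before = path.stat().st_size
        time.sleep(interval)
        return path.stat().st_size == size_before
    except FileNotFoundError:
        return False  # file vanished mid-check (e.g. a partial upload was cleaned up)
```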
- Use the `watchdog` library or a simple polling loop
- Check for common audio extensions: `.mp3`, `.wav`, `.m4a`, `.flac`, `.ogg`
- Implement a file-stability check (wait for the copy to complete)

### Transcription Output
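The JSON result from Whisper's Python API typically has this shape (abbreviated; exact fields vary by version, and "confidence" is expressed through per-segment log-probabilities rather than a single score):

```json
{
  "text": "Welcome to the show...",
  "language": "en",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.2,
      "text": "Welcome to the show...",
      "avg_logprob": -0.25,
      "no_speech_prob": 0.01
    }
  ]
}
```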
- **Text format:** plain transcript in a `.txt` file
- **JSON format:** structured output with timestamps, segments, and confidence scores

### Error Handling
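Retry logic for temporary failures can be kept generic and wrapped around the transcription call; a minimal sketch:

```python
import logging
import time

def with_retries(fn, attempts: int = 3, delay: float = 2.0):
    """Call fn(), retrying up to `attempts` times on any exception.
    Re-raises the last exception once the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            logging.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)
```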
- Log unsupported file formats
- Handle corrupted audio files gracefully
- Apply retry logic for temporary failures
- Emit clear error messages in the logs

### Model Selection
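Model choice can be driven by the container environment, matching the environment-variable approach suggested in the setup section (the `WHISPER_MODEL` variable name is an assumption):

```python
import os

VALID_MODELS = {"tiny", "base", "small", "medium", "large"}

def pick_model_name(default: str = "base") -> str:
    """Read the model name from the environment, falling back to a safe default."""
    name = os.environ.get("WHISPER_MODEL", default).lower()
    return name if name in VALID_MODELS else default
```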
- **tiny:** fastest, lowest accuracy (~1GB RAM)
- **base:** good balance (~1GB RAM)
- **small:** better accuracy (~2GB RAM)
- **medium:** high accuracy (~5GB RAM)
- **large:** best accuracy (~10GB RAM)

## Development & Customization
### Modifying Transcription Behavior
Edit `app/transcribe.py` to:
- Change the Whisper model size
- Add language hints
- Customize output formats
- Implement post-processing filters

### Adjusting Container Configuration
Edit `docker-compose.yml` to:
- Allocate more or fewer resources
- Change volume mount paths
- Add environment variables
- Configure restart policies

### Testing Changes
```bash
# Rebuild and restart after code changes
docker-compose down
docker-compose build --no-cache
docker-compose up -d
```
## Deployment Considerations

### Resource Planning
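If you opt into GPU acceleration, Docker Compose can reserve an NVIDIA GPU for the service via the `deploy.resources` section (requires the NVIDIA Container Toolkit on the host; the service name here is an assumption):

```yaml
services:
  transcriber:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```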
- Larger models require more RAM and processing time
- SSD storage is recommended for faster model loading
- Consider GPU support for faster transcription (requires the NVIDIA Docker runtime)

### Data Persistence
- All data is stored in Docker volumes
- Back up the `data/` directory regularly
- Processed files remain in `uploads/processed/` until manually deleted

### Security
- The service runs entirely offline (no external API calls)
- Audio files remain on your local system
- No data is transmitted to third parties

## Troubleshooting
### Container won't start
- Check Docker logs: `docker-compose logs`
- Verify sufficient disk space and memory
- Ensure there are no port conflicts

### Files not processing
- Check file permissions on the `data/` directories
- Verify the audio format is supported
- Review logs for error messages

### Slow transcription
- Consider using a smaller Whisper model
- Check system resource usage
- Verify no other intensive processes are running

## Example Use Cases
- Transcribe podcast episodes for searchable archives
- Convert meeting recordings to text
- Generate subtitles for video content
- Process voice memos and interviews
- Create accessibility transcripts for audio content

## Integration Points
This service can be integrated with:
- File sync tools (Dropbox, Syncthing) for remote uploads
- Automation tools (cron, systemd timers) for scheduled processing
- Web interfaces for upload/download management
- Notification systems for completion alerts

## Further Enhancements
Consider adding:
- Speaker diarization (identifying different speakers)
- Language detection and multi-language support
- Batch processing with priority queues
- A web UI for file management
- An API endpoint for programmatic access
- Webhook notifications on completion
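As one sketch of the webhook enhancement, the notification payload can be built separately from the side-effecting HTTP call, which keeps it easy to test; the event and field names here are illustrative, not a fixed schema:

```python
import json
import urllib.request

def build_completion_payload(audio_name: str, txt_path: str, json_path: str) -> dict:
    """Assemble the notification body sent when a transcription finishes."""
    return {
        "event": "transcription.completed",
        "file": audio_name,
        "outputs": {"text": txt_path, "json": json_path},
    }

def notify(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to the configured webhook URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; add timeouts/retries as needed
```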