Instructions for working with VoiceMoodAnalyzer, a containerized multi-stage AI pipeline that analyzes emotional state from voice recordings using local Whisper.cpp, Wav2Vec2, and DistilRoBERTa models.
VoiceMoodAnalyzer combines three local AI models in a four-stage pipeline:
1. **Audio Transcription**: Whisper.cpp (small model, ~466MB, 6x realtime, no internet required)
2. **Audio Emotion Detection**: Wav2Vec2 model (97.5% accuracy, 7 emotions)
3. **Text Sentiment Analysis**: DistilRoBERTa model
4. **Emotion Fusion**: Database-driven matrix lookup
All models run locally with no external API dependencies. The system is fully containerized with Docker for Azure VM deployment.
**Tech Stack:** FastAPI (Python) backend, TypeScript frontend served via nginx, PostgreSQL, Docker Compose.
**Database Connection:** PostgreSQL at `localhost:5436`, user `postgres`, database `mito_books`.
1. **Docker environment (primary method):**
```bash
# Start all services (backend, frontend)
docker-compose up -d --build
# View logs
docker-compose logs -f
docker-compose logs -f backend
docker-compose logs -f frontend
```
2. **Database initialization (manual, one-time):**
```bash
psql -h localhost -p 5436 -U postgres -d mito_books -f db/init/01-init-tables.sql
psql -h localhost -p 5436 -U postgres -d mito_books -f db/init/02-seed-fusion-matrix.sql
```
3. **First run behavior:**
- Hugging Face models download (~2GB) on startup
- Whisper.cpp model downloads (~466MB) on first transcription
- Total: ~2.5GB, takes 10-15 minutes initially
- Subsequent starts: <30 seconds
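Since startup time depends on whether the models are already cached, a quick way to check is to inspect the local Hugging Face cache. This is a sketch that assumes the default cache location (`~/.cache/huggingface/hub`) and the standard `models--<org>--<name>` directory naming:

```python
from pathlib import Path

def cached_models(cache_dir: str = "~/.cache/huggingface/hub") -> list[str]:
    """List model repos already present in the local Hugging Face cache."""
    root = Path(cache_dir).expanduser()
    if not root.exists():
        return []
    # Cached repos are stored as directories named models--<org>--<name>
    return sorted(
        p.name.removeprefix("models--").replace("--", "/")
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith("models--")
    )
```

If both emotion models appear in the output, the slow first-run download has already happened and restarts should be fast.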
**Restart services:**
```bash
docker-compose restart backend
docker-compose restart frontend
```
**Full reset (including database):**
```bash
docker-compose down -v
docker-compose up -d --build
```
**Rebuild after code changes:**
```bash
docker-compose up -d --build
```
**Database operations:**
```bash
docker-compose exec postgres psql -U postgres -d mito_books
docker-compose exec postgres pg_dump -U postgres mito_books > backup.sql
docker-compose exec -T postgres psql -U postgres mito_books < backup.sql
```
**Testing API:**
```bash
./test_api.sh
curl http://localhost:8000/
curl http://localhost:8000/api/matrix
curl -X POST http://localhost:8000/api/analyze -F "[email protected]"
```
**Backend:**
```bash
cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```
**Frontend:**
```bash
cd frontend
npm install
npm run dev # Port 3000
npm run build
npm run preview
```
The analysis follows this sequence in `backend/app.py`:
1. **Upload & Validation** (`app.py:analyze_voice`)
- Validate <25MB, formats: .wav, .mp3, .m4a, .ogg, .flac, .webm
- Save to temp file
2. **Whisper Transcription** (`services/whisper_cpp_service.py`)
- Local transcription, no internet, 6x realtime
- Returns: `transcribed_text: str`
3. **Audio Emotion Detection** (`services/audio_emotion.py`)
- **Duration-based**: Only for recordings ≤15 seconds
- Recordings >15s: Skip (default "neutral" with 0.0 confidence)
- Model: `r-f/wav2vec-english-speech-emotion-recognition` (97.5% accuracy)
- Resample to 16kHz mono
- Returns: `audio_emotion: str`, `audio_confidence: float`
- 7 emotions: angry, disgust, fear, happy, neutral, sad, surprise
4. **Text Emotion Detection** (`services/text_emotion.py`)
- Model: `j-hartmann/emotion-english-distilroberta-base`
- Tokenize, truncate to 512 tokens
- Returns: `text_emotion: str`, `text_confidence: float`
5. **Emotion Fusion** (`services/fusion_service.py`)
- Database lookup in `voice_matrix` table
- Composite key: (audio_emotion, text_emotion)
- Fallback: neutral+neutral → "Unknown"
- Returns: `{final_mood, emoji, description}`
6. **Database Persistence** (`models/voice_analysis.py`)
- Save to `voice_analysis` table (append-only audit)
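The six steps above can be sketched as orchestration logic. This is a simplified sketch with the service calls injected as plain functions; the real implementations live in `backend/services/` and `backend/app.py`:

```python
def analyze_voice(audio_path: str, duration_s: float,
                  transcribe, detect_audio_emotion, detect_text_emotion, fuse):
    """Simplified orchestration of the analysis pipeline (services injected)."""
    text = transcribe(audio_path)                       # Whisper.cpp, local
    if duration_s <= 15:                                # audio emotion only for short clips
        audio_emotion, audio_conf = detect_audio_emotion(audio_path)
    else:
        audio_emotion, audio_conf = "neutral", 0.0      # skipped for >15s recordings
    text_emotion, text_conf = detect_text_emotion(text)
    result = fuse(audio_emotion, text_emotion)          # voice_matrix lookup
    return {"transcribed_text": text,
            "audio_emotion": audio_emotion, "audio_confidence": audio_conf,
            "text_emotion": text_emotion, "text_confidence": text_conf,
            **result}
```

Note how the 15-second gate short-circuits the audio model entirely rather than truncating the input.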
**Singleton Pattern (memory efficiency):**
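The pattern above means each model is loaded once per process and reused across requests, since loading Wav2Vec2 or DistilRoBERTa repeatedly would exhaust memory. A minimal sketch (illustrative; the actual class and function names in the services may differ):

```python
from functools import lru_cache

class AudioEmotionService:
    """Wraps the Wav2Vec2 model; construction is expensive, so it happens once."""
    def __init__(self):
        self.model_name = "r-f/wav2vec-english-speech-emotion-recognition"
        # self.model = load_model(self.model_name)  # heavy load happens here, once

@lru_cache(maxsize=1)
def get_audio_emotion_service() -> AudioEmotionService:
    """Return the shared service instance (created lazily on first call)."""
    return AudioEmotionService()
```

Every caller of `get_audio_emotion_service()` receives the same instance, so the model weights are held in memory exactly once.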
**Adding Fusion Matrix Entries:**
1. Edit `db/init/02-seed-fusion-matrix.sql`
2. Add INSERT with new (audio_emotion, text_emotion) pair:
```sql
INSERT INTO voice_matrix (audio_emotion, text_emotion, final_mood, emoji, description) VALUES
('happy', 'excited', 'Extremely Enthusiastic', '🤩', 'High energy and excitement.');
```
3. Rebuild so the seed re-runs: `docker-compose down -v && docker-compose up -d` (note: `-v` also deletes existing analysis data)
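The lookup that this seed data feeds can be sketched in plain Python. This is an illustrative in-memory stand-in for the `voice_matrix` table (the real `fusion_service.py` queries the database; the neutral+neutral fallback row and its description text are assumptions here):

```python
# In-memory stand-in for the voice_matrix table, keyed by (audio_emotion, text_emotion)
VOICE_MATRIX = {
    ("happy", "excited"): {"final_mood": "Extremely Enthusiastic", "emoji": "🤩",
                           "description": "High energy and excitement."},
    ("neutral", "neutral"): {"final_mood": "Unknown", "emoji": "❓",
                             "description": "No clear emotional signal."},
}

def fuse(audio_emotion: str, text_emotion: str) -> dict:
    """Composite-key lookup, falling back to the neutral+neutral row."""
    return VOICE_MATRIX.get((audio_emotion, text_emotion),
                            VOICE_MATRIX[("neutral", "neutral")])
```

This shows why every (audio, text) pair you expect must be seeded: anything missing collapses to the fallback row.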
To swap in an alternative model, change `self.model_name` in the relevant service (example alternatives shown):

**Audio Model** (`services/audio_emotion.py`):
```python
self.model_name = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
```
**Text Model** (`services/text_emotion.py`):
```python
self.model_name = "cardiffnlp/twitter-roberta-base-emotion"
```
**Changing the upload size limit:** change in three places:
1. `backend/core/config.py`: `MAX_UPLOAD_SIZE = 25 * 1024 * 1024`
2. `frontend/nginx.conf`: `client_max_body_size 25M;`
3. `frontend/src/components/FileUploader.tsx`: Update UI text
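The backend side of this limit can be sketched as a small validation helper. This is illustrative, assuming the `MAX_UPLOAD_SIZE` constant from `backend/core/config.py`; the real check runs inside `app.py:analyze_voice`:

```python
import os

MAX_UPLOAD_SIZE = 25 * 1024 * 1024  # mirrors backend/core/config.py
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm"}

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject uploads that are too large or not in an accepted audio format."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    if size_bytes > MAX_UPLOAD_SIZE:
        raise ValueError(f"File exceeds {MAX_UPLOAD_SIZE} bytes")
```

Keep all three places in sync: if nginx rejects the body before FastAPI sees it, the user gets a generic 413 instead of the backend's error message.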
**Models downloading slowly:**
**Database connection refused:**
**CORS errors:**
**Transcription errors:**
**Requirements:**
**Production Checklist:**
1. Change `POSTGRES_PASSWORD` in `.env`
2. Set `allow_origins` in `app.py` to specific domain
3. Add SSL (Let's Encrypt), update nginx.conf
4. Set up systemd service for auto-start
5. Configure firewall (UFW)
6. Enable Docker BuildKit
7. Set up database backups (pg_dump cron)
When working with this codebase:
1. **Always use Docker** for development unless specifically asked otherwise
2. **Check logs first** when debugging: `docker-compose logs -f [service]`
3. **Remember the database is external** (not in Docker) - don't try to manage it via docker-compose
4. **Models are cached** - first run is slow, subsequent runs are fast
5. **No API keys needed** - all models run locally
6. **Audio emotion skips for >15s recordings** - this is intentional for performance
7. **Fusion matrix is database-driven** - modifications require SQL changes + rebuild
8. **Frontend changes require rebuild**: `docker-compose up -d --build frontend`
9. **Backend changes hot-reload** in development mode
10. **Always validate uploads** - max 25MB, specific formats only