Instructions for working with VoiceMoodAnalyzer, a containerized multi-stage AI pipeline that analyzes emotional state from voice recordings using local Whisper.cpp, Wav2Vec2, and DistilRoBERTa models.
VoiceMoodAnalyzer combines three local AI models in a four-stage pipeline:
1. **Audio Transcription**: Whisper.cpp (small model, ~466MB, 6x realtime, no internet required)
2. **Audio Emotion Detection**: Wav2Vec2 model (97.5% accuracy, 7 emotions)
3. **Text Sentiment Analysis**: DistilRoBERTa model
4. **Emotion Fusion**: Database-driven matrix lookup
All models run locally with no external API dependencies. The system is fully containerized with Docker for Azure VM deployment.
**Tech Stack:** FastAPI (Python) backend, TypeScript frontend served via nginx, PostgreSQL, Docker Compose.
**Database Connection:** PostgreSQL at `localhost:5436`, user `postgres`, database `mito_books`.
1. **Docker environment (primary method):**
```bash
# Start all services (backend, frontend)
docker-compose up -d --build
# View logs
docker-compose logs -f
docker-compose logs -f backend
docker-compose logs -f frontend
```
2. **Database initialization (manual, one-time):**
```bash
psql -h localhost -p 5436 -U postgres -d mito_books -f db/init/01-init-tables.sql
psql -h localhost -p 5436 -U postgres -d mito_books -f db/init/02-seed-fusion-matrix.sql
```
3. **First run behavior:**
- Hugging Face models download (~2GB) on startup
- Whisper.cpp model downloads (~466MB) on first transcription
- Total: ~2.5GB, takes 10-15 minutes initially
- Subsequent starts: <30 seconds
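Since startup time depends on whether the models are already cached, a quick way to check is to inspect the local Hugging Face cache. This is a sketch that assumes the default cache location (`~/.cache/huggingface/hub`) and the standard `models--<org>--<name>` directory naming:

```python
from pathlib import Path

def cached_models(cache_dir: str = "~/.cache/huggingface/hub") -> list[str]:
    """List model repos already present in the local Hugging Face cache."""
    root = Path(cache_dir).expanduser()
    if not root.exists():
        return []
    # Cached repos are stored as directories named models--<org>--<name>
    return sorted(
        p.name.removeprefix("models--").replace("--", "/")
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith("models--")
    )
```

If both emotion models appear in the output, the slow first-run download has already happened and restarts should be fast.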
**Restart services:**
```bash
docker-compose restart backend
docker-compose restart frontend
```
**Full reset (including database):**
```bash
docker-compose down -v
docker-compose up -d --build
```
**Rebuild after code changes:**
```bash
docker-compose up -d --build
```
**Database operations:**
```bash
docker-compose exec postgres psql -U postgres -d mito_books
docker-compose exec postgres pg_dump -U postgres mito_books > backup.sql
docker-compose exec -T postgres psql -U postgres mito_books < backup.sql
```
**Testing API:**
```bash
./test_api.sh
curl http://localhost:8000/
curl http://localhost:8000/api/matrix
curl -X POST http://localhost:8000/api/analyze -F "[email protected]"
```
**Backend:**
```bash
cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```
**Frontend:**
```bash
cd frontend
npm install
npm run dev # Port 3000
npm run build
npm run preview
```
The analysis follows this sequence in `backend/app.py`:
1. **Upload & Validation** (`app.py:analyze_voice`)
- Validate <25MB, formats: .wav, .mp3, .m4a, .ogg, .flac, .webm
- Save to temp file
2. **Whisper Transcription** (`services/whisper_cpp_service.py`)
- Local transcription, no internet, 6x realtime
- Returns: `transcribed_text: str`
3. **Audio Emotion Detection** (`services/audio_emotion.py`)
- **Duration-based**: Only for recordings ≤15 seconds
- Recordings >15s: Skip (default "neutral" with 0.0 confidence)
- Model: `r-f/wav2vec-english-speech-emotion-recognition` (97.5% accuracy)
- Resample to 16kHz mono
- Returns: `audio_emotion: str`, `audio_confidence: float`
- 7 emotions: angry, disgust, fear, happy, neutral, sad, surprise
4. **Text Emotion Detection** (`services/text_emotion.py`)
- Model: `j-hartmann/emotion-english-distilroberta-base`
- Tokenize, truncate to 512 tokens
- Returns: `text_emotion: str`, `text_confidence: float`
5. **Emotion Fusion** (`services/fusion_service.py`)
- Database lookup in `voice_matrix` table
- Composite key: (audio_emotion, text_emotion)
- Fallback: neutral+neutral → "Unknown"
- Returns: `{final_mood, emoji, description}`
6. **Database Persistence** (`models/voice_analysis.py`)
- Save to `voice_analysis` table (append-only audit)
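The six steps above can be sketched as orchestration logic. This is a simplified sketch with the service calls injected as plain functions; the real implementations live in `backend/services/` and `backend/app.py`:

```python
def analyze_voice(audio_path: str, duration_s: float,
                  transcribe, detect_audio_emotion, detect_text_emotion, fuse):
    """Simplified orchestration of the analysis pipeline (services injected)."""
    text = transcribe(audio_path)                       # Whisper.cpp, local
    if duration_s <= 15:                                # audio emotion only for short clips
        audio_emotion, audio_conf = detect_audio_emotion(audio_path)
    else:
        audio_emotion, audio_conf = "neutral", 0.0      # skipped for >15s recordings
    text_emotion, text_conf = detect_text_emotion(text)
    result = fuse(audio_emotion, text_emotion)          # voice_matrix lookup
    return {"transcribed_text": text,
            "audio_emotion": audio_emotion, "audio_confidence": audio_conf,
            "text_emotion": text_emotion, "text_confidence": text_conf,
            **result}
```

Note how the 15-second gate short-circuits the audio model entirely rather than truncating the input.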
**Singleton Pattern (memory efficiency):**
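The pattern above means each model is loaded once per process and reused across requests, since loading Wav2Vec2 or DistilRoBERTa repeatedly would exhaust memory. A minimal sketch (illustrative; the actual class and function names in the services may differ):

```python
from functools import lru_cache

class AudioEmotionService:
    """Wraps the Wav2Vec2 model; construction is expensive, so it happens once."""
    def __init__(self):
        self.model_name = "r-f/wav2vec-english-speech-emotion-recognition"
        # self.model = load_model(self.model_name)  # heavy load happens here, once

@lru_cache(maxsize=1)
def get_audio_emotion_service() -> AudioEmotionService:
    """Return the shared service instance (created lazily on first call)."""
    return AudioEmotionService()
```

Every caller of `get_audio_emotion_service()` receives the same instance, so the model weights are held in memory exactly once.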
**Adding Fusion Matrix Entries:**
1. Edit `db/init/02-seed-fusion-matrix.sql`
2. Add INSERT with new (audio_emotion, text_emotion) pair:
```sql
INSERT INTO voice_matrix (audio_emotion, text_emotion, final_mood, emoji, description) VALUES
('happy', 'excited', 'Extremely Enthusiastic', '🤩', 'High energy and excitement.');
```
3. Rebuild so the seed re-runs: `docker-compose down -v && docker-compose up -d` (note: `-v` also deletes existing analysis data)
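The lookup that this seed data feeds can be sketched in plain Python. This is an illustrative in-memory stand-in for the `voice_matrix` table (the real `fusion_service.py` queries the database; the neutral+neutral fallback row and its description text are assumptions here):

```python
# In-memory stand-in for the voice_matrix table, keyed by (audio_emotion, text_emotion)
VOICE_MATRIX = {
    ("happy", "excited"): {"final_mood": "Extremely Enthusiastic", "emoji": "🤩",
                           "description": "High energy and excitement."},
    ("neutral", "neutral"): {"final_mood": "Unknown", "emoji": "❓",
                             "description": "No clear emotional signal."},
}

def fuse(audio_emotion: str, text_emotion: str) -> dict:
    """Composite-key lookup, falling back to the neutral+neutral row."""
    return VOICE_MATRIX.get((audio_emotion, text_emotion),
                            VOICE_MATRIX[("neutral", "neutral")])
```

This shows why every (audio, text) pair you expect must be seeded: anything missing collapses to the fallback row.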
To swap in an alternative model, change `self.model_name` in the relevant service (example alternatives shown):

**Audio Model** (`services/audio_emotion.py`):
```python
self.model_name = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
```
**Text Model** (`services/text_emotion.py`):
```python
self.model_name = "cardiffnlp/twitter-roberta-base-emotion"
```
**Changing the upload size limit:** change in three places:
1. `backend/core/config.py`: `MAX_UPLOAD_SIZE = 25 * 1024 * 1024`
2. `frontend/nginx.conf`: `client_max_body_size 25M;`
3. `frontend/src/components/FileUploader.tsx`: Update UI text
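The backend side of this limit can be sketched as a small validation helper. This is illustrative, assuming the `MAX_UPLOAD_SIZE` constant from `backend/core/config.py`; the real check runs inside `app.py:analyze_voice`:

```python
import os

MAX_UPLOAD_SIZE = 25 * 1024 * 1024  # mirrors backend/core/config.py
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm"}

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject uploads that are too large or not in an accepted audio format."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    if size_bytes > MAX_UPLOAD_SIZE:
        raise ValueError(f"File exceeds {MAX_UPLOAD_SIZE} bytes")
```

Keep all three places in sync: if nginx rejects the body before FastAPI sees it, the user gets a generic 413 instead of the backend's error message.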
**Models downloading slowly:**
**Database connection refused:**
**CORS errors:**
**Transcription errors:**
**Requirements:**
**Production Checklist:**
1. Change `POSTGRES_PASSWORD` in `.env`
2. Set `allow_origins` in `app.py` to specific domain
3. Add SSL (Let's Encrypt), update nginx.conf
4. Set up systemd service for auto-start
5. Configure firewall (UFW)
6. Enable Docker BuildKit
7. Set up database backups (pg_dump cron)
When working with this codebase:
1. **Always use Docker** for development unless specifically asked otherwise
2. **Check logs first** when debugging: `docker-compose logs -f [service]`
3. **Remember the database is external** (not in Docker) - don't try to manage it via docker-compose
4. **Models are cached** - first run is slow, subsequent runs are fast
5. **No API keys needed** - all models run locally
6. **Audio emotion skips for >15s recordings** - this is intentional for performance
7. **Fusion matrix is database-driven** - modifications require SQL changes + rebuild
8. **Frontend changes require rebuild**: `docker-compose up -d --build frontend`
9. **Backend changes hot-reload** in development mode
10. **Always validate uploads** - max 25MB, specific formats only