# vLLM Mixedbread Reranker Deployment
Deploy and test the `mixedbread-ai/mxbai-rerank-base-v2` reranker model using a vLLM GPU server, ONNX CPU inference, or a local Python wrapper. Supports Modal cloud deployment, a Docker GPU server, and async benchmarking.
## Overview
This skill guides you through deploying and testing a Mixedbread reranker with multiple deployment paths:
- **Local testing**: Python wrapper with device selection (CPU/GPU/MPS)
- **vLLM GPU**: Docker container or Modal cloud deployment
- **ONNX CPU**: Quantized int8 model for CPU-only Modal deployment
- **Benchmarking**: Async throughput testing with BEIR datasets

All commands use `uv run` for ad-hoc dependency injection without lock files.
## Key Commands
### Local Testing
Test the reranker locally using the Python wrapper:
```bash
# Basic test with custom query and docs
make local-run QUERY="your query" DOCS=data/example_docs.json

# Apple Silicon GPU (MPS)
make local-run-mps

# CPU only
make local-run-cpu

# Batch test multiple cases
make batch-run
```
The local runner uses the `mxbai_rerank` package and supports automatic device selection.
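As a reference for what the wrapper does under the hood, here is a minimal sketch using the `mxbai_rerank` package (following the usage shown on the model card; `src/run_local.py` adds CLI handling and device selection on top of this):

```python
# Minimal local rerank sketch using the mxbai_rerank package (not src/run_local.py itself).
# Run ad-hoc with: uv run --with mxbai-rerank --with torch python sketch.py
from mxbai_rerank import MxbaiRerankV2

# Load the reranker; a device can be passed explicitly (e.g. device="mps" or "cpu").
model = MxbaiRerankV2("mixedbread-ai/mxbai-rerank-base-v2")

query = "Who wrote 'To Kill a Mockingbird'?"
documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960.",
    "The novel 'Moby-Dick' was written by Herman Melville.",
    "Harper Lee was an American novelist born in Monroeville, Alabama.",
]

# rank() scores each document against the query and returns the top_k results.
results = model.rank(query, documents, return_documents=True, top_k=3)
for result in results:
    print(result)
```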
### vLLM Docker (GPU Required)
Run vLLM reranker server in Docker with GPU acceleration:
```bash
# Start vLLM server
make vllm-up-docker

# Check server health
make vllm-health

# Test /v1/rerank endpoint
make vllm-client

# Verify Docker GPU access
make gpu-check
```
**Requirements**: NVIDIA GPU + NVIDIA Container Toolkit. Not supported on macOS/Windows.
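If you want to hit the endpoint without the make target, a stripped-down client can look like this (a sketch, not `src/client_vllm_rerank.py`; the request/response fields follow vLLM's `/v1/rerank` API, and the URL and example inputs are placeholders):

```python
# Minimal client for the vLLM /v1/rerank endpoint (sketch; adjust URL/model for your setup).
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default vLLM port
payload = {
    "model": "mixedbread-ai/mxbai-rerank-base-v2",
    "query": "What is the capital of France?",
    "documents": [
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
    ],
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/rerank",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# Each result carries the original document index and a relevance score.
for result in body["results"]:
    print(result["index"], result["relevance_score"])
```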
### ONNX CPU Deployment
Export model to ONNX format and apply int8 quantization:
```bash
# Export to ONNX
uv run --with optimum[onnxruntime] --with transformers --with torch \
  python scripts/export_onnx.py --model-id mixedbread-ai/mxbai-rerank-base-v2 --out-dir onnx/mxbai-base

# Quantize to int8
uv run --with onnxruntime --with onnxruntime-tools \
  python scripts/quantize_onnx.py --model-path onnx/mxbai-base/model.onnx --out-path onnx/mxbai-base/model-int8.onnx

# Run local ONNX inference
uv run --with onnxruntime --with transformers \
  python src/run_onnx.py --query "your query" --docs-file data/example_docs.json --model-dir onnx/mxbai-base --model-file model-int8.onnx --top-k 3
```
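The two scripts are likely thin wrappers around Optimum's ONNX export and onnxruntime's dynamic quantization; a hedged sketch of the same two steps (the actual scripts may choose different export classes, options, and file names):

```python
# Sketch of ONNX export + int8 dynamic quantization (not the repo's scripts verbatim).
from pathlib import Path

from optimum.onnxruntime import ORTModelForCausalLM  # assumption: causal-LM export so token logits are available
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import AutoTokenizer

model_id = "mixedbread-ai/mxbai-rerank-base-v2"
out_dir = Path("onnx/mxbai-base")

# Export the HF checkpoint to ONNX alongside its tokenizer files.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)

# Dynamic int8 quantization: weights are quantized offline, activations at runtime.
quantize_dynamic(
    model_input=out_dir / "model.onnx",
    model_output=out_dir / "model-int8.onnx",
    weight_type=QuantType.QInt8,
)
```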
### Modal Cloud Deployment
Deploy to Modal for serverless inference:
```bash
# vLLM GPU deployment (dev server)
make modal-serve

# Deploy to production
make modal-deploy

# ONNX CPU deployment
make modal-serve-onnx
make modal-deploy-onnx
```
**Environment toggle**:
- `FAST_BOOT=true`: Faster cold starts (disables compilation)
- `FAST_BOOT=false` (default): Better throughput with compiler optimizations

```bash
FAST_BOOT=true uvx modal serve modal_app.py
```
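Inside the Modal app, a toggle like this typically just flips vLLM's eager mode; a sketch of the assumed wiring (the real `modal_app.py` may map `FAST_BOOT` onto different flags):

```python
# Sketch: mapping a FAST_BOOT env toggle onto vLLM server flags (assumed wiring).
import os

FAST_BOOT = os.environ.get("FAST_BOOT", "false").lower() == "true"

cmd = [
    "vllm", "serve", "mixedbread-ai/mxbai-rerank-base-v2",
    "--task", "score",
    "--port", "8000",
]
if FAST_BOOT:
    # Skip CUDA graph capture / compilation: faster cold start, lower steady-state throughput.
    cmd.append("--enforce-eager")
```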
### Benchmarking
Test reranker performance with BEIR datasets:
```bash
# Build test dataset (scifact example)
make bench-build BENCH_DATASET=scifact BENCH_LIMIT=100

# Run async benchmark
make bench-run BENCH_URL=https://<modal-url> CONCURRENCY=16 BENCH_METRICS=1
```
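The async benchmark boils down to a semaphore-bounded fan-out of rerank requests; a simplified sketch of the pattern (assuming an httpx client and a `{query, documents}` case format; `bench/bench_rerank_async.py` adds latency percentiles, metrics output, and dataset loading):

```python
# Simplified sketch of semaphore-bounded async rerank benchmarking (not the repo's script).
import asyncio
import json
import time

import httpx

async def bench(url: str, cases: list[dict], concurrency: int = 16) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def one(client: httpx.AsyncClient, case: dict) -> float:
        payload = {
            "model": "mixedbread-ai/mxbai-rerank-base-v2",
            "query": case["query"],
            "documents": case["documents"],
        }
        async with sem:
            t0 = time.perf_counter()
            resp = await client.post(f"{url}/v1/rerank", json=payload, timeout=60.0)
            resp.raise_for_status()
            return time.perf_counter() - t0

    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(*(one(client, c) for c in cases))
    print(f"requests={len(latencies)} mean_latency_s={sum(latencies) / len(latencies):.3f}")

if __name__ == "__main__":
    # Assumption: the built dataset is a JSON list of {"query": ..., "documents": [...]} cases.
    cases = json.load(open("data/bench/scifact.json"))
    asyncio.run(bench("http://localhost:8000", cases))
```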
## Architecture Details
### Model Configuration
The Mixedbread v2 reranker requires specific vLLM overrides:
```json
{
"architectures": ["Qwen2ForSequenceClassification"],
"classifier_from_token": ["0", "1"],
"method": "from_2_way_softmax"
}
```
- vLLM must run with the `--task score` flag for rerank/scoring mode (a hedged offline-scoring sketch follows this list)
- Modal deployment uses the `@modal.web_server` pattern with a vLLM subprocess
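For illustration only, the same overrides can be exercised with vLLM's offline scoring API; a hedged sketch (the repo uses the HTTP server path instead, and parameter names can shift between vLLM versions):

```python
# Sketch: offline scoring with the config overrides above (illustrative; the repo serves over HTTP).
from vllm import LLM

llm = LLM(
    model="mixedbread-ai/mxbai-rerank-base-v2",
    task="score",  # rerank/scoring mode, mirroring the server's --task score flag
    hf_overrides={
        "architectures": ["Qwen2ForSequenceClassification"],
        "classifier_from_token": ["0", "1"],
        "method": "from_2_way_softmax",
    },
)

query = "What is the capital of France?"
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

# score() pairs the query with each document and returns a relevance score per pair.
outputs = llm.score(query, docs)
for doc, out in zip(docs, outputs):
    print(out.outputs.score, doc)
```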
### Key Files

- `modal_app.py`: vLLM GPU Modal deployment (L4 GPU default, preloads weights during image build)
- `modal_app_onnx.py`: CPU-only Modal deployment with int8 quantized ONNX, serves `/rerank` + `/health`
- `src/run_local.py`: Local Python wrapper using the `mxbai_rerank` package
- `src/onnx_reranker.py`: Shared ONNX inference with prompt formatting and batched scoring
- `src/client_vllm_rerank.py`: HTTP client for the `/v1/rerank` endpoint
- `bench/build_beir_subset.py`: BEIR dataset sampling
- `bench/bench_rerank_async.py`: Async throughput/latency testing
### Modal Deployment Patterns

- **Image building**: Uses `uv_pip_install` with pinned versions (vllm==0.11.0, torch==2.8.0)
- **Weights preloading**: `.run_function(preload_model, volumes=...)` during image build minimizes cold-start latency
- **Cache volumes**: `/root/.cache/huggingface` and `/root/.cache/vllm`
- **GPU targeting**: `TORCH_CUDA_ARCH_LIST=89` targets L4 GPUs for faster compilation
- **ONNX optimization**: Model built/quantized once into a Modal volume, reused across containers
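Put together, the pattern looks roughly like this (a simplified sketch with assumed names such as `preload_model` and the volume name; see `modal_app.py` for the real app):

```python
# Simplified sketch of the Modal vLLM deployment pattern (names and options are assumptions).
import subprocess

import modal

hf_cache = modal.Volume.from_name("hf-cache", create_if_missing=True)

def preload_model() -> None:
    # Download weights at image-build time so containers cold-start without a network fetch.
    from huggingface_hub import snapshot_download
    snapshot_download("mixedbread-ai/mxbai-rerank-base-v2")

image = (
    modal.Image.debian_slim(python_version="3.12")
    .uv_pip_install("vllm==0.11.0", "torch==2.8.0", "huggingface_hub")
    .env({"TORCH_CUDA_ARCH_LIST": "89"})  # target L4 GPUs for faster compilation
    .run_function(preload_model, volumes={"/root/.cache/huggingface": hf_cache})
)

app = modal.App("mxbai-rerank-vllm", image=image)

@app.function(gpu="L4", volumes={"/root/.cache/huggingface": hf_cache})
@modal.web_server(port=8000)
def serve() -> None:
    # vLLM runs as a subprocess; Modal proxies HTTP traffic to the port above.
    subprocess.Popen([
        "vllm", "serve", "mixedbread-ai/mxbai-rerank-base-v2",
        "--task", "score", "--port", "8000",
    ])
```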
### ONNX Implementation

- **Prompt format**: Qwen chat template with a custom binary-relevance task prompt
- **Scoring**: Extracts the yes/no token logits and computes `yes_logits - no_logits`
- **Padding**: Pads to a multiple of 8 for performance, up to the model max length (8192 default)
- **Tokenizer**: Left-padding with the fast tokenizer; handles long docs via truncation
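A condensed sketch of that scoring path (assumptions: the yes/no tokens are "1" and "0" per the config overrides, the exported graph takes only `input_ids`/`attention_mask`, and the prompt wording is illustrative; `src/onnx_reranker.py` handles batching, multiple-of-8 padding, and truncation properly):

```python
# Condensed sketch of the ONNX scoring path (see src/onnx_reranker.py for the full version).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

model_dir = "onnx/mxbai-base"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
tokenizer.padding_side = "left"  # left-pad so the last position holds the answer token for every row
session = ort.InferenceSession(f"{model_dir}/model-int8.onnx", providers=["CPUExecutionProvider"])

# Binary-relevance token ids; "1" = relevant, "0" = not relevant (per classifier_from_token).
yes_id = tokenizer.convert_tokens_to_ids("1")
no_id = tokenizer.convert_tokens_to_ids("0")

def score(query: str, docs: list[str]) -> np.ndarray:
    # Qwen chat template with a binary-relevance instruction (prompt wording is an assumption).
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": f"Query: {query}\nDocument: {d}\nRelevant? Answer 1 or 0."}],
            tokenize=False, add_generation_prompt=True,
        )
        for d in docs
    ]
    enc = tokenizer(prompts, return_tensors="np", padding=True, truncation=True, max_length=8192)
    inputs = {k: v for k, v in enc.items() if k in {"input_ids", "attention_mask"}}
    logits = session.run(None, inputs)[0]
    last = logits[:, -1, :]                      # logits at the final (answer) position
    return last[:, yes_id] - last[:, no_id]      # yes_logits - no_logits
```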
## Development Tips

### Testing New Ideas
Use throwaway `uv run` one-liners before editing code:
```bash
uv run --with <libs> python -c "<code>"
```
For longer experiments, write to scratch files but don't commit clutter.
### Dependency Management
All dependencies installed ad-hoc via `uv run --with <package>`. No requirements.txt. Modal apps pin versions in `.uv_pip_install()`. Update versions in `modal_app.py` and `modal_app_onnx.py` when upgrading.
### Performance Tuning
- **Batch size**: Tune via `--max-num-seqs` (vLLM) and client-side chunking (see the chunking sketch after this list)
- **Dtype**: `--dtype auto` selects bf16/fp16 on GPU
- **Concurrency**: `@modal.concurrent(max_inputs=...)` controls container parallelism
- **Cold starts**: `FAST_BOOT=true` disables compilation for faster boot but lower throughput
- **Prewarm**: Keep replicas warm via `scaledown_window`, or use `FAST_BOOT=false` for compiler optimizations
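For the client-side chunking mentioned above, a small helper keeps each request under the server's batch limits while preserving original document indices (a sketch; `send_rerank` is a hypothetical stand-in for whatever HTTP client you use):

```python
# Sketch: client-side chunking of documents across multiple /v1/rerank calls.
# send_rerank(query, docs) is a placeholder returning the parsed /v1/rerank response dict.
def rerank_chunked(send_rerank, query: str, docs: list[str], chunk_size: int = 32) -> list[dict]:
    merged: list[dict] = []
    for start in range(0, len(docs), chunk_size):
        chunk = docs[start:start + chunk_size]
        for result in send_rerank(query, chunk)["results"]:
            # Re-map chunk-local indices back to positions in the full document list.
            merged.append({"index": start + result["index"],
                           "relevance_score": result["relevance_score"]})
    # Pointwise cross-encoder scores are comparable across chunks, so a global sort is valid.
    return sorted(merged, key=lambda r: r["relevance_score"], reverse=True)
```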
## GPU Requirements

- The vLLM Docker path requires an NVIDIA GPU + NVIDIA Container Toolkit
- macOS/Windows: Use the local Python wrapper or Modal deployment
- Apple Silicon: Use `make local-run-mps` with MPS fallback
## Test Data Files

- `data/example_docs.json`: Minimal test case
- `data/hf_harper_lee.json`: HuggingFace example (mirrors official docs)
- `data/rerank_samples.json`: Multilingual, code, and long-context cases
- `data/bench/*.json`: Generated via `make bench-build`
## Constraints

- All `uv run` commands must include `--with` flags for dependencies
- Modal apps require pinned versions in `.uv_pip_install()` calls
- vLLM GPU paths require NVIDIA GPU hardware
- ONNX quantization requires the model export step before inference