# vLLM Mixedbread Reranker Deployment
Deploy and test the `mixedbread-ai/mxbai-rerank-base-v2` reranker model using a vLLM GPU server, ONNX CPU inference, or a local Python wrapper. Supports Modal cloud deployment, a Docker GPU server, and async benchmarking.
## Overview
This skill guides you through deploying and testing a Mixedbread reranker with multiple deployment paths:
- **Local testing**: Python wrapper with device selection (CPU/GPU/MPS)
- **vLLM GPU**: Docker container or Modal cloud deployment
- **ONNX CPU**: Quantized int8 model for CPU-only Modal deployment
- **Benchmarking**: Async throughput testing with BEIR datasets

All commands use `uv run` for ad-hoc dependency injection without lock files.
## Key Commands
### Local Testing
Test the reranker locally using the Python wrapper:
```bash
# Basic test with custom query and docs
make local-run QUERY="your query" DOCS=data/example_docs.json

# Apple Silicon GPU (MPS)
make local-run-mps

# CPU only
make local-run-cpu

# Batch test multiple cases
make batch-run
```
The local runner uses the `mxbai_rerank` package and supports automatic device selection.
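As a reference for what the wrapper does under the hood, here is a minimal sketch using the `mxbai_rerank` package (following the usage shown on the model card; `src/run_local.py` adds CLI handling and device selection on top of this):

```python
# Minimal local rerank sketch using the mxbai_rerank package (not src/run_local.py itself).
# Run ad-hoc with: uv run --with mxbai-rerank --with torch python sketch.py
from mxbai_rerank import MxbaiRerankV2

# Load the reranker; a device can be passed explicitly (e.g. device="mps" or "cpu").
model = MxbaiRerankV2("mixedbread-ai/mxbai-rerank-base-v2")

query = "Who wrote 'To Kill a Mockingbird'?"
documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960.",
    "The novel 'Moby-Dick' was written by Herman Melville.",
    "Harper Lee was an American novelist born in Monroeville, Alabama.",
]

# rank() scores each document against the query and returns the top_k results.
results = model.rank(query, documents, return_documents=True, top_k=3)
for result in results:
    print(result)
```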
### vLLM Docker (GPU Required)
Run vLLM reranker server in Docker with GPU acceleration:
```bash
# Start vLLM server
make vllm-up-docker

# Check server health
make vllm-health

# Test /v1/rerank endpoint
make vllm-client

# Verify Docker GPU access
make gpu-check
```
**Requirements**: NVIDIA GPU + NVIDIA Container Toolkit. Not supported on macOS/Windows.
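If you want to hit the endpoint without the make target, a stripped-down client can look like this (a sketch, not `src/client_vllm_rerank.py`; the request/response fields follow vLLM's `/v1/rerank` API, and the URL and example inputs are placeholders):

```python
# Minimal client for the vLLM /v1/rerank endpoint (sketch; adjust URL/model for your setup).
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default vLLM port
payload = {
    "model": "mixedbread-ai/mxbai-rerank-base-v2",
    "query": "What is the capital of France?",
    "documents": [
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
    ],
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/rerank",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# Each result carries the original document index and a relevance score.
for result in body["results"]:
    print(result["index"], result["relevance_score"])
```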
### ONNX CPU Deployment
Export model to ONNX format and apply int8 quantization:
```bash
# Export to ONNX
uv run --with optimum[onnxruntime] --with transformers --with torch \
  python scripts/export_onnx.py --model-id mixedbread-ai/mxbai-rerank-base-v2 --out-dir onnx/mxbai-base

# Quantize to int8
uv run --with onnxruntime --with onnxruntime-tools \
  python scripts/quantize_onnx.py --model-path onnx/mxbai-base/model.onnx --out-path onnx/mxbai-base/model-int8.onnx

# Run local ONNX inference
uv run --with onnxruntime --with transformers \
  python src/run_onnx.py --query "your query" --docs-file data/example_docs.json --model-dir onnx/mxbai-base --model-file model-int8.onnx --top-k 3
```
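The two scripts are likely thin wrappers around Optimum's ONNX export and onnxruntime's dynamic quantization; a hedged sketch of the same two steps (the actual scripts may choose different export classes, options, and file names):

```python
# Sketch of ONNX export + int8 dynamic quantization (not the repo's scripts verbatim).
from pathlib import Path

from optimum.onnxruntime import ORTModelForCausalLM  # assumption: causal-LM export so token logits are available
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import AutoTokenizer

model_id = "mixedbread-ai/mxbai-rerank-base-v2"
out_dir = Path("onnx/mxbai-base")

# Export the HF checkpoint to ONNX alongside its tokenizer files.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)

# Dynamic int8 quantization: weights are quantized offline, activations at runtime.
quantize_dynamic(
    model_input=out_dir / "model.onnx",
    model_output=out_dir / "model-int8.onnx",
    weight_type=QuantType.QInt8,
)
```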
### Modal Cloud Deployment
Deploy to Modal for serverless inference:
```bash
# vLLM GPU deployment (dev server)
make modal-serve

# Deploy to production
make modal-deploy

# ONNX CPU deployment
make modal-serve-onnx
make modal-deploy-onnx
```
**Environment toggle**:
- `FAST_BOOT=true`: Faster cold starts (disables compilation)
- `FAST_BOOT=false` (default): Better throughput with compiler optimizations

```bash
FAST_BOOT=true uvx modal serve modal_app.py
```
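Inside the Modal app, a toggle like this typically just flips vLLM's eager mode; a sketch of the assumed wiring (the real `modal_app.py` may map `FAST_BOOT` onto different flags):

```python
# Sketch: mapping a FAST_BOOT env toggle onto vLLM server flags (assumed wiring).
import os

FAST_BOOT = os.environ.get("FAST_BOOT", "false").lower() == "true"

cmd = [
    "vllm", "serve", "mixedbread-ai/mxbai-rerank-base-v2",
    "--task", "score",
    "--port", "8000",
]
if FAST_BOOT:
    # Skip CUDA graph capture / compilation: faster cold start, lower steady-state throughput.
    cmd.append("--enforce-eager")
```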
### Benchmarking
Test reranker performance with BEIR datasets:
```bash
# Build test dataset (scifact example)
make bench-build BENCH_DATASET=scifact BENCH_LIMIT=100

# Run async benchmark
make bench-run BENCH_URL=https://<modal-url> CONCURRENCY=16 BENCH_METRICS=1
```
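The async benchmark boils down to a semaphore-bounded fan-out of rerank requests; a simplified sketch of the pattern (assuming an httpx client and a `{query, documents}` case format; `bench/bench_rerank_async.py` adds latency percentiles, metrics output, and dataset loading):

```python
# Simplified sketch of semaphore-bounded async rerank benchmarking (not the repo's script).
import asyncio
import json
import time

import httpx

async def bench(url: str, cases: list[dict], concurrency: int = 16) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def one(client: httpx.AsyncClient, case: dict) -> float:
        payload = {
            "model": "mixedbread-ai/mxbai-rerank-base-v2",
            "query": case["query"],
            "documents": case["documents"],
        }
        async with sem:
            t0 = time.perf_counter()
            resp = await client.post(f"{url}/v1/rerank", json=payload, timeout=60.0)
            resp.raise_for_status()
            return time.perf_counter() - t0

    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(*(one(client, c) for c in cases))
    print(f"requests={len(latencies)} mean_latency_s={sum(latencies) / len(latencies):.3f}")

if __name__ == "__main__":
    # Assumption: the built dataset is a JSON list of {"query": ..., "documents": [...]} cases.
    cases = json.load(open("data/bench/scifact.json"))
    asyncio.run(bench("http://localhost:8000", cases))
```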
## Architecture Details
### Model Configuration
The Mixedbread v2 reranker requires specific vLLM overrides:
```json
{
"architectures": ["Qwen2ForSequenceClassification"],
"classifier_from_token": ["0", "1"],
"method": "from_2_way_softmax"
}
```
- vLLM must run with the `--task score` flag for rerank/scoring mode (a hedged offline-scoring sketch follows this list)
- Modal deployment uses the `@modal.web_server` pattern with a vLLM subprocess
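For illustration only, the same overrides can be exercised with vLLM's offline scoring API; a hedged sketch (the repo uses the HTTP server path instead, and parameter names can shift between vLLM versions):

```python
# Sketch: offline scoring with the config overrides above (illustrative; the repo serves over HTTP).
from vllm import LLM

llm = LLM(
    model="mixedbread-ai/mxbai-rerank-base-v2",
    task="score",  # rerank/scoring mode, mirroring the server's --task score flag
    hf_overrides={
        "architectures": ["Qwen2ForSequenceClassification"],
        "classifier_from_token": ["0", "1"],
        "method": "from_2_way_softmax",
    },
)

query = "What is the capital of France?"
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

# score() pairs the query with each document and returns a relevance score per pair.
outputs = llm.score(query, docs)
for doc, out in zip(docs, outputs):
    print(out.outputs.score, doc)
```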
### Key Files

- `modal_app.py`: vLLM GPU Modal deployment (L4 GPU default, preloads weights during image build)
- `modal_app_onnx.py`: CPU-only Modal deployment with int8 quantized ONNX, serves `/rerank` + `/health`
- `src/run_local.py`: Local Python wrapper using the `mxbai_rerank` package
- `src/onnx_reranker.py`: Shared ONNX inference with prompt formatting and batched scoring
- `src/client_vllm_rerank.py`: HTTP client for the `/v1/rerank` endpoint
- `bench/build_beir_subset.py`: BEIR dataset sampling
- `bench/bench_rerank_async.py`: Async throughput/latency testing
### Modal Deployment Patterns

- **Image building**: Uses `uv_pip_install` with pinned versions (vllm==0.11.0, torch==2.8.0)
- **Weights preloading**: `.run_function(preload_model, volumes=...)` during image build minimizes cold-start latency
- **Cache volumes**: `/root/.cache/huggingface` and `/root/.cache/vllm`
- **GPU targeting**: `TORCH_CUDA_ARCH_LIST=89` targets L4 GPUs for faster compilation
- **ONNX optimization**: Model built/quantized once into a Modal volume, reused across containers
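Put together, the pattern looks roughly like this (a simplified sketch with assumed names such as `preload_model` and the volume name; see `modal_app.py` for the real app):

```python
# Simplified sketch of the Modal vLLM deployment pattern (names and options are assumptions).
import subprocess

import modal

hf_cache = modal.Volume.from_name("hf-cache", create_if_missing=True)

def preload_model() -> None:
    # Download weights at image-build time so containers cold-start without a network fetch.
    from huggingface_hub import snapshot_download
    snapshot_download("mixedbread-ai/mxbai-rerank-base-v2")

image = (
    modal.Image.debian_slim(python_version="3.12")
    .uv_pip_install("vllm==0.11.0", "torch==2.8.0", "huggingface_hub")
    .env({"TORCH_CUDA_ARCH_LIST": "89"})  # target L4 GPUs for faster compilation
    .run_function(preload_model, volumes={"/root/.cache/huggingface": hf_cache})
)

app = modal.App("mxbai-rerank-vllm", image=image)

@app.function(gpu="L4", volumes={"/root/.cache/huggingface": hf_cache})
@modal.web_server(port=8000)
def serve() -> None:
    # vLLM runs as a subprocess; Modal proxies HTTP traffic to the port above.
    subprocess.Popen([
        "vllm", "serve", "mixedbread-ai/mxbai-rerank-base-v2",
        "--task", "score", "--port", "8000",
    ])
```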
### ONNX Implementation

- **Prompt format**: Qwen chat template with a custom binary-relevance task prompt
- **Scoring**: Extracts the yes/no token logits and computes `yes_logits - no_logits`
- **Padding**: Pads to a multiple of 8 for performance, up to the model max length (8192 default)
- **Tokenizer**: Left-padding with the fast tokenizer; handles long docs via truncation
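A condensed sketch of that scoring path (assumptions: the yes/no tokens are "1" and "0" per the config overrides, the exported graph takes only `input_ids`/`attention_mask`, and the prompt wording is illustrative; `src/onnx_reranker.py` handles batching, multiple-of-8 padding, and truncation properly):

```python
# Condensed sketch of the ONNX scoring path (see src/onnx_reranker.py for the full version).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

model_dir = "onnx/mxbai-base"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
tokenizer.padding_side = "left"  # left-pad so the last position holds the answer token for every row
session = ort.InferenceSession(f"{model_dir}/model-int8.onnx", providers=["CPUExecutionProvider"])

# Binary-relevance token ids; "1" = relevant, "0" = not relevant (per classifier_from_token).
yes_id = tokenizer.convert_tokens_to_ids("1")
no_id = tokenizer.convert_tokens_to_ids("0")

def score(query: str, docs: list[str]) -> np.ndarray:
    # Qwen chat template with a binary-relevance instruction (prompt wording is an assumption).
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": f"Query: {query}\nDocument: {d}\nRelevant? Answer 1 or 0."}],
            tokenize=False, add_generation_prompt=True,
        )
        for d in docs
    ]
    enc = tokenizer(prompts, return_tensors="np", padding=True, truncation=True, max_length=8192)
    inputs = {k: v for k, v in enc.items() if k in {"input_ids", "attention_mask"}}
    logits = session.run(None, inputs)[0]
    last = logits[:, -1, :]                      # logits at the final (answer) position
    return last[:, yes_id] - last[:, no_id]      # yes_logits - no_logits
```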
## Development Tips

### Testing New Ideas
Use throwaway `uv run` one-liners before editing code:
```bash
uv run --with <libs> python -c "<code>"
```
For longer experiments, write to scratch files but don't commit clutter.
### Dependency Management
All dependencies installed ad-hoc via `uv run --with <package>`. No requirements.txt. Modal apps pin versions in `.uv_pip_install()`. Update versions in `modal_app.py` and `modal_app_onnx.py` when upgrading.
### Performance Tuning
- **Batch size**: Tune via `--max-num-seqs` (vLLM) and client-side chunking (see the chunking sketch after this list)
- **Dtype**: `--dtype auto` selects bf16/fp16 on GPU
- **Concurrency**: `@modal.concurrent(max_inputs=...)` controls container parallelism
- **Cold starts**: `FAST_BOOT=true` disables compilation for faster boot but lower throughput
- **Prewarm**: Keep replicas warm via `scaledown_window`, or use `FAST_BOOT=false` for compiler optimizations
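For the client-side chunking mentioned above, a small helper keeps each request under the server's batch limits while preserving original document indices (a sketch; `send_rerank` is a hypothetical stand-in for whatever HTTP client you use):

```python
# Sketch: client-side chunking of documents across multiple /v1/rerank calls.
# send_rerank(query, docs) is a placeholder returning the parsed /v1/rerank response dict.
def rerank_chunked(send_rerank, query: str, docs: list[str], chunk_size: int = 32) -> list[dict]:
    merged: list[dict] = []
    for start in range(0, len(docs), chunk_size):
        chunk = docs[start:start + chunk_size]
        for result in send_rerank(query, chunk)["results"]:
            # Re-map chunk-local indices back to positions in the full document list.
            merged.append({"index": start + result["index"],
                           "relevance_score": result["relevance_score"]})
    # Pointwise cross-encoder scores are comparable across chunks, so a global sort is valid.
    return sorted(merged, key=lambda r: r["relevance_score"], reverse=True)
```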
## GPU Requirements

- The vLLM Docker path requires an NVIDIA GPU + NVIDIA Container Toolkit
- macOS/Windows: Use the local Python wrapper or Modal deployment
- Apple Silicon: Use `make local-run-mps` with MPS fallback
## Test Data Files

- `data/example_docs.json`: Minimal test case
- `data/hf_harper_lee.json`: HuggingFace example (mirrors official docs)
- `data/rerank_samples.json`: Multilingual, code, and long-context cases
- `data/bench/*.json`: Generated via `make bench-build`
## Constraints

- All `uv run` commands must include `--with` flags for dependencies
- Modal apps require pinned versions in `.uv_pip_install()` calls
- vLLM GPU paths require NVIDIA GPU hardware
- ONNX quantization requires the model export step before inference