# Deploy LLM with GitOps

Deploy and manage Large Language Models (LLMs) using an ArgoCD/GitOps workflow, with vLLM serving on GPU infrastructure.
## System Requirements

This skill is optimized for high-performance AI/ML workloads with the following specifications:

- **GPU**: NVIDIA GeForce RTX series with CUDA support
- **RAM**: 60 GB+ recommended
- **Storage**: 1.5 TB+ available space
- **OS**: Linux (Fedora/Ubuntu/Debian)

## Project Structure
The project follows this directory layout:
```
/home/gpt-oss/
├── backend/ # Backend services
├── frontend/ # Frontend application
├── vllm/ # VLLM configuration
├── models/ # Model storage
├── docker-compose.yml
├── Makefile
└── build-vllm-local.sh
```
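The requirements above can be checked with a small preflight script. This is a sketch: the thresholds come from the System Requirements list, and the mount point checked for free space (`/home/gpt-oss`) is an assumption; adjust both to your host.

```bash
#!/usr/bin/env bash
# Preflight: verify GPU, RAM, and storage against the recommendations above.
checks_run=0

# GPU: nvidia-smi is only present on hosts with the NVIDIA driver installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "WARN: nvidia-smi not found (no NVIDIA driver?)"
fi
checks_run=$((checks_run + 1))

# RAM: 60 GB+ recommended. MemTotal in /proc/meminfo is in kB.
total_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo 2>/dev/null)
[ "${total_gb:-0}" -ge 60 ] || echo "WARN: only ${total_gb:-?} GB RAM (60 GB+ recommended)"
checks_run=$((checks_run + 1))

# Storage: 1.5 TB+ recommended where the models will live.
df -h /home/gpt-oss 2>/dev/null || df -h /
checks_run=$((checks_run + 1))

echo "preflight: ${checks_run} checks run"
```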
## Instructions

When deploying or managing LLM models in this environment, follow these steps:

### 1. Initial Setup and Validation

- Verify GPU availability using `nvidia-smi`
- Check CUDA version compatibility
- Confirm Docker and Docker Compose are installed
- Verify ArgoCD connectivity if deploying to a cluster

### 2. Model Management

- Store model files in the `models/` directory
- Verify model compatibility with the installed vLLM and PyTorch versions
- Check available disk space before downloading new models
- Use appropriate quantization for VRAM constraints

### 3. vLLM Build and Configuration
**For local vLLM builds:**
```bash
./build-vllm-local.sh
```
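If the build machine has a newer GPU (see GPU Compatibility below), it can help to pin the CUDA architecture list before running the build script. The value `12.0` corresponds to Blackwell/sm_120 and is an assumption here; set it to your card's compute capability.

```bash
# Pin the target CUDA architecture so kernels are built for your GPU.
# "12.0" = Blackwell (sm_120); use e.g. "8.9" for an RTX 4090 (Ada).
export TORCH_CUDA_ARCH_LIST="12.0"

# Run the local build only if the script is present in this directory.
if [ -x ./build-vllm-local.sh ]; then
  ./build-vllm-local.sh
else
  echo "build-vllm-local.sh not found here; run from the project root"
fi
```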
**For Docker builds:**
```bash
# Standard build
docker-compose build

# Fast build with pre-downloaded dependencies
./build-fast.sh

# Download dependencies separately
./download-deps.sh
```
### 4. Running Services
**Start all services:**
```bash
docker-compose up -d
```
**Monitor logs:**
```bash
docker-compose logs -f
```
**Stop services:**
```bash
docker-compose down
```
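Once the stack is up, a quick liveness probe is useful. The port (8000) and the OpenAI-compatible `/v1/models` route are vLLM server defaults and may differ in your `docker-compose.yml`.

```bash
# Probe the vLLM HTTP endpoint; fall back gracefully when it is not up.
VLLM_URL="${VLLM_URL:-http://localhost:8000}"
if curl -fsS --max-time 5 "${VLLM_URL}/v1/models" >/dev/null 2>&1; then
  status=up
else
  status=down
fi
echo "vllm: ${status}"
```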
### 5. Code Quality Checks
**Backend (Python):**
```bash
cd backend
ruff check .
mypy .
pytest
```
**Frontend (Node.js):**
```bash
cd frontend
npm run lint
npm run typecheck
npm test
```
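The two check sequences above can be combined into one gate script. A sketch, assuming the `backend/` and `frontend/` layout shown earlier and that the tools are already installed:

```bash
#!/usr/bin/env bash
# Run all quality gates in one pass; exit on the first failure.
set -euo pipefail

run() { echo "==> $*"; "$@"; }

if [ -d backend ]; then
  (cd backend && run ruff check . && run mypy . && run pytest)
else
  echo "backend/ not found, skipping Python checks"
fi

if [ -d frontend ]; then
  (cd frontend && run npm run lint && run npm run typecheck && run npm test)
else
  echo "frontend/ not found, skipping Node checks"
fi

result="all checks passed"
echo "$result"
```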
### 6. Git Workflow

- Work on feature branches
- Ensure all tests pass before committing
- Write descriptive commit messages focusing on "why"
- Target the `main` branch for production deployments
- Let ArgoCD handle automated deployments

### 7. Makefile Commands
Use provided Make targets for common operations:
```bash
make build # Build all components
make run # Run services
make clean # Clean build artifacts
```
## Important Considerations

### GPU Compatibility

- **RTX 5090 (Blackwell, sm_120)**: may require PyTorch nightly builds or compilation from source
- **RTX 4090 and earlier**: fully supported with PyTorch 2.5.1+
- Set an appropriate `TORCH_CUDA_ARCH_LIST` for your GPU architecture

### Docker Build Optimization

- Pre-download PyTorch wheels matching the Docker image's Python version (typically 3.10)
- Use BuildKit caching: `DOCKER_BUILDKIT=1 docker build`
- Cache wheel layers in the `deps/wheels/` directory
- Match the CUDA version between host and container

### Dependency Version Management

- **PyTorch components**: keep torch, torchvision, and torchaudio versions aligned
- **vLLM**: may auto-upgrade PyTorch to its required version
- **CUDA**: ensure wheel files match the target CUDA version (e.g., cu124)

### Performance Tuning

- Set `VLLM_WORKER_MULTIPROC_METHOD=spawn` for stability
- Configure `PYTORCH_NO_CUDA_MEMORY_CACHING=1` if needed
- Monitor GPU memory usage with `nvidia-smi`
- Adjust batch sizes based on available VRAM

## Common Issues and Solutions
### Python Version Mismatch

- **Problem**: wheel files incompatible with the Docker image's Python version
- **Solution**: download wheels matching the Docker base image (usually Python 3.10)

### PyTorch Version Conflicts

- **Problem**: torchaudio/torchvision version mismatch with torch
- **Solution**: use a compatible version matrix (e.g., torch 2.5.1 + torchaudio 2.5.1 + torchvision 0.20.1)

### GPU Not Detected

- **Problem**: CUDA compatibility issues with new GPU architectures
- **Solution**: try a PyTorch nightly build, or compile from source with explicit architecture support

### Large Image Size

- **Problem**: Docker images exceeding 30 GB
- **Solution**: use multi-stage builds, remove build dependencies, and leverage layer caching

## Examples
### Deploy a New Model
```bash
# Download model to the models directory
cd /home/gpt-oss/models
# ... download model files ...

# Update vLLM configuration
cd /home/gpt-oss/vllm
# Edit config to reference the new model

# Rebuild and restart services
cd /home/gpt-oss
docker-compose down
docker-compose up -d --build
```
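After the rebuild, it is worth confirming that the server actually lists the new model. The port is the vLLM default and an assumption here.

```bash
# Count the model entries the running server reports; 0 usually means the
# stack is still starting (or the port mapping differs).
served=$(curl -s --max-time 5 http://localhost:8000/v1/models | grep -c '"id"') || true
if [ "${served:-0}" -gt 0 ]; then
  echo "server lists ${served} model entries"
else
  echo "no models listed yet; check 'docker-compose logs -f vllm'"
fi
```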
### Update Dependencies
```bash
# Download latest compatible wheels
./download-deps.sh

# Rebuild vLLM with new dependencies
./build-fast.sh

# Restart services
docker-compose restart
```
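After rebuilding, a quick import check confirms the torch family stayed aligned. Run it inside the container where the stack is installed; outside it, the fallback message is expected.

```bash
# Print the installed torch/torchvision/torchaudio versions if importable.
if versions=$(python3 -c 'import torch, torchvision, torchaudio; print(torch.__version__, torchvision.__version__, torchaudio.__version__)' 2>/dev/null); then
  echo "torch stack: ${versions}"
else
  versions="not importable"
  echo "torch stack not importable in this environment"
fi
```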
### Monitor Deployment
```bash
# Check service status
docker-compose ps

# View logs
docker-compose logs -f vllm

# Check GPU utilization
watch -n 1 nvidia-smi
```
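For scripted monitoring, `nvidia-smi`'s query mode gives machine-readable output. The query flags are standard `nvidia-smi` options; the guard makes the snippet safe on hosts without the driver.

```bash
# One-line GPU utilization/memory summary, CSV-formatted.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_line=$(nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader)
else
  gpu_line="nvidia-smi not available on this host"
fi
echo "$gpu_line"
```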
## Notes

- This is an ArgoCD/GitOps-managed project: infrastructure changes should be committed to Git
- vLLM is configured for optimized LLM inference serving
- System specs support running large models (70B+) with appropriate quantization
- Always test builds locally before pushing to the GitOps repository
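The Git workflow from step 6 can be sketched end to end. This runs in a throwaway repository so it is safe to execute anywhere; in the real project you would work in `/home/gpt-oss` and push to the GitOps remote. Branch and file names here are hypothetical.

```bash
#!/usr/bin/env bash
# Demonstrate the branch -> commit flow in a temporary repo.
repo=$(mktemp -d)
cd "$repo"
git init -q -b main 2>/dev/null || { git init -q; git checkout -qb main; }
git config user.email "you@example.com"
git config user.name "Example User"

# A placeholder config change; in the real project this would live in vllm/.
echo "model: placeholder" > vllm-config.yaml
git checkout -qb feature/new-model-config
git add vllm-config.yaml
git commit -qm "Use smaller default model to fit single-GPU hosts"

branch=$(git rev-parse --abbrev-ref HEAD)
echo "committed on ${branch}; merge to main and let ArgoCD sync"
```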