Open-Vocabulary Object Detection System

Production-ready object detection system combining **Grounding DINO** (text-conditioned detection) and **SAM 2** (segmentation masks). Enables querying images with natural language to detect arbitrary object categories.

What This Skill Does

Guides you through setting up, developing, testing, and evaluating a state-of-the-art open-vocabulary object detection pipeline. Inference takes roughly 265-490 ms per image on an RTX 3070 GPU, depending on how many objects are segmented, and the system includes COCO evaluation metrics and an interactive Streamlit demo.

Instructions

1. Environment Setup

Set up the Conda environment and project structure:

```bash
# Create and activate the conda environment
conda env create -f env-ovod.yml
conda activate ovod

# Run quickstart setup (data linking, directory structure)
make quickstart

# Set Python path for imports (add to ~/.bashrc for persistence)
export PYTHONPATH="$(pwd)/repo"
```

**Verify installation:**

```bash

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

python -c "from transformers import AutoModelForZeroShotObjectDetection; print('✓ Transformers ready')"

```

2. Project Structure Navigation

Key files to understand:

**`repo/ovod/pipeline.py`** — Main `OVODPipeline` orchestrator class (start here)

**`repo/src/detector.py`** — Grounding DINO wrapper for text-conditioned detection

**`repo/src/segmenter.py`** — SAM 2 wrapper for instance segmentation

**`repo/demo_app.py`** — Streamlit web interface for interactive demos

**`repo/eval.py`** — COCO evaluation with multiple prompt strategies

**`repo/src/prompts.py`** — Text prompt parsing and formatting utilities

**`tests/`** — Pytest suite with mocked components for CI

3. Running the Demo

Launch the interactive Streamlit interface:

```bash
make demo
# Opens browser at http://localhost:8501
```

**Demo features:**

Upload images or use COCO validation samples

Enter natural language queries (e.g., "person. car. dog.")

View bounding boxes, segmentation masks, and confidence scores

Inspect per-stage latency breakdown

4. Development Workflow

**Code pattern for pipeline usage:**

```python
import torch

from ovod.pipeline import OVODPipeline

# Initialize pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipeline = OVODPipeline(device=device)

# Run detection + segmentation
result = pipeline(
    image_path="path/to/image.jpg",
    text_prompt="person . car . dog .",
    box_threshold=0.35,
    text_threshold=0.25,
)

# Access results
boxes = result["boxes"]      # (N, 4) xyxy format
labels = result["labels"]    # List of detected class names
scores = result["scores"]    # (N,) confidence scores
masks = result["masks"]      # List of binary masks
timings = result["timings"]  # Latency breakdown dict
```
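
Downstream code usually filters or reshapes these fields before using them. The following is a minimal sketch, assuming the result dict has the keys shown above. The helper name `summarize_detections` and the sample values are hypothetical, and plain lists stand in for the arrays the pipeline returns:

```python
from typing import Any, Dict, List


def summarize_detections(result: Dict[str, Any], min_score: float = 0.5) -> List[Dict[str, Any]]:
    """Return one record per detection whose score is at least min_score."""
    records = []
    for box, label, score in zip(result["boxes"], result["labels"], result["scores"]):
        if score < min_score:
            continue
        x1, y1, x2, y2 = box  # xyxy, absolute pixels
        records.append({
            "label": label,
            "score": float(score),
            "width": x2 - x1,
            "height": y2 - y1,
        })
    return records


# Hypothetical result, hand-written to mirror the structure above.
fake_result = {
    "boxes": [[10, 20, 110, 220], [50, 60, 90, 100]],
    "labels": ["person", "dog"],
    "scores": [0.92, 0.41],
}
records = summarize_detections(fake_result, min_score=0.5)
print(records)
```

Because the boxes are absolute xyxy pixels, width and height fall out of simple subtraction with no image size needed.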

**Important conventions:**

Grounding DINO prompts **must be period-separated, with every class name (including the last) followed by ` .`**: `"class1 . class2 . class3 ."`

Bounding boxes use **xyxy format** (top-left, bottom-right coordinates)

All coordinates are absolute pixel values (not normalized)

Device handling is automatic but can be overridden
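
The prompt convention above is easy to get wrong when class lists are built by hand. Here is a small sketch of a formatter and its inverse. The names `format_prompt` and `parse_prompt` are hypothetical and are not claimed to match what `repo/src/prompts.py` exports; lowercasing the class names is an assumption based on common Grounding DINO usage:

```python
from typing import List


def format_prompt(class_names: List[str]) -> str:
    """Join class names as 'a . b . c .' (lowercased, whitespace-trimmed)."""
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one class name is required")
    return " . ".join(cleaned) + " ."


def parse_prompt(prompt: str) -> List[str]:
    """Split a period-separated prompt back into class names."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


prompt = format_prompt(["Person", " car ", "dog"])
print(prompt)
print(parse_prompt(prompt))
```

The parser tolerates both `"person . car ."` and `"person. car."` spacing, so it can normalize queries typed into the demo before they are reformatted.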

5. Testing

Run the test suite:

```bash
# Full test suite
pytest tests/ -v

# Skip slow integration tests (default CI behavior)
pytest -m "not slow"

# Test specific module
pytest tests/test_pipeline.py -v

# Run linting
make lint

# Clean build artifacts
make clean
```

**Test markers:**

`@pytest.mark.slow` — Long-running integration tests

`@pytest.mark.unit` — Fast unit tests with mocked components

Tests set the `OVOD_SKIP_SAM2=1` environment variable to skip heavy model imports in CI
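
How the project reads `OVOD_SKIP_SAM2` is not shown here, so the following is only one plausible pattern. It sketches an environment-variable guard, with hypothetical helper names `sam2_skipped` and `load_segmenter`, that keeps the heavy import out of the code path when the flag is set:

```python
import os
from typing import Mapping, Optional


def sam2_skipped(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when OVOD_SKIP_SAM2 is set to 1 in the given environment."""
    env = os.environ if env is None else env
    return env.get("OVOD_SKIP_SAM2", "0") == "1"


def load_segmenter(env: Optional[Mapping[str, str]] = None):
    """Return None when SAM 2 is skipped; otherwise defer to the real loader."""
    if sam2_skipped(env):
        return None  # callers fall back to boxes-only output
    # Placeholder: the real SAM 2 wrapper would be imported and built here.
    raise NotImplementedError("import and build the real SAM 2 wrapper here")


print(sam2_skipped({"OVOD_SKIP_SAM2": "1"}))
print(sam2_skipped({}))
```

Passing the environment in as an argument makes the guard testable without mutating `os.environ`. In a pytest suite the same predicate could feed `pytest.mark.skipif`.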

6. COCO Evaluation

Run evaluation with different prompt strategies:

```bash
# Quick 50-image eval with common prompt strategy
make eval-50

# 200-image evaluation
make eval-200

# Full 5000-image COCO validation set
make eval-full
```

**Prompt strategies:**

| Strategy | Classes | Use Case |
|----------|---------|----------|
| `person` | 1 class | Person detection only |
| `common` | 18 classes | General objects (person, car, chair, etc.) |
| `full` | 80 classes | Complete COCO class set |

**Evaluation metrics:**

COCO mAP (mean Average Precision)

Per-class AP scores

Latency breakdown (detection, segmentation, NMS)

Throughput (images/second)
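
The latency and throughput figures can be derived from the per-image `timings` dicts the pipeline returns. This sketch assumes each dict holds seconds and includes a `total_time` key (the key used in the batch example in this document). The other stage names and all values are made up for illustration:

```python
from statistics import mean
from typing import Dict, List


def summarize_timings(timings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each stage across images and derive throughput from total time."""
    if not timings:
        raise ValueError("no timings to summarize")
    stages = list(timings[0].keys())
    summary = {f"mean_{stage}": mean(t[stage] for t in timings) for stage in stages}
    total_seconds = sum(t["total_time"] for t in timings)
    summary["images_per_second"] = len(timings) / total_seconds
    return summary


# Made-up per-image timings in seconds.
sample = [
    {"detection": 0.10, "segmentation": 0.05, "total_time": 0.20},
    {"detection": 0.10, "segmentation": 0.15, "total_time": 0.30},
]
summary = summarize_timings(sample)
print(summary)
```

Throughput is computed from summed wall time rather than from the mean of per-image rates, so slow images are weighted correctly.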

7. Code Style Guidelines

Follow these conventions when contributing:

**Indentation:** 4 spaces (Python), 2 spaces (YAML)

**Type hints:** Required on all function signatures

**Docstrings:** Google-style with Args/Returns sections

**Error handling:** Graceful degradation with informative warnings

**Device agnostic:** Always support both CPU and CUDA paths

**Example function signature:**

```python
from typing import List, Tuple

import numpy as np


def process_detections(
    boxes: np.ndarray,
    labels: List[str],
    scores: np.ndarray,
    iou_threshold: float = 0.5,
) -> Tuple[np.ndarray, List[str], np.ndarray]:
    """Apply non-maximum suppression to detections.

    Args:
        boxes: (N, 4) bounding boxes in xyxy format
        labels: List of N class labels
        scores: (N,) confidence scores
        iou_threshold: IoU threshold for NMS

    Returns:
        Tuple of (filtered_boxes, filtered_labels, filtered_scores)
    """
```
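
The signature above documents the contract but leaves the body out. For orientation, here is a minimal greedy NMS sketch in pure Python, with lists instead of NumPy arrays. It is not the project's implementation. Suppressing only boxes that share a label (class-aware NMS) is an assumption, and the names `iou` and `greedy_nms` are illustrative:

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2) in absolute pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two xyxy boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: List[Box],
    labels: List[str],
    scores: List[float],
    iou_threshold: float = 0.5,
) -> Tuple[List[Box], List[str], List[float]]:
    """Keep the best-scoring box, drop same-label boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        suppressed = any(
            labels[i] == labels[j] and iou(boxes[i], boxes[j]) > iou_threshold
            for j in keep
        )
        if not suppressed:
            keep.append(i)
    return [boxes[i] for i in keep], [labels[i] for i in keep], [scores[i] for i in keep]


# Two heavily overlapping "person" boxes and one distant box.
demo_boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [100, 100, 110, 110]]
kept_boxes, kept_labels, kept_scores = greedy_nms(
    demo_boxes, ["person", "person", "person"], [0.9, 0.8, 0.7]
)
print(kept_boxes)
```

This version is quadratic in the number of boxes, which is fine for the handful of detections per image typical here. A vectorized routine would be the usual choice for larger counts.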

8. Performance Optimization

**Expected latency on RTX 3070:**

Detection (Grounding DINO): ~100ms

Segmentation (SAM 2): ~45ms per object

NMS post-processing: ~5ms

**Total:** ~265ms for single object, ~490ms for multiple objects

**Optimization strategies:**

Batch processing for multiple images

Raise `box_threshold` to reduce SAM 2 calls (fewer boxes pass the filter, so fewer objects are segmented)

Use smaller SAM 2 variants (Hiera-Tiny vs Hiera-Large)

GPU memory management with `torch.cuda.empty_cache()`
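
SAM 2 cost scales with the number of boxes that survive `box_threshold`, so the threshold directly controls segmentation time. The sketch below plugs the per-stage figures quoted above into a back-of-the-envelope estimate. The function name and the sample scores are hypothetical, and treating a score equal to the threshold as passing is an assumption. It counts only the three itemized stages, so it does not reproduce the end-to-end totals:

```python
from typing import Iterable

# Per-stage figures quoted above for an RTX 3070, in milliseconds.
DETECTION_MS = 100.0
SAM2_PER_OBJECT_MS = 45.0
NMS_MS = 5.0


def estimate_stage_latency_ms(scores: Iterable[float], box_threshold: float) -> float:
    """Estimate itemized-stage latency for boxes at or above the threshold."""
    n_objects = sum(1 for s in scores if s >= box_threshold)
    return DETECTION_MS + SAM2_PER_OBJECT_MS * n_objects + NMS_MS


# Made-up detection scores for one image.
scores = [0.82, 0.55, 0.38, 0.31, 0.27]
for threshold in (0.25, 0.35, 0.50):
    print(threshold, estimate_stage_latency_ms(scores, threshold))
```

Sweeping the threshold like this on a few representative images is a cheap way to pick an operating point before running a full evaluation.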

9. Common Issues & Solutions

**Import errors:**

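```bash
# Ensure PYTHONPATH is set (run from the project root)
export PYTHONPATH="$(pwd)/repo"

# Add to ~/.bashrc for persistence. Double quotes expand $(pwd) now, so the
# absolute path is written instead of being re-evaluated in every new shell.
echo "export PYTHONPATH=$(pwd)/repo" >> ~/.bashrc
```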

**SAM 2 installation issues:**

```bash
# Reinstall SAM 2 from source
cd repo
pip uninstall sam2 -y
pip install -e .
```

**CI test failures:**

```bash
# Skip heavy model loading in CI
export OVOD_SKIP_SAM2=1
pytest tests/ -v
```

**CUDA out of memory:**

```python
# Reduce batch size or use smaller models
pipeline = OVODPipeline(
    device="cuda",
    sam_variant="hiera-tiny",  # Instead of hiera-large
)
```

10. Extending the System

**Add custom prompt strategies:**

```python
# Edit repo/src/prompts.py
CUSTOM_CLASSES = {
    "indoor": ["chair . table . lamp . sofa ."],
    "outdoor": ["car . tree . building . road ."],
}
```

**Integrate new models:**

Detector interface in `repo/src/detector.py` (must implement `__call__` with box outputs)

Segmenter interface in `repo/src/segmenter.py` (must accept boxes, return masks)
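
The two contracts above can be written down as structural types, which lets toy implementations stand in for the real models during development. This is a sketch only: the exact argument lists of the project's detector and segmenter are not shown in this guide, so the `image` and `text_prompt` parameters, the return shapes, and every class name below are assumptions:

```python
from typing import List, Protocol, Sequence, Tuple

Box = Tuple[float, float, float, float]  # xyxy, absolute pixels
Mask = List[List[bool]]                  # stand-in for a binary mask array


class Detector(Protocol):
    def __call__(self, image: object, text_prompt: str) -> Tuple[List[Box], List[str], List[float]]:
        ...


class Segmenter(Protocol):
    def __call__(self, image: object, boxes: Sequence[Box]) -> List[Mask]:
        ...


class FixedBoxDetector:
    """Toy detector that always returns one box; useful as a test double."""

    def __call__(self, image, text_prompt):
        return [(0.0, 0.0, 2.0, 2.0)], ["person"], [0.9]


class BoxFillSegmenter:
    """Toy segmenter that marks every pixel inside each box as foreground."""

    def __init__(self, height: int, width: int) -> None:
        self.height, self.width = height, width

    def __call__(self, image, boxes):
        return [
            [[x1 <= x < x2 and y1 <= y < y2 for x in range(self.width)]
             for y in range(self.height)]
            for (x1, y1, x2, y2) in boxes
        ]


def run(detector: Detector, segmenter: Segmenter, image: object, prompt: str):
    """Chain any detector into any segmenter that satisfies the protocols."""
    boxes, labels, scores = detector(image, prompt)
    return boxes, labels, scores, segmenter(image, boxes)


boxes, labels, scores, masks = run(FixedBoxDetector(), BoxFillSegmenter(3, 3), None, "person .")
print(labels, len(masks))
```

With `Protocol`, a replacement model only needs a matching `__call__`, with no base class to inherit from, which keeps third-party wrappers decoupled from the pipeline code.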

**Custom evaluation metrics:**

```python
# Extend repo/metrics/coco_eval.py
from pycocotools.cocoeval import COCOeval

# Add precision-recall curves, per-class analysis, etc.
```
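
As one example of such an extension, a precision-recall curve for a single class can be built from ranked detections once each has been marked as a true or false positive. The sketch below assumes that matching (normally done by IoU against ground truth) has already happened. The function name and the sample data are hypothetical, and it does not use `COCOeval` internals:

```python
from typing import List, Tuple


def precision_recall_curve(
    scores: List[float], is_true_positive: List[bool], num_ground_truth: int
) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each detection, in descending score order."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    ranked = sorted(zip(scores, is_true_positive), key=lambda p: p[0], reverse=True)
    curve = []
    tp = 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        tp += int(hit)
        curve.append((tp / num_ground_truth, tp / rank))
    return curve


# Made-up detections for one class: three ground-truth objects, four detections.
curve = precision_recall_curve([0.9, 0.8, 0.6, 0.3], [True, False, True, True], 3)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Average precision is the area under this curve after smoothing, so the same ranked list is the starting point for per-class AP analysis.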

Constraints

**Python 3.10 required** — PyTorch 2.5.1 + CUDA 12.1 compatibility

**GPU recommended** — CPU inference is ~10x slower

**COCO dataset required** — Download from [COCO website](https://cocodataset.org/) for evaluation

**Disk space** — ~15GB for models + COCO validation set

**Memory** — Minimum 8GB GPU VRAM for default SAM 2 (Hiera-Large)
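
Several of these constraints can be checked before anything heavy is downloaded. The following preflight sketch uses only the standard library plus an optional `torch` import. The function name `preflight` is hypothetical, and the thresholds use decimal gigabytes so that a nominal 8GB card is not rejected for reporting slightly less than 8 GiB:

```python
import shutil
import sys
from typing import Dict

REQUIRED_PYTHON = (3, 10)
REQUIRED_DISK_GB = 15
REQUIRED_VRAM_GB = 8
GB = 10**9  # decimal gigabytes


def preflight(path: str = ".") -> Dict[str, bool]:
    """Check the constraints that can be verified locally."""
    report = {
        "python_ok": sys.version_info[:2] == REQUIRED_PYTHON,
        "disk_ok": shutil.disk_usage(path).free >= REQUIRED_DISK_GB * GB,
        "gpu_ok": False,
    }
    try:
        import torch  # optional, so the check itself runs without PyTorch

        if torch.cuda.is_available():
            vram = torch.cuda.get_device_properties(0).total_memory
            report["gpu_ok"] = vram >= REQUIRED_VRAM_GB * GB
    except ImportError:
        pass
    return report


report = preflight()
print(report)
```

A `False` for `gpu_ok` is not fatal, since CPU inference still works, but it is a signal to expect the slowdown noted above or to switch to a smaller SAM 2 variant.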

Example Usage

**Quick detection on single image:**

```bash
python -c "
from ovod.pipeline import OVODPipeline
pipeline = OVODPipeline(device='cuda')
result = pipeline('image.jpg', 'person . car .')
print(f'Detected {len(result[\"boxes\"])} objects')
"
```

**Batch evaluation script:**

```python
from pathlib import Path

from ovod.pipeline import OVODPipeline

pipeline = OVODPipeline(device="cuda")
images = Path("data/images").glob("*.jpg")

for img in images:
    result = pipeline(str(img), "person . car . dog .")
    print(f"{img.name}: {len(result['boxes'])} detections in {result['timings']['total_time']:.2f}s")
```
