A production-ready open-vocabulary object detection system that combines **Grounding DINO** (text-conditioned detection) with **SAM 2** (segmentation masks). It lets you query images in natural language to detect arbitrary object categories.
This guide walks through setting up, developing, testing, and evaluating the pipeline. Inference takes roughly 265-490 ms per image on an RTX 3070 GPU, and the project includes COCO evaluation metrics and an interactive Streamlit demo.
Set up the Conda environment and project structure:
```bash
conda env create -f env-ovod.yml
conda activate ovod
make quickstart
export PYTHONPATH=$(pwd)/repo
```
**Verify installation:**
```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "from transformers import AutoModelForZeroShotObjectDetection; print('✓ Transformers ready')"
```
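If you prefer a single check over several one-liners, a small script along these lines can report which packages are missing. It is only a sketch: the package list is an assumption drawn from the imports and tools mentioned in this guide, and `missing_packages` is a hypothetical helper, not part of the project.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package list, based on the tools referenced in this guide.
    required = ["torch", "transformers", "pycocotools", "streamlit"]
    missing = missing_packages(required)
    if missing:
        print(f"Missing: {', '.join(missing)}")
    else:
        print("All packages found")
```

`find_spec` only looks the package up, so the check stays fast and does not initialize CUDA or load any model weights.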
Key files to understand:
Launch the interactive Streamlit interface:
```bash
make demo
```
**Demo features:**
**Code pattern for pipeline usage:**
```python
from ovod.pipeline import OVODPipeline
import torch

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
pipeline = OVODPipeline(device=device)

# Class names are separated by " . " and the prompt ends with a period
result = pipeline(
    image_path="path/to/image.jpg",
    text_prompt="person . car . dog .",
    box_threshold=0.35,
    text_threshold=0.25,
)

boxes = result["boxes"]      # (N, 4) xyxy format
labels = result["labels"]    # List of detected class names
scores = result["scores"]    # (N,) confidence scores
masks = result["masks"]      # List of binary masks
timings = result["timings"]  # Latency breakdown dict
```
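Once you have a result dict, a common next step is to keep only the confident detections. The sketch below assumes that `boxes`, `labels`, `scores`, and `masks` are aligned by index (entry `i` of each describes the same object), which the source does not state explicitly. `filter_by_score` is a hypothetical helper, and the toy dict stands in for real pipeline output.

```python
def filter_by_score(result, min_score):
    """Keep only detections whose score is at least `min_score`.

    Assumes the four per-detection fields are aligned by index.
    """
    keep = [i for i, score in enumerate(result["scores"]) if score >= min_score]
    return {
        "boxes": [result["boxes"][i] for i in keep],
        "labels": [result["labels"][i] for i in keep],
        "scores": [result["scores"][i] for i in keep],
        "masks": [result["masks"][i] for i in keep],
    }


# Toy stand-in for pipeline output (masks omitted for brevity)
toy_result = {
    "boxes": [[0, 0, 10, 10], [5, 5, 20, 20]],
    "labels": ["person", "car"],
    "scores": [0.9, 0.3],
    "masks": [None, None],
}
confident = filter_by_score(toy_result, min_score=0.5)
print(confident["labels"])  # ['person']
```

The helper returns plain lists, so convert back to arrays if downstream code expects them.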
**Important conventions:**
Run the test suite:
```bash
# Run the full test suite
pytest tests/ -v
# Skip tests marked as slow
pytest -m "not slow"
# Run only the pipeline tests
pytest tests/test_pipeline.py -v
make lint
make clean
```
**Test markers:**
Run evaluation with different prompt strategies:
```bash
make eval-50
make eval-200
make eval-full
```
**Prompt strategies:**
| Strategy | Classes | Use Case |
|----------|---------|----------|
| `person` | 1 class | Person detection only |
| `common` | 18 classes | General objects (person, car, chair, etc.) |
| `full` | 80 classes | Complete COCO class set |
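Each strategy ultimately has to become a text prompt in the `"person . car . dog ."` style used elsewhere in this guide. A minimal sketch of that conversion follows. `build_prompt` is a hypothetical helper rather than the project's own function, and lowercasing the names is an assumption based on the usual Grounding DINO convention of lowercase, period-terminated queries.

```python
def build_prompt(class_names):
    """Join class names into a 'name . name . name .' prompt string."""
    # Drop empty entries and normalize case and surrounding whitespace
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    if not cleaned:
        return ""
    return " . ".join(cleaned) + " ."


print(build_prompt(["person"]))                # person .
print(build_prompt(["Person", "car", "dog"]))  # person . car . dog .
```

Building prompts from a list keeps class names reusable, for example when mapping detected labels back to COCO category IDs during evaluation.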
**Evaluation metrics:**
Follow these conventions when contributing:
**Example function signature:**
```python
from typing import List, Tuple

import numpy as np


def process_detections(
    boxes: np.ndarray,
    labels: List[str],
    scores: np.ndarray,
    iou_threshold: float = 0.5,
) -> Tuple[np.ndarray, List[str], np.ndarray]:
    """Apply non-maximum suppression to detections.

    Args:
        boxes: (N, 4) bounding boxes in xyxy format
        labels: List of N class labels
        scores: (N,) confidence scores
        iou_threshold: IoU threshold for NMS

    Returns:
        Tuple of (filtered_boxes, filtered_labels, filtered_scores)
    """
```
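For readers unfamiliar with non-maximum suppression, the following is a dependency-free sketch of the greedy, class-agnostic variant. It is not the project's implementation: `iou` and `nms_indices` are hypothetical names, the code works on plain lists rather than NumPy arrays, and whether the real function suppresses per class or across classes is not stated in the source.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_indices(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nms_indices(boxes, scores))  # [0, 2]
```

The returned indices can then be used to select the matching boxes, labels, and scores, which is the tuple shape the signature above describes.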
**Expected latency on RTX 3070:**
**Optimization strategies:**
**Import errors:**
```bash
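# Set PYTHONPATH for the current shell (run from the project root)
export PYTHONPATH=$(pwd)/repo
# Persist it: double quotes expand $(pwd) now, so the absolute path is stored
echo "export PYTHONPATH=$(pwd)/repo" >> ~/.bashrc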
```
**SAM 2 installation issues:**
```bash
cd repo
pip uninstall sam2 -y
pip install -e .
```
**CI test failures:**
```bash
export OVOD_SKIP_SAM2=1
pytest tests/ -v
```
**CUDA out of memory:**
```python
pipeline = OVODPipeline(
    device="cuda",
    sam_variant="hiera-tiny",  # Instead of hiera-large
)
```
**Add custom prompt strategies:**
```python
CUSTOM_CLASSES = {
    "indoor": ["chair . table . lamp . sofa ."],
    "outdoor": ["car . tree . building . road ."],
}
```
**Integrate new models:**
**Custom evaluation metrics:**
```python
from pycocotools.cocoeval import COCOeval
```
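`COCOeval` scores detections supplied in the COCO results format, where each box is `[x, y, width, height]` rather than the xyxy layout the pipeline returns. The sketch below shows that conversion for one image. `to_coco_results` is a hypothetical helper, and mapping label strings to COCO category IDs is left to the caller because the source does not show how the project does it.

```python
def to_coco_results(image_id, boxes, category_ids, scores):
    """Convert xyxy detections for one image into COCO result dicts."""
    results = []
    for (x1, y1, x2, y2), cat_id, score in zip(boxes, category_ids, scores):
        results.append({
            "image_id": int(image_id),
            "category_id": int(cat_id),
            # COCO expects top-left corner plus width and height
            "bbox": [float(x1), float(y1), float(x2 - x1), float(y2 - y1)],
            "score": float(score),
        })
    return results


detections = to_coco_results(
    image_id=1,
    boxes=[[10, 20, 30, 60]],
    category_ids=[1],  # 1 is "person" in the COCO annotations
    scores=[0.9],
)
print(detections[0]["bbox"])  # [10.0, 20.0, 20.0, 40.0]
```

A list built this way can be loaded with `COCO.loadRes` and passed to `COCOeval(coco_gt, coco_dt, "bbox")`, followed by `evaluate()`, `accumulate()`, and `summarize()`.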
**Quick detection on single image:**
```bash
python -c "
from ovod.pipeline import OVODPipeline
pipeline = OVODPipeline(device='cuda')
result = pipeline('image.jpg', 'person . car .')
print(f'Detected {len(result[\"boxes\"])} objects')
"
```
**Batch evaluation script:**
```python
from ovod.pipeline import OVODPipeline
from pathlib import Path
pipeline = OVODPipeline(device="cuda")
images = Path("data/images").glob("*.jpg")
for img in images:
    result = pipeline(str(img), "person . car . dog .")
    print(f"{img.name}: {len(result['boxes'])} detections in {result['timings']['total_time']:.2f}s")
```
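When running over many images, it is often more useful to aggregate latency than to read it line by line. The sketch below summarizes a list of per-image times. It assumes you collect `result['timings']['total_time']` (in seconds, as the loop above formats it) into a list yourself, and `summarize_latency` is a hypothetical helper.

```python
import statistics


def summarize_latency(total_times):
    """Return mean, median, and max of per-image latencies in seconds."""
    if not total_times:
        raise ValueError("no timings collected")
    return {
        "mean_s": statistics.mean(total_times),
        "median_s": statistics.median(total_times),
        "max_s": max(total_times),
    }


# Example values standing in for collected total_time entries
summary = summarize_latency([0.27, 0.31, 0.49])
print(f"median {summary['median_s']:.2f}s, max {summary['max_s']:.2f}s")
```

The first image is typically slower because of model warm-up, so consider discarding it before summarizing.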