GrayZoneBench evaluates AI models on the "gray zone" between helpful and harmful responses — ambiguous scenarios where a model must navigate between over-refusal and unsafe compliance. It implements OpenAI's safe-completion paradigm with multiplicative scoring and multi-tier evaluation.
This skill helps you run comprehensive AI safety benchmarks that measure both the safety and the helpfulness of model responses.
Before running benchmarks, ensure:
1. **Install dependencies:**
```bash
uv sync
```
2. **Set required environment variables:**
- OpenAI: `OPENAI_API_KEY`
- Anthropic (Bedrock): `AWS_PROFILE` or `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` + `AWS_REGION`
- Google (Gemini): `GOOGLE_API_KEY`
- Optional GCS upload: `GCS_SERVICE_ACCOUNT` + `GCS_BUCKET_NAME`
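As an illustration, a minimal credential setup for OpenAI and Gemini access (the values below are placeholders — substitute your own; Bedrock users would export the AWS variables instead):

```shell
# Placeholder credentials — replace with your own before running.
export OPENAI_API_KEY="sk-your-openai-key"
export GOOGLE_API_KEY="your-gemini-key"

# Bedrock via a named profile (or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY).
export AWS_PROFILE="your-aws-profile"
export AWS_REGION="us-east-1"
```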
Test multiple models on a single prompt:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 1
```
Run comprehensive safety and helpfulness grading:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 50 \
--info
```
Test models on dual-use scenarios with context:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config contextual \
--context-column context \
--num-prompts 25
```
Process prompts in windows for large-scale evaluation:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--start-index 50 --num-prompts 25
```
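The windowing flags lend themselves to a scripted sweep. This sketch is a dry run — it only echoes each command (the window boundaries and model list are illustrative); drop the `echo` to actually launch the runs:

```shell
# Sweep 100 prompts in four windows of 25 (indices are illustrative).
for start in $(seq 0 25 75); do
  echo uv run python gray-zone-bench.py \
    --models gpt-5-mini o4-mini \
    --judge-model gpt-5-mini \
    --judge-task both \
    --hf-dataset walledai/HarmBench --hf-config standard \
    --start-index "$start" --num-prompts 25
done
```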
**Dataset Selection:**
- `--hf-dataset` — Hugging Face dataset ID (e.g. `walledai/HarmBench`)
- `--hf-config` — dataset configuration (e.g. `standard`, `contextual`)
- `--context-column` — dataset column supplying per-prompt context

**Model Parameters:**
- `--models` — one or more target models to evaluate
- `--judge-model` — model used for grading
- `--judge-task` — grading task; `both` runs safety and helpfulness grading

**Batch Processing:**
- `--start-index` — index of the first prompt to process
- `--num-prompts` — number of prompts to process

**Output Control:**
- `--info` — include additional information in the output (as in the comprehensive grading example above)
Results are saved to timestamped directories under `out/`:
```
out/
├── harmbench_standard/
│ ├── gpt-5-mini/
│ │ ├── row000001.md # Human-readable summary
│ │ ├── row000001.response.json # Full model response
│ │ └── row000001.judge.response.json # Judge evaluation with tier results
│ ├── results_2025-08-12T07-04-35.json # Aggregate results with scores
│ └── run.log # Execution logs
```
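Because the aggregate files carry ISO-style timestamps, the newest one sorts last lexicographically. Here is a small helper to load it — a sketch that assumes only the `results_*.json` naming shown above, not any particular JSON schema:

```python
import json
from pathlib import Path

def load_latest_results(run_dir: str) -> dict:
    """Return the newest results_*.json in run_dir, parsed as a dict.

    Relies only on the timestamped naming shown in the layout above;
    the JSON structure itself is whatever the benchmark wrote.
    """
    paths = sorted(Path(run_dir).glob("results_*.json"))
    if not paths:
        raise FileNotFoundError(f"no results_*.json under {run_dir}")
    # ISO-style timestamps sort lexicographically, so the last path is newest.
    return json.loads(paths[-1].read_text())
```

For example: `scores = load_latest_results("out/harmbench_standard")`.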
Each sample generates three files:
- `rowNNNNNN.md` — a human-readable summary
- `rowNNNNNN.response.json` — the full model response
- `rowNNNNNN.judge.response.json` — the judge evaluation with tier results
**Core Components:**
**Evaluation Framework:**
This tool implements OpenAI's safe-completion paradigm from "From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training," focusing on output-centric safety evaluation rather than hard refusals.