Run comprehensive gray-zone safety evaluations on AI models using GrayZoneBench, a safety research tool that implements OpenAI's safe-completion paradigm. This skill helps you run benchmarks, analyze results, and assess how models navigate ambiguous scenarios between helpful and harmful responses, scoring helpfulness, safety, and gray-zone navigation with a multi-tier framework.
GrayZoneBench evaluates AI models across three critical dimensions:
1. **Helpfulness**: how useful the response is to the user's underlying request
2. **Safety**: whether the response avoids contributing to harm
3. **Gray-zone navigation**: how well the model handles ambiguous prompts that sit between helpful and harmful
The system uses a three-tier evaluation framework:
1. **Deterministic rules**: Fast pattern-based classification
2. **Moderation API**: OpenAI's moderation endpoint for policy violations
3. **Agent-based LLM judgment**: Advanced evaluation using judge models
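As a rough illustration of the tier order, here is a minimal shell sketch. The refusal pattern in tier 1 is a made-up placeholder, not the tool's actual rules; only the moderation call reflects a real API (the `flagged` field is part of the moderation endpoint's response format).

```bash
#!/usr/bin/env bash
# Illustrative only: cheap checks run first, expensive LLM judgment last.
response_text="$1"   # the model output under evaluation

# Tier 1: deterministic rules (placeholder pattern, not the tool's real rules)
if grep -qiE "i can.t help with that|i cannot assist" <<<"$response_text"; then
  echo "tier1: refusal detected"; exit 0
fi

# Tier 2: OpenAI moderation endpoint
flagged=$(curl -s https://api.openai.com/v1/moderations \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg t "$response_text" '{input: $t}')" | jq -r '.results[0].flagged')
if [ "$flagged" = "true" ]; then
  echo "tier2: flagged by moderation"; exit 0
fi

# Tier 3: no fast verdict, so escalate to the judge model (--judge-model in the CLI)
echo "tier3: escalate to LLM judge"
```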
Before running benchmarks, ensure you have:
1. **Environment variables set** for your model providers:
```bash
# OpenAI (required for most benchmarks)
export OPENAI_API_KEY=your_key_here
# Anthropic via Bedrock (optional)
export AWS_PROFILE=your_profile
# OR
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1
# Google Gemini (optional)
export GOOGLE_API_KEY=your_google_api_key
# GCS for result uploads (optional)
export GCS_SERVICE_ACCOUNT='{"type":"service_account",...}'
export GCS_BUCKET_NAME=grayzonebench-results
```
2. **Dependencies installed** (a combined sanity check is sketched after this list):
```bash
uv sync
```
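Optionally, run a quick sanity check before your first benchmark. This is a minimal sketch, assuming the script exposes a standard `--help`; the variable list mirrors the exports above:

```bash
# Warn about unset provider credentials (only OPENAI_API_KEY is required here)
for v in OPENAI_API_KEY AWS_PROFILE GOOGLE_API_KEY; do
  [ -n "${!v:-}" ] || echo "note: $v is not set"
done

# Confirm the entry point resolves and its dependencies import cleanly
uv run python gray-zone-bench.py --help
```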
Test multiple models on one prompt to verify setup:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 1
```
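If the smoke test succeeds, the output tree described under the results section below should contain one row per model; a quick look (path pattern assumed from that tree):

```bash
# Each model directory should hold a row000001.* triple after the smoke test
ls out/harmbench_standard/*/row000001* 2>/dev/null
```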
Run comprehensive evaluation on 50 prompts:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 50 \
--info
```
Evaluate models on scenarios requiring contextual judgment:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config contextual \
--context-column context \
--num-prompts 25
```
Process prompts in batches using start index:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--start-index 50 --num-prompts 25
```
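The same flags compose into a loop for sweeping a larger slice of the dataset; a sketch assuming a 200-prompt budget in batches of 25 (the batch size and total are arbitrary choices):

```bash
# Hypothetical sweep: eight consecutive 25-prompt batches
for start in $(seq 0 25 175); do
  uv run python gray-zone-bench.py \
    --models gpt-5-mini o4-mini \
    --judge-model gpt-5-mini \
    --judge-task both \
    --hf-dataset walledai/HarmBench --hf-config standard \
    --start-index "$start" --num-prompts 25
done
```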
**Dataset Selection:** `--hf-dataset` and `--hf-config` pick the Hugging Face dataset and configuration; `--context-column` supplies per-row context; `--category-filter` restricts evaluation to specific harm categories.
**Model Parameters:** `--models` lists the models under test; `--judge-model` and `--judge-task` configure the judge.
**Batch Processing:** `--start-index` and `--num-prompts` select which slice of the dataset to evaluate.
**Output Control:** `--info` and `--debug` increase logging detail; results are written under `out/` (see below).
Results are saved to timestamped directories under `out/`:
```
out/
├── harmbench_standard/
│   ├── gpt-5-mini/
│   │   ├── row000001.md                    # Human-readable summary
│   │   ├── row000001.response.json         # Full model response
│   │   └── row000001.judge.response.json   # Judge evaluation with scores
│   ├── results_2025-08-12T07-04-35.json    # Aggregate results
│   └── run.log
```
**Per-sample files:** `rowNNNNNN.md` (human-readable summary), `rowNNNNNN.response.json` (full model response), and `rowNNNNNN.judge.response.json` (judge evaluation with scores).
**Aggregate results** include combined scores across all evaluated prompts for each model, written to a timestamped `results_*.json` alongside `run.log`.
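To locate and pretty-print the newest aggregate file (jq assumed available; the JSON schema is the tool's own, so it is shown unparsed rather than queried by field):

```bash
# Newest aggregate results for the harmbench_standard run
latest=$(ls -t out/harmbench_standard/results_*.json | head -n 1)
jq . "$latest"
```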
1. Start with a single prompt smoke test:
```bash
uv run python gray-zone-bench.py \
--models your-new-model \
--judge-model gpt-5-mini \
--judge-task both \
--num-prompts 1 --info
```
2. If successful, run on 10-25 prompts for initial assessment
3. Perform full evaluation (50-100 prompts) for publication-ready results, as in the example below
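A publication-scale version of step 3, reusing only flags shown earlier (`your-new-model` remains a placeholder for your model identifier):

```bash
uv run python gray-zone-bench.py \
  --models your-new-model \
  --judge-model gpt-5-mini \
  --judge-task both \
  --hf-dataset walledai/HarmBench --hf-config standard \
  --num-prompts 100 --info
```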
Compare models with different safety strategies:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-5-sonnet gemini-2.0-flash \
--judge-model gpt-5-mini \
--judge-task both \
--num-prompts 50 \
--category-filter "Illegal activity,Malware"
```
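After a comparison run, a quick way to confirm every model completed the same number of samples (paths follow the results tree below; adjust the dataset directory to match your run):

```bash
# Count completed per-sample summaries for each model
for m in out/harmbench_standard/*/; do
  printf '%s\t%s\n' "$m" "$(ls "$m"row*.md 2>/dev/null | wc -l)"
done
```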
Use `--debug` and `--num-prompts 1` to inspect individual responses:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--num-prompts 1 \
--debug
```
Then examine the generated `.md` and `.json` files in `out/` for detailed analysis.
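For example, to skim the newest summary and its judge output (paths assumed from the results tree above):

```bash
# Newest human-readable summary and the matching judge JSON
ls -t out/*/*/row*.md | head -n 1 | xargs cat
ls -t out/*/*/row*.judge.response.json | head -n 1 | xargs jq .
```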
This tool implements OpenAI's safe-completion paradigm from "From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training." Key evaluation dimensions: helpfulness, safety, and gray-zone navigation, scored through the three-tier framework described above.
**Model not found errors:** Check the model identifier spelling and confirm the matching provider credentials (`OPENAI_API_KEY`, `AWS_PROFILE`/AWS keys, or `GOOGLE_API_KEY`) are exported.
**Low helpfulness/safety scores:** Re-run with `--debug --num-prompts 1` and inspect the per-sample `.md` and judge `.json` files before drawing conclusions.
**Rate limit errors:** Evaluate smaller slices with `--start-index` and `--num-prompts` (see the batch loop above) and rerun once limits reset.
**Missing results files:** Check `run.log` in the output directory under `out/` for errors during the run.