Run comprehensive gray-zone safety evaluations on AI models using GrayZoneBench, a safety research tool that implements OpenAI's safe-completion paradigm. This skill helps you run benchmarks, analyze results, and assess how models navigate ambiguous scenarios between helpful and harmful responses, scoring helpfulness, safety, and gray-zone navigation with a multi-tier framework.
GrayZoneBench evaluates AI models across three critical dimensions:
1. **Helpfulness**: how useful the response is to the user's underlying request
2. **Safety**: whether the response avoids contributing to harm
3. **Gray-zone navigation**: how well the model handles ambiguous prompts that sit between helpful and harmful
The system uses a three-tier evaluation framework:
1. **Deterministic rules**: Fast pattern-based classification
2. **Moderation API**: OpenAI's moderation endpoint for policy violations
3. **Agent-based LLM judgment**: Advanced evaluation using judge models
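As a rough illustration of the tier order, here is a minimal shell sketch. The refusal pattern in tier 1 is a made-up placeholder, not the tool's actual rules; only the moderation call reflects a real API (the `flagged` field is part of the moderation endpoint's response format).

```bash
#!/usr/bin/env bash
# Illustrative only: cheap checks run first, expensive LLM judgment last.
response_text="$1"   # the model output under evaluation

# Tier 1: deterministic rules (placeholder pattern, not the tool's real rules)
if grep -qiE "i can.t help with that|i cannot assist" <<<"$response_text"; then
  echo "tier1: refusal detected"; exit 0
fi

# Tier 2: OpenAI moderation endpoint
flagged=$(curl -s https://api.openai.com/v1/moderations \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg t "$response_text" '{input: $t}')" | jq -r '.results[0].flagged')
if [ "$flagged" = "true" ]; then
  echo "tier2: flagged by moderation"; exit 0
fi

# Tier 3: no fast verdict, so escalate to the judge model (--judge-model in the CLI)
echo "tier3: escalate to LLM judge"
```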
Before running benchmarks, ensure you have:
1. **Environment variables set** for your model providers:
```bash
# OpenAI (required for most benchmarks)
export OPENAI_API_KEY=your_key_here
# Anthropic via Bedrock (optional)
export AWS_PROFILE=your_profile
# OR
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1
# Google Gemini (optional)
export GOOGLE_API_KEY=your_google_api_key
# GCS for result uploads (optional)
export GCS_SERVICE_ACCOUNT='{"type":"service_account",...}'
export GCS_BUCKET_NAME=grayzonebench-results
```
2. **Dependencies installed** (a combined sanity check is sketched after this list):
```bash
uv sync
```
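Optionally, run a quick sanity check before your first benchmark. This is a minimal sketch, assuming the script exposes a standard `--help`; the variable list mirrors the exports above:

```bash
# Warn about unset provider credentials (only OPENAI_API_KEY is required here)
for v in OPENAI_API_KEY AWS_PROFILE GOOGLE_API_KEY; do
  [ -n "${!v:-}" ] || echo "note: $v is not set"
done

# Confirm the entry point resolves and its dependencies import cleanly
uv run python gray-zone-bench.py --help
```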
Test multiple models on one prompt to verify setup:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 1
```
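If the smoke test succeeds, the output tree described under the results section below should contain one row per model; a quick look (path pattern assumed from that tree):

```bash
# Each model directory should hold a row000001.* triple after the smoke test
ls out/harmbench_standard/*/row000001* 2>/dev/null
```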
Run comprehensive evaluation on 50 prompts:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 50 \
--info
```
Evaluate models on scenarios requiring contextual judgment:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config contextual \
--context-column context \
--num-prompts 25
```
Process prompts in batches using start index:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--start-index 50 --num-prompts 25
```
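The same flags compose into a loop for sweeping a larger slice of the dataset; a sketch assuming a 200-prompt budget in batches of 25 (the batch size and total are arbitrary choices):

```bash
# Hypothetical sweep: eight consecutive 25-prompt batches
for start in $(seq 0 25 175); do
  uv run python gray-zone-bench.py \
    --models gpt-5-mini o4-mini \
    --judge-model gpt-5-mini \
    --judge-task both \
    --hf-dataset walledai/HarmBench --hf-config standard \
    --start-index "$start" --num-prompts 25
done
```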
**Dataset Selection:** `--hf-dataset` and `--hf-config` pick the Hugging Face dataset and configuration; `--context-column` supplies per-row context; `--category-filter` restricts evaluation to specific harm categories.
**Model Parameters:** `--models` lists the models under test; `--judge-model` and `--judge-task` configure the judge.
**Batch Processing:** `--start-index` and `--num-prompts` select which slice of the dataset to evaluate.
**Output Control:** `--info` and `--debug` increase logging detail; results are written under `out/` (see below).
Results are saved to timestamped directories under `out/`:
```
out/
├── harmbench_standard/
│   ├── gpt-5-mini/
│   │   ├── row000001.md                    # Human-readable summary
│   │   ├── row000001.response.json         # Full model response
│   │   └── row000001.judge.response.json   # Judge evaluation with scores
│   ├── results_2025-08-12T07-04-35.json    # Aggregate results
│   └── run.log
```
**Per-sample files:** `rowNNNNNN.md` (human-readable summary), `rowNNNNNN.response.json` (full model response), and `rowNNNNNN.judge.response.json` (judge evaluation with scores).
**Aggregate results** include combined scores across all evaluated prompts for each model, written to a timestamped `results_*.json` alongside `run.log`.
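To locate and pretty-print the newest aggregate file (jq assumed available; the JSON schema is the tool's own, so it is shown unparsed rather than queried by field):

```bash
# Newest aggregate results for the harmbench_standard run
latest=$(ls -t out/harmbench_standard/results_*.json | head -n 1)
jq . "$latest"
```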
1. Start with a single prompt smoke test:
```bash
uv run python gray-zone-bench.py \
--models your-new-model \
--judge-model gpt-5-mini \
--judge-task both \
--num-prompts 1 --info
```
2. If successful, run on 10-25 prompts for initial assessment
3. Perform full evaluation (50-100 prompts) for publication-ready results, as in the example below
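A publication-scale version of step 3, reusing only flags shown earlier (`your-new-model` remains a placeholder for your model identifier):

```bash
uv run python gray-zone-bench.py \
  --models your-new-model \
  --judge-model gpt-5-mini \
  --judge-task both \
  --hf-dataset walledai/HarmBench --hf-config standard \
  --num-prompts 100 --info
```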
Compare models with different safety strategies:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-5-sonnet gemini-2.0-flash \
--judge-model gpt-5-mini \
--judge-task both \
--num-prompts 50 \
--category-filter "Illegal activity,Malware"
```
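After a comparison run, a quick way to confirm every model completed the same number of samples (paths follow the results tree below; adjust the dataset directory to match your run):

```bash
# Count completed per-sample summaries for each model
for m in out/harmbench_standard/*/; do
  printf '%s\t%s\n' "$m" "$(ls "$m"row*.md 2>/dev/null | wc -l)"
done
```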
Use `--debug` and `--num-prompts 1` to inspect individual responses:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--num-prompts 1 \
--debug
```
Then examine the generated `.md` and `.json` files in `out/` for detailed analysis.
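For example, to skim the newest summary and its judge output (paths assumed from the results tree above):

```bash
# Newest human-readable summary and the matching judge JSON
ls -t out/*/*/row*.md | head -n 1 | xargs cat
ls -t out/*/*/row*.judge.response.json | head -n 1 | xargs jq .
```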
This tool implements OpenAI's safe-completion paradigm from "From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training." Key evaluation dimensions: helpfulness, safety, and gray-zone navigation, scored through the three-tier framework described above.
**Model not found errors:** Check the model identifier spelling and confirm the matching provider credentials (`OPENAI_API_KEY`, `AWS_PROFILE`/AWS keys, or `GOOGLE_API_KEY`) are exported.
**Low helpfulness/safety scores:** Re-run with `--debug --num-prompts 1` and inspect the per-sample `.md` and judge `.json` files before drawing conclusions.
**Rate limit errors:** Evaluate smaller slices with `--start-index` and `--num-prompts` (see the batch loop above) and rerun once limits reset.
**Missing results files:** Check `run.log` in the output directory under `out/` for errors during the run.