GrayZoneBench evaluates AI models on the "gray zone" between helpful and harmful responses — ambiguous scenarios where a model must navigate between over-refusal and unsafe compliance. It implements OpenAI's safe-completion paradigm with multiplicative scoring and multi-tier evaluation.
This skill helps you run comprehensive AI safety benchmarks that measure both the safety and the helpfulness of model responses.
Before running benchmarks, ensure:
1. **Install dependencies:**
```bash
uv sync
```
2. **Set required environment variables:**
- OpenAI: `OPENAI_API_KEY`
- Anthropic (Bedrock): `AWS_PROFILE` or `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` + `AWS_REGION`
- Google (Gemini): `GOOGLE_API_KEY`
- Optional GCS upload: `GCS_SERVICE_ACCOUNT` + `GCS_BUCKET_NAME`
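As an illustration, a minimal credential setup for OpenAI and Gemini access (the values below are placeholders — substitute your own; Bedrock users would export the AWS variables instead):

```shell
# Placeholder credentials — replace with your own before running.
export OPENAI_API_KEY="sk-your-openai-key"
export GOOGLE_API_KEY="your-gemini-key"

# Bedrock via a named profile (or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY).
export AWS_PROFILE="your-aws-profile"
export AWS_REGION="us-east-1"
```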
Test multiple models on a single prompt:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 1
```
Run comprehensive safety and helpfulness grading:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--num-prompts 50 \
--info
```
Test models on dual-use scenarios with context:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config contextual \
--context-column context \
--num-prompts 25
```
Process prompts in windows for large-scale evaluation:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench --hf-config standard \
--start-index 50 --num-prompts 25
```
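The windowing flags lend themselves to a scripted sweep. This sketch is a dry run — it only echoes each command (the window boundaries and model list are illustrative); drop the `echo` to actually launch the runs:

```shell
# Sweep 100 prompts in four windows of 25 (indices are illustrative).
for start in $(seq 0 25 75); do
  echo uv run python gray-zone-bench.py \
    --models gpt-5-mini o4-mini \
    --judge-model gpt-5-mini \
    --judge-task both \
    --hf-dataset walledai/HarmBench --hf-config standard \
    --start-index "$start" --num-prompts 25
done
```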
**Dataset Selection:**
- `--hf-dataset` — Hugging Face dataset ID (e.g. `walledai/HarmBench`)
- `--hf-config` — dataset configuration (e.g. `standard`, `contextual`)
- `--context-column` — dataset column supplying per-prompt context

**Model Parameters:**
- `--models` — one or more target models to evaluate
- `--judge-model` — model used for grading
- `--judge-task` — grading task; `both` runs safety and helpfulness grading

**Batch Processing:**
- `--start-index` — index of the first prompt to process
- `--num-prompts` — number of prompts to process

**Output Control:**
- `--info` — include additional information in the output (as in the comprehensive grading example above)
Results are saved to timestamped directories under `out/`:
```
out/
├── harmbench_standard/
│ ├── gpt-5-mini/
│ │ ├── row000001.md # Human-readable summary
│ │ ├── row000001.response.json # Full model response
│ │ └── row000001.judge.response.json # Judge evaluation with tier results
│ ├── results_2025-08-12T07-04-35.json # Aggregate results with scores
│ └── run.log # Execution logs
```
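Because the aggregate files carry ISO-style timestamps, the newest one sorts last lexicographically. Here is a small helper to load it — a sketch that assumes only the `results_*.json` naming shown above, not any particular JSON schema:

```python
import json
from pathlib import Path

def load_latest_results(run_dir: str) -> dict:
    """Return the newest results_*.json in run_dir, parsed as a dict.

    Relies only on the timestamped naming shown in the layout above;
    the JSON structure itself is whatever the benchmark wrote.
    """
    paths = sorted(Path(run_dir).glob("results_*.json"))
    if not paths:
        raise FileNotFoundError(f"no results_*.json under {run_dir}")
    # ISO-style timestamps sort lexicographically, so the last path is newest.
    return json.loads(paths[-1].read_text())
```

For example: `scores = load_latest_results("out/harmbench_standard")`.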
Each sample generates three files:
- `rowNNNNNN.md` — a human-readable summary
- `rowNNNNNN.response.json` — the full model response
- `rowNNNNNN.judge.response.json` — the judge evaluation with tier results
**Core Components:**
**Evaluation Framework:**
This tool implements OpenAI's safe-completion paradigm from "From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training," focusing on output-centric safety evaluation rather than hard refusals.