Evaluate how AI models handle ambiguous "gray zone" requests between helpful and harmful using the GrayZoneBench toolkit. The benchmark follows OpenAI's safe-completion paradigm and grades responses with multiplicative scoring (helpfulness × safety).
This skill enables you to:
- Run comprehensive safety benchmarks across OpenAI, Anthropic, and Google models
- Analyze model behavior on ambiguous, dual-use requests
- Assess gray-zone navigation with a multi-tier automated judging pipeline
The benchmark uses a **three-tier evaluation system**:
1. **Deterministic rules**: Fast pattern matching for obvious cases
2. **Moderation API**: OpenAI's moderation endpoint for automated classification
3. **Agent-based LLM judgment**: Advanced reasoning for edge cases
**Scoring system**: Each response receives a helpfulness score and a safety score, and the overall grade is their product (helpfulness × safety), so a low score on either dimension drags down the total.
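A quick numeric illustration of why the product is unforgiving (the values are made up; the actual score scale is set by the judge):

```bash
# A very helpful but marginally safe answer still grades poorly overall.
awk 'BEGIN { helpfulness = 0.9; safety = 0.5; printf "overall = %.2f\n", helpfulness * safety }'
# prints: overall = 0.45
```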
Install dependencies with uv:
```bash
uv sync
```
**Required for OpenAI models:**
```bash
export OPENAI_API_KEY=your_key_here
```
**Required for Anthropic Claude (via AWS Bedrock):**
```bash
export AWS_PROFILE=your_profile
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1
```
**Required for Google Gemini models:**
```bash
export GOOGLE_API_KEY=your_google_api_key
```
**Optional for automatic result uploads (GCS):**
```bash
export GCS_SERVICE_ACCOUNT='{"type":"service_account","project_id":"..."}'
export GCS_BUCKET_NAME=grayzonebench-results
```
Test multiple models on a single prompt to verify setup:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--num-prompts 1
```
Run a comprehensive evaluation on 50 prompts:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--num-prompts 50 \
--info
```
Evaluate models on ambiguous requests with legitimate contexts:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config contextual \
--context-column context \
--num-prompts 25
```
Process large datasets in chunks:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--start-index 50 \
--num-prompts 25
```
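For a resumable run over a larger dataset, the same command can be wrapped in a shell loop that advances `--start-index` by a fixed chunk size (a sketch; the chunk size and total of 200 prompts are illustrative):

```bash
# Process 200 prompts in resumable chunks of 25.
for start in $(seq 0 25 175); do
  uv run python gray-zone-bench.py \
    --models gpt-5-mini o4-mini \
    --judge-model gpt-5-mini \
    --judge-task both \
    --hf-dataset walledai/HarmBench \
    --hf-config standard \
    --start-index "$start" \
    --num-prompts 25
done
```

If a chunk fails, rerun the loop from that chunk's start index; earlier chunks are untouched.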
Focus on specific harm categories:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--category-filter "Violence,Hate Speech" \
--num-prompts 10
```
Results are saved under `out/`, organized by dataset config and model, with timestamped aggregate files:
```
out/
├── harmbench_standard/                    # Based on --hf-config
│   ├── gpt-5-mini/                        # Per-model directories
│   │   ├── row000001.md                   # Human-readable summary
│   │   ├── row000001.response.json        # Full model response
│   │   └── row000001.judge.response.json  # Judge evaluation
│   ├── results_2025-08-12T07-04-35.json   # Aggregate results
│   └── run.log                            # Execution logs
```
Each evaluated prompt generates:
1. **Markdown summary** (`.md`): Human-readable analysis with rationales and tier usage
2. **Response JSON** (`.response.json`): Raw model output and token usage
3. **Judge JSON** (`.judge.response.json`): Safety/helpfulness scores, confidence, tier breakdowns
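To spot-check a single evaluation from the command line, you can pretty-print the judge output with `jq`. The field names below are assumptions based on the description above; inspect a real `.judge.response.json` to confirm the actual schema:

```bash
# Hypothetical field names -- check an actual file for the real schema.
jq '{safety, helpfulness, confidence, tier}' \
  out/harmbench_standard/gpt-5-mini/row000001.judge.response.json
```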
The `results_*.json` file aggregates scores across every model and prompt in the run.
**Supported providers:** OpenAI, Anthropic (via AWS Bedrock), and Google. Model provider is auto-detected from naming patterns.
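The detection presumably keys off well-known model-name prefixes; a rough sketch of the idea (the exact patterns are an assumption, not the tool's actual logic):

```bash
# Guess the provider from a model-name prefix (illustrative patterns only).
detect_provider() {
  case "$1" in
    gpt-*|o[0-9]*)  echo openai ;;
    claude*)        echo anthropic ;;
    gemini*)        echo google ;;
    *)              echo unknown ;;
  esac
}

detect_provider gpt-5-mini   # -> openai
detect_provider o4-mini      # -> openai
```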
This benchmark implements OpenAI's safe-completion paradigm from "From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training."
**Key evaluation dimensions:** helpfulness and safety, which the multiplicative score rewards only in combination.

**Best practices:**
1. **Start small**: Run smoke tests with `--num-prompts 1` before large batches
2. **Use batching**: Process large datasets with `--start-index` and `--num-prompts` for resumability
3. **Monitor costs**: Judge models incur API costs; use cheaper models for initial testing
4. **Review tier usage**: Check `.judge.response.json` files to see which evaluation tier was used
5. **Analyze gray zones**: Focus on samples where models disagree or where scores are borderline (see the filter sketch after this list)
6. **Validate contextual**: Use `--hf-config contextual` to test dual-use scenario handling
7. **Compare models**: Run identical configs across multiple `--models` for fair comparison
8. **Archive results**: Results are timestamped; use GCS integration for centralized storage
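One way to surface borderline samples is to filter the aggregate results for rows where the two scores pull in opposite directions. The snippet below is a sketch only: the field names, the 0–1 score scale, and the top-level array layout are all assumptions, so adapt it to the actual `results_*.json` schema:

```bash
# Hypothetical schema: a top-level array of rows with numeric scores.
# Keep rows that were helpful but judged marginally unsafe.
jq '[ .[] | select(.helpfulness >= 0.5 and .safety < 0.5) ]' \
  out/harmbench_standard/results_2025-08-12T07-04-35.json
```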