Evaluate how AI models handle ambiguous "gray zone" requests between helpful and harmful using the GrayZoneBench toolkit. The benchmark follows OpenAI's safe-completion paradigm and grades responses with multiplicative scoring (helpfulness × safety).
This skill enables you to:
- Run comprehensive safety benchmarks across OpenAI, Anthropic, and Google models
- Analyze model behavior on ambiguous, dual-use requests
- Assess gray-zone navigation with a multi-tier automated judging pipeline
The benchmark uses a **three-tier evaluation system**:
1. **Deterministic rules**: Fast pattern matching for obvious cases
2. **Moderation API**: OpenAI's moderation endpoint for automated classification
3. **Agent-based LLM judgment**: Advanced reasoning for edge cases
**Scoring system**: Each response receives a helpfulness score and a safety score, and the overall grade is their product (helpfulness × safety), so a low score on either dimension drags down the total.
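A quick numeric illustration of why the product is unforgiving (the values are made up; the actual score scale is set by the judge):

```bash
# A very helpful but marginally safe answer still grades poorly overall.
awk 'BEGIN { helpfulness = 0.9; safety = 0.5; printf "overall = %.2f\n", helpfulness * safety }'
# prints: overall = 0.45
```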
Install dependencies with uv:
```bash
uv sync
```
**Required for OpenAI models:**
```bash
export OPENAI_API_KEY=your_key_here
```
**Required for Anthropic Claude (via AWS Bedrock):**
```bash
export AWS_PROFILE=your_profile
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1
```
**Required for Google Gemini models:**
```bash
export GOOGLE_API_KEY=your_google_api_key
```
**Optional for automatic result uploads (GCS):**
```bash
export GCS_SERVICE_ACCOUNT='{"type":"service_account","project_id":"..."}'
export GCS_BUCKET_NAME=grayzonebench-results
```
Test multiple models on a single prompt to verify setup:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--num-prompts 1
```
Run a comprehensive evaluation on 50 prompts:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--num-prompts 50 \
--info
```
Evaluate models on ambiguous requests with legitimate contexts:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config contextual \
--context-column context \
--num-prompts 25
```
Process large datasets in chunks:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--start-index 50 \
--num-prompts 25
```
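For a resumable run over a larger dataset, the same command can be wrapped in a shell loop that advances `--start-index` by a fixed chunk size (a sketch; the chunk size and total of 200 prompts are illustrative):

```bash
# Process 200 prompts in resumable chunks of 25.
for start in $(seq 0 25 175); do
  uv run python gray-zone-bench.py \
    --models gpt-5-mini o4-mini \
    --judge-model gpt-5-mini \
    --judge-task both \
    --hf-dataset walledai/HarmBench \
    --hf-config standard \
    --start-index "$start" \
    --num-prompts 25
done
```

If a chunk fails, rerun the loop from that chunk's start index; earlier chunks are untouched.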
Focus on specific harm categories:
```bash
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset walledai/HarmBench \
--hf-config standard \
--category-filter "Violence,Hate Speech" \
--num-prompts 10
```
Results are saved under `out/`, organized by dataset config and model, with timestamped aggregate files:
```
out/
├── harmbench_standard/                    # Based on --hf-config
│   ├── gpt-5-mini/                        # Per-model directories
│   │   ├── row000001.md                   # Human-readable summary
│   │   ├── row000001.response.json        # Full model response
│   │   └── row000001.judge.response.json  # Judge evaluation
│   ├── results_2025-08-12T07-04-35.json   # Aggregate results
│   └── run.log                            # Execution logs
```
Each evaluated prompt generates:
1. **Markdown summary** (`.md`): Human-readable analysis with rationales and tier usage
2. **Response JSON** (`.response.json`): Raw model output and token usage
3. **Judge JSON** (`.judge.response.json`): Safety/helpfulness scores, confidence, tier breakdowns
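To spot-check a single evaluation from the command line, you can pretty-print the judge output with `jq`. The field names below are assumptions based on the description above; inspect a real `.judge.response.json` to confirm the actual schema:

```bash
# Hypothetical field names -- check an actual file for the real schema.
jq '{safety, helpfulness, confidence, tier}' \
  out/harmbench_standard/gpt-5-mini/row000001.judge.response.json
```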
The `results_*.json` file aggregates scores across every model and prompt in the run.
**Supported providers:** OpenAI, Anthropic (via AWS Bedrock), and Google. Model provider is auto-detected from naming patterns.
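The detection presumably keys off well-known model-name prefixes; a rough sketch of the idea (the exact patterns are an assumption, not the tool's actual logic):

```bash
# Guess the provider from a model-name prefix (illustrative patterns only).
detect_provider() {
  case "$1" in
    gpt-*|o[0-9]*)  echo openai ;;
    claude*)        echo anthropic ;;
    gemini*)        echo google ;;
    *)              echo unknown ;;
  esac
}

detect_provider gpt-5-mini   # -> openai
detect_provider o4-mini      # -> openai
```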
This benchmark implements OpenAI's safe-completion paradigm from "From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training."
**Key evaluation dimensions:** helpfulness and safety, which the multiplicative score rewards only in combination.

**Best practices:**
1. **Start small**: Run smoke tests with `--num-prompts 1` before large batches
2. **Use batching**: Process large datasets with `--start-index` and `--num-prompts` for resumability
3. **Monitor costs**: Judge models incur API costs; use cheaper models for initial testing
4. **Review tier usage**: Check `.judge.response.json` files to see which evaluation tier was used
5. **Analyze gray zones**: Focus on samples where models disagree or where scores are borderline (see the filter sketch after this list)
6. **Validate contextual**: Use `--hf-config contextual` to test dual-use scenario handling
7. **Compare models**: Run identical configs across multiple `--models` for fair comparison
8. **Archive results**: Results are timestamped; use GCS integration for centralized storage
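One way to surface borderline samples is to filter the aggregate results for rows where the two scores pull in opposite directions. The snippet below is a sketch only: the field names, the 0–1 score scale, and the top-level array layout are all assumptions, so adapt it to the actual `results_*.json` schema:

```bash
# Hypothetical schema: a top-level array of rows with numeric scores.
# Keep rows that were helpful but judged marginally unsafe.
jq '[ .[] | select(.helpfulness >= 0.5 and .safety < 0.5) ]' \
  out/harmbench_standard/results_2025-08-12T07-04-35.json
```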