Terminal-Bench PR Review

Reviews pull requests that implement Terminal-Bench benchmark tasks, ensuring they meet quality standards for agent evaluation.

What This Skill Does

When a PR creates files in the `tasks/**` directory, this skill automatically reviews the task implementation against 10 validation criteria, checking test coverage, anti-cheating measures, dependency management, and solution validity.

Instructions

When reviewing a Terminal-Bench task PR:

1. **Identify the PR scope**: Confirm the PR adds/modifies files in `tasks/**` directory

2. **Generate a concise summary** covering:

- What the task instruction asks the agent to do

- How the solution implements those requirements

- How the tests verify task completion

3. **Evaluate against all 10 criteria**:

**1. Behavior in Task Description**

- Check: Is all behavior tested in test cases also described in the task description?

- Flag: Missing descriptions for tested behaviors

**2. Behavior in Tests**

- Check: Is all behavior from task description validated by unit tests?

- Flag: Untested requirements

**3. Informative Test Docstrings**

- Check: Do test functions have clear docstrings explaining what behavior they verify?

- Flag: Missing or vague docstrings

**4. Anti-Cheating Measures**

- Check: Can the agent cheat by editing data files, searching for solution strings, or exploiting visible artifacts?

- Note: Tests and solutions are hidden from agents; static tests are acceptable

- Flag: Exploitable shortcuts (data file edits, solution string searches)

**5. Structured Data Schema**

- Check: If task produces APIs or structured output, is the exact schema documented in `task.yaml` or linked files?

- Flag: Missing or incomplete schema definitions

**6. Pinned Dependencies**

- Check: Are Docker images, pip packages, and other external dependencies version-pinned?

- Exclude: Common apt packages (curl, vim, etc.) don't require pinning

- Require: All Python dependencies must be pinned

- Flag: Unpinned versions

**7. Typos**

- Check: File names, variable names, instructions, and code comments for spelling errors

- Pay special attention: File paths and identifiers (these are easy to miss)

- Flag: Any typos found

**8. Tests or Solution in Image**

- Check: The `/tests` folder and solution files must NOT be copied into the Docker image

- Note: The harness copies `/tests` automatically after agent runs

- Flag: Tests or solution visible in image build

**9. Test Dependencies in Image**

- Check: Test dependencies should be installed in `run-tests.sh`, NOT during image build

- Flag: Test dependencies in Dockerfile

**10. Hardcoded Solution**

- PASS: Solution demonstrates computation steps (data processing, running scripts, command sequences)

- ACCEPTABLE: Using echo/cat to write source files that are then executed

- FAIL: Solution directly outputs final answer via echo/cat without performing agent-like computation

- Flag: Solutions that print answers without deriving them

4. **Format the response** as a structured list with:

- ✅ for criteria that pass

- ⚠️ for criteria that fail (with explanation)

- Brief notes for each item

5. **Final assessment**: Comment on whether the task is realistic, interesting, and non-adversarial

6. **Reference project conventions**: Consult the repository's `CLAUDE.md` for style guidance

7. **Tone**: Be constructive and helpful in feedback

Example Review Format

```markdown

Task Summary

The task asks the agent to [brief description]. The solution implements this by [approach]. Tests verify completion by [testing strategy].

Criteria Review

1. ✅ **Behavior in Task Description**: All tested behaviors are documented

2. ⚠️ **Behavior in Tests**: Missing test for error handling mentioned in task.yaml

3. ✅ **Informative Test Docstrings**: All test functions have clear docstrings

4. ✅ **Anti-Cheating Measures**: No exploitable shortcuts identified

5. ⚠️ **Structured Data Schema**: API response schema not fully specified

6. ✅ **Pinned Dependencies**: All dependencies version-pinned

7. ✅ **Typos**: No typos found

8. ✅ **Tests Not in Image**: /tests folder correctly excluded

9. ✅ **Test Dependencies**: Installed in run-tests.sh only

10. ✅ **Hardcoded Solution**: Solution demonstrates proper computation steps

Assessment

The task is realistic and interesting for benchmarking agent capabilities in [domain]. Suggestion: [constructive feedback].

```

Important Notes

Tests and solutions are NOT visible to the evaluated agent

Static, non-randomized tests are acceptable

Focus on whether the task fairly evaluates agent capabilities

Be specific when flagging issues (include file names, line numbers when relevant)

Terminal-Bench PR Review

Terminal-Bench PR Review

What This Skill Does

Instructions

Example Review Format

Task Summary

Criteria Review

Assessment

Important Notes

Reviews (0)