Reviews pull requests that implement Terminal-Bench benchmark tasks, checking task quality, test coverage, anti-cheating measures, and implementation best practices.
Analyzes Terminal-Bench task pull requests (tasks in the `tasks/**` directory) against 10 quality criteria, checking task descriptions, test coverage, security measures, dependency management, and solution validity. Provides structured feedback to maintain benchmark integrity.
When a pull request modifies or creates files in the `tasks/**` directory:
1. **Read the entire task submission** including:
- Task description file (typically `task.yaml`)
- Implementation files in the task directory
- Test files in the `tests/` subdirectory
- Solution file(s)
- Any Docker/container configuration
2. **Analyze against the 10 criteria**:
**1. Behavior in Task Description**
- Compare task description against test cases
- Flag any tested behavior not documented in the description
- Ensure completeness of task specification
**2. Behavior in Tests**
- Verify all described behavior has corresponding test coverage
- Check for gaps between description and test implementation
- Ensure no undocumented requirements exist
**3. Informative Test Docstrings**
- Review test case documentation
- Ensure docstrings clearly describe what behavior each test validates
- Flag vague or missing docstrings
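A sketch of the distinction a reviewer is looking for (both tests and the scraper behavior below are invented for illustration):

```python
# Vague docstring -- a reviewer should flag this:
def test_output():
    """Check output."""

# Informative docstring -- names the exact behavior being validated:
def test_scraper_retries_on_http_500():
    """The scraper retries a failed request up to 3 times before
    recording the URL as unreachable."""
```

The second style lets a reviewer verify criterion 1 and 2 coverage by reading docstrings alone, without tracing the test body.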
**4. Anti-Cheating Measures**
- Evaluate difficulty of cheating (data file manipulation, hardcoded solutions, test visibility)
- Remember: tests and solutions are NOT visible to agents
- Don't penalize static tests (agents can't see them)
- Flag if agent could trivially bypass the task
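One common pattern that makes cheating harder (a sketch, not taken from any specific Terminal-Bench task) is to recompute the expected answer inside the test from the raw input, rather than comparing against a stored value that an agent could have edited:

```python
import hashlib

def expected_digest(data: bytes) -> str:
    """Recompute the expected answer at test time from the raw input,
    so tampering with a stored 'expected.txt' file cannot pass."""
    return hashlib.sha256(data).hexdigest()

def check_submission(submitted: str, data: bytes) -> bool:
    """Accept the submission only if it matches the recomputed digest."""
    return submitted == expected_digest(data)
```

Because the tests are injected after the agent runs, the agent never sees this logic and cannot special-case it.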
**5. Structured Data Schema**
- If task produces APIs or structured output, verify schema is documented
- Check for schema definition in `task.yaml` or separate schema file
- Ensure data format expectations are explicit
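For instance, a task whose output is structured JSON might document the schema directly in `task.yaml` (the field names and schema key below are illustrative, not a Terminal-Bench requirement):

```yaml
# Hypothetical schema block inside task.yaml
output_schema:
  type: object
  required: [url, status_code, retries]
  properties:
    url: {type: string}
    status_code: {type: integer}
    retries: {type: integer, minimum: 0}
```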
**6. Pinned Dependencies**
- Verify all external dependencies have pinned versions
- Check Docker images, pip packages, npm packages, etc.
- Common apt packages (curl, vim, etc.) don't require pinning
- ALL Python dependencies MUST be pinned
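Pinning typically looks like the following (the image tag and package versions here are invented for illustration):

```dockerfile
# Pinned base image, not "python:latest"
FROM python:3.11.9-slim

# Every Python dependency gets an exact version
RUN pip install --no-cache-dir requests==2.31.0 beautifulsoup4==4.12.3

# Common apt packages may stay unpinned, per the criterion above
RUN apt-get update && apt-get install -y curl vim
```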
**7. Typos**
- Carefully review file names, variable names, command syntax
- Pay special attention to names that are easy to misread
- Check consistency across files
**8. Tests or Solution in Image**
- Verify `/tests` folder is NOT copied to container image
- Verify solution file is NOT copied to container image
- Confirm that the harness copies tests into the container after the agent's run completes
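A hypothetical Dockerfile sketch of what criterion 8 allows and forbids (paths are illustrative):

```dockerfile
# Build ONLY the task environment
FROM python:3.11.9-slim
COPY data/ /app/data/

# A reviewer should flag lines like these if present:
#   COPY tests/ /tests/       # criterion 8 violation: tests baked into image
#   COPY solution.sh /app/    # criterion 8 violation: solution baked into image
```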
**9. Test Dependencies in Image**
- Ensure test dependencies are NOT installed during image build
- Verify they're installed in `run-tests.sh` script instead
- Check Dockerfile doesn't include test-only packages
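The intended split for criterion 9 can be sketched as a `run-tests.sh` that installs its own dependencies at test time (package name and version here are illustrative):

```bash
#!/bin/bash
# Hypothetical run-tests.sh -- the harness executes this after the agent
# finishes, so test-only dependencies belong here, not in the Dockerfile.
pip install pytest==8.2.0
pytest /tests
```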
**10. Hardcoded Solution**
- Verify solution demonstrates computational steps, not just output
- PASS: Solution runs data processing, executes code, derives answer
- FAIL: Solution uses `echo`/`cat` to directly print final answer
- ACCEPTABLE: Using `echo`/`cat` to write source files that are then executed
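The three cases in criterion 10, sketched with an invented toy task (compute the sum of 0 through 99, which is 4950):

```shell
# FAIL: prints the final answer directly with no computation
echo "4950" > answer_hardcoded.txt

# PASS: derives the same answer computationally
seq 0 99 | awk '{s+=$1} END {print s}' > answer_derived.txt

# ACCEPTABLE: using cat to write a source file that is then executed
cat > sum.sh <<'EOF'
seq 0 99 | awk '{s+=$1} END {print s}'
EOF
bash sum.sh > answer_from_script.txt
```

All three produce identical output, which is exactly why reviewers must read the solution's steps rather than diff its results.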
3. **Evaluate task quality**:
- Is the task realistic and interesting?
- Is it non-adversarial?
- Does it test meaningful capabilities?
4. **Generate review comment**:
- Start with a concise summary of the task, its implementation, and the testing approach
- List findings for each of the 10 criteria as a structured list
- Use clear PASS/FAIL/ISSUE labels
- Provide specific examples and line numbers when flagging issues
- Be constructive and helpful
- Reference repository's `CLAUDE.md` for style guidance
```markdown
[2-3 sentences describing what the task tests and how]
1. **Behavior in Task Description**: [PASS/ISSUE] - [Details]
2. **Behavior in Tests**: [PASS/ISSUE] - [Details]
3. **Informative Test Docstrings**: [PASS/ISSUE] - [Details]
4. **Anti-Cheating Measures**: [PASS/ISSUE] - [Details]
5. **Structured Data Schema**: [PASS/N/A/ISSUE] - [Details]
6. **Pinned Dependencies**: [PASS/ISSUE] - [Details]
7. **Typos**: [PASS/ISSUE] - [Details]
8. **Tests or Solution in Image**: [PASS/ISSUE] - [Details]
9. **Test Dependencies in Image**: [PASS/ISSUE] - [Details]
10. **Hardcoded Solution**: [PASS/ISSUE] - [Details]
[Comment on whether task is realistic, interesting, and non-adversarial]
[Constructive feedback for improvements]
```
**Scenario**: PR adds new task for testing agent's ability to debug a web scraper
**Review identifies**:
**Output**: Structured review with specific line references and actionable fixes