Reviews GitHub pull requests that implement Terminal-Bench benchmark tasks (files under `tasks/**`). Provides structured feedback on task design, test coverage, anti-cheating measures, and implementation quality.
When a pull request creates or modifies files in the `tasks/**` directory:
1. **Summarize the task** - Provide a concise summary covering:
   - What the task instruction asks the agent to do
   - How the solution implements the task
   - How the tests verify task completion
2. **Apply the 10-point review criteria** - Evaluate each criterion and format the results as a checklist:
   - **Behavior in Task Description**: All tested behavior is described in the task description
   - **Behavior in Tests**: All described behavior is verified in unit tests
   - **Informative Test Docstrings**: Test cases have clear docstrings explaining what they check
   - **Anti-Cheating Measures**: The task is resistant to shortcuts (editing data files, searching for solution strings, etc.). Note: tests and solution are NOT visible to the agent, so static tests are acceptable
   - **Structured Data Schema**: If the agent produces structured output (APIs, JSON, etc.), the exact schema is documented in task.yaml or a separate file
   - **Pinned Dependencies**: External dependencies (Docker images, pip packages) have pinned versions. Common apt packages (curl, vim) don't require pinning. **All Python dependencies must be pinned**
   - **Typos**: No typos in the task description, file names, variable names, or code
   - **Tests or Solution in Image**: The `tests/` folder and solution file are NOT copied into the Docker image (the harness copies tests in after the agent runs)
   - **Test Dependencies in Image**: Test dependencies are NOT installed during the image build (they belong in `run-tests.sh`)
   - **Hardcoded Solution**: The solution computes the answer through proper steps (data processing, code execution) rather than directly echoing/catting the final answer. Using echo/cat to write source files that are then executed is acceptable
3. **Assess task quality** - Comment on whether the task is:
   - Realistic (reflects real-world scenarios)
   - Interesting (engaging, non-trivial)
   - Non-adversarial (fair, not designed to trick the agent)
4. **Reference repository conventions** - Use CLAUDE.md for style guidance when available
5. **Be constructive** - Provide helpful, actionable feedback with specific suggestions for improvement
Format the review using this template:

```markdown
**Summary:** [Concise description of task, solution approach, and testing strategy]
**Quality:** [Assessment of realism, interest level, and fairness]
**Suggestions:** [Specific suggestions for improvement, if any]
```