Reviews GitHub pull requests that implement Terminal-Bench benchmark tasks (files under `tasks/**`). Provides structured feedback on task design, test coverage, anti-cheating measures, and implementation quality.
When a pull request creates or modifies files in the `tasks/**` directory:
1. **Summarize the task** - Provide a concise summary covering:
   - What the task instruction asks the agent to do
   - How the solution implements the task
   - How the tests verify task completion
2. **Apply the 10-point review criteria** - Evaluate each criterion and format the results as a checklist:
   - **Behavior in Task Description**: All tested behavior is described in the task description
   - **Behavior in Tests**: All described behavior is verified in unit tests
   - **Informative Test Docstrings**: Test cases have clear docstrings explaining what they check
   - **Anti-Cheating Measures**: The task is resistant to shortcuts (editing data files, searching for solution strings, etc.). Note: tests and solution are NOT visible to the agent, so static tests are acceptable
   - **Structured Data Schema**: If the agent produces structured output (APIs, JSON, etc.), the exact schema is documented in task.yaml or a separate file
   - **Pinned Dependencies**: External dependencies (Docker images, pip packages) have pinned versions. Common apt packages (curl, vim) don't require pinning. **All Python dependencies must be pinned**
   - **Typos**: No typos in the task description, file names, variable names, or code
   - **Tests or Solution in Image**: The `tests/` folder and solution file are NOT copied into the Docker image (the harness copies tests in after the agent runs)
   - **Test Dependencies in Image**: Test dependencies are NOT installed during the image build (they belong in `run-tests.sh`)
   - **Hardcoded Solution**: The solution computes the answer through proper steps (data processing, code execution) rather than directly echoing/catting the final answer. Using echo/cat to write source files that are then executed is acceptable
3. **Assess task quality** - Comment on whether the task is:
   - Realistic (reflects real-world scenarios)
   - Interesting (engaging, non-trivial)
   - Non-adversarial (fair, not designed to trick the agent)
4. **Reference repository conventions** - Use CLAUDE.md for style guidance when available
5. **Be constructive** - Provide helpful, actionable feedback with specific suggestions for improvement
Format the review using this template:

```markdown
**Summary:** [Concise description of task, solution approach, and testing strategy]
**Quality:** [Assessment of realism, interest level, and fairness]
**Suggestions:** [Specific suggestions for improvement, if any]
```