Reviews pull requests that implement Terminal-Bench benchmark tasks (files under the `tasks/**` directory), checking them for completeness, anti-cheating measures, and overall quality. Provides a structured code review against the criteria and best practices below.
When reviewing a PR that creates or modifies files in the `tasks/**` directory:
1. **Analyze the task structure**
- Read the task description (usually in `task.yaml` or similar)
- Examine the test files in the `tests/` directory
- Review the solution implementation
- Check the Dockerfile and setup scripts
2. **Provide a concise summary**
- Summarize what the task asks the agent to do
- Explain how the solution implements the required behavior
- Describe how the tests verify task completion
3. **Evaluate against quality criteria**
Review each of the following and report PASS/FAIL with a brief explanation:
**1. Behavior in Task Description**
- Check: All behavior tested in test cases is described in the task description
- Look for: Missing requirements, unclear instructions
**2. Behavior in Tests**
- Check: All behavior from task description has corresponding test coverage
- Look for: Untested requirements, gaps in validation
**3. Informative Test Docstrings**
- Check: Test cases have clear docstrings explaining what they verify
- Look for: Missing or vague test documentation
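A docstring that passes this check states the behavior being verified, not just the test's mechanics. A minimal pytest sketch (the file path and column name are hypothetical):
```python
import csv

def test_report_flags_invalid_rows():
    """Report lists every input row that failed validation, as required
    by the error-handling section of task.yaml."""
    # Hypothetical output location; real tasks define their own paths.
    with open("/app/report.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    assert any(row["status"] == "error" for row in rows), \
        "expected at least one row flagged as an error"
```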
**4. Anti-Cheating Measures**
- Check: It is hard for the agent to cheat (e.g., by editing data files or searching for solution strings)
- Note: Tests and solution are NOT visible to the agent
- Don't worry about: Static/non-randomized tests (the agent can't see them)
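Because the tests run only after the agent finishes and are never visible to it, the main cheat vector to review for is the agent editing the task's input data. One countermeasure a task can take is a test that pins the input's checksum; a hedged sketch, where the path and digest are hypothetical placeholders:
```python
import hashlib

def test_input_data_unmodified():
    """Input data file is byte-identical to the version shipped with the
    task, guarding against answers obtained by editing the inputs."""
    with open("/app/data/input.csv", "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    # Hypothetical digest; a real task would record the true value.
    assert digest == "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b"
```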
**5. Structured Data Schema**
- Check: If the task produces structured data (API responses, JSON files, etc.), the schema is fully specified
- Look for: Schema in `task.yaml` or separate specification file
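When the schema is specified, the tests can enforce it directly. A sketch using the `jsonschema` package, with a made-up schema and output path standing in for the task's real ones:
```python
import json
import jsonschema

def test_output_matches_schema():
    """Output JSON conforms to the schema specified in task.yaml."""
    # Hypothetical schema; a real test would mirror the task's spec.
    schema = {
        "type": "object",
        "required": ["name", "count"],
        "properties": {
            "name": {"type": "string"},
            "count": {"type": "integer"},
        },
    }
    with open("/app/output.json") as f:
        jsonschema.validate(json.load(f), schema)
```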
**6. Pinned Dependencies**
- Check: External dependencies have pinned versions (Docker images, pip packages)
- Exception: Common apt packages (curl, vim) don't need pinning
- Requirement: ALL Python dependencies must be pinned
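A quick mechanical aid when reviewing: flag any requirement that lacks an exact `==` pin. A rough reviewer-side sketch (it assumes a `requirements.txt` at the task root and errs on the side of flagging):
```python
import re
from pathlib import Path

# Report requirements lines that are not pinned to an exact version.
for line in Path("requirements.txt").read_text().splitlines():
    req = line.split("#")[0].strip()  # drop comments and whitespace
    if req and not re.search(r"==[\w.]+", req):
        print(f"unpinned dependency: {req}")
```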
**7. Typos**
- Check: File names, variable names, and instructions for typos
- Pay special attention to: Names that appear in multiple places (inconsistencies there are easy to miss)
**8. Tests or Solution in Image**
- Check: The `/tests` folder and solution file are NOT copied into the Docker image
- Note: The test harness copies tests in automatically after the agent runs
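One way to spot violations quickly is to scan the Dockerfile for `COPY`/`ADD` lines that pull in tests or the solution. A rough sketch (paths assumed; anything it flags still deserves a manual look):
```python
from pathlib import Path

# Flag Dockerfile lines that appear to bake tests or the solution into the image.
for n, line in enumerate(Path("Dockerfile").read_text().splitlines(), start=1):
    stripped = line.strip()
    if stripped.upper().startswith(("COPY", "ADD")) and \
            any(word in stripped for word in ("tests", "solution")):
        print(f"Dockerfile:{n}: suspicious line: {stripped}")
```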
**9. Test Dependencies in Image**
- Check: Test dependencies are NOT installed during the image build
- Should be: Installed in the `run-tests.sh` script instead
**10. Hardcoded Solution**
- PASS: Solution demonstrates a computation sequence (data processing, running code) that derives the answer
- FAIL: Solution simply prints/writes the final answer without performing the computation
- Acceptable: Using echo/cat to write source files/scripts that are then executed
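To make the distinction concrete, here is a hypothetical contrast in Python (file names invented for illustration):
```python
# PASS: the answer is derived from the task's input data at run time.
with open("/app/data/numbers.txt") as f:
    total = sum(int(line) for line in f)
with open("/app/answer.txt", "w") as out:
    out.write(str(total))

# FAIL: the known final answer is written with no computation at all.
# with open("/app/answer.txt", "w") as out:
#     out.write("4711")
```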
4. **Assess task quality**
- Comment on whether the task is realistic, interesting, and non-adversarial
- Note any concerns about task design or agent experience
5. **Format response**
- Use a clear numbered list matching the criteria above
- Be constructive and specific
- Reference CLAUDE.md conventions where relevant
- Provide actionable feedback for any failures
```markdown
[Brief description of what the task asks and how it's solved/tested]
1. ✅ **Behavior in Task Description** - PASS
All tested behaviors are documented in task.yaml
2. ❌ **Behavior in Tests** - FAIL
Task description mentions error handling but no test verifies it
[... continue for all 10 criteria ...]
The task is realistic and interesting. [Additional comments on quality/design]
```