Reviews pull requests that implement Terminal-Bench benchmark tasks, checking task quality, test coverage, anti-cheating measures, and implementation best practices.
Analyzes Terminal-Bench task pull requests (tasks in the `tasks/**` directory) against 10 quality criteria, checking task descriptions, test coverage, security measures, dependency management, and solution validity. Provides structured feedback to maintain benchmark integrity.
When a pull request modifies or creates files in the `tasks/**` directory:
1. **Read the entire task submission** including:
- Task description file (typically `task.yaml`)
- Implementation files in the task directory
- Test files in the `tests/` subdirectory
- Solution file(s)
- Any Docker/container configuration
2. **Analyze against the 10 criteria**:
**1. Behavior in Task Description**
- Compare task description against test cases
- Flag any tested behavior not documented in the description
- Ensure completeness of task specification
**2. Behavior in Tests**
- Verify all described behavior has corresponding test coverage
- Check for gaps between description and test implementation
- Ensure no undocumented requirements exist
**3. Informative Test Docstrings**
- Review test case documentation
- Ensure docstrings clearly describe what behavior each test validates
- Flag vague or missing docstrings
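A sketch of the distinction a reviewer is looking for (both tests and the scraper behavior below are invented for illustration):

```python
# Vague docstring -- a reviewer should flag this:
def test_output():
    """Check output."""

# Informative docstring -- names the exact behavior being validated:
def test_scraper_retries_on_http_500():
    """The scraper retries a failed request up to 3 times before
    recording the URL as unreachable."""
```

The second style lets a reviewer verify criterion 1 and 2 coverage by reading docstrings alone, without tracing the test body.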
**4. Anti-Cheating Measures**
- Evaluate difficulty of cheating (data file manipulation, hardcoded solutions, test visibility)
- Remember: tests and solutions are NOT visible to agents
- Don't penalize static tests (agents can't see them)
- Flag if agent could trivially bypass the task
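One common pattern that makes cheating harder (a sketch, not taken from any specific Terminal-Bench task) is to recompute the expected answer inside the test from the raw input, rather than comparing against a stored value that an agent could have edited:

```python
import hashlib

def expected_digest(data: bytes) -> str:
    """Recompute the expected answer at test time from the raw input,
    so tampering with a stored 'expected.txt' file cannot pass."""
    return hashlib.sha256(data).hexdigest()

def check_submission(submitted: str, data: bytes) -> bool:
    """Accept the submission only if it matches the recomputed digest."""
    return submitted == expected_digest(data)
```

Because the tests are injected after the agent runs, the agent never sees this logic and cannot special-case it.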
**5. Structured Data Schema**
- If task produces APIs or structured output, verify schema is documented
- Check for schema definition in `task.yaml` or separate schema file
- Ensure data format expectations are explicit
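For instance, a task whose output is structured JSON might document the schema directly in `task.yaml` (the field names and schema key below are illustrative, not a Terminal-Bench requirement):

```yaml
# Hypothetical schema block inside task.yaml
output_schema:
  type: object
  required: [url, status_code, retries]
  properties:
    url: {type: string}
    status_code: {type: integer}
    retries: {type: integer, minimum: 0}
```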
**6. Pinned Dependencies**
- Verify all external dependencies have pinned versions
- Check Docker images, pip packages, npm packages, etc.
- Common apt packages (curl, vim, etc.) don't require pinning
- ALL Python dependencies MUST be pinned
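Pinning typically looks like the following (the image tag and package versions here are invented for illustration):

```dockerfile
# Pinned base image, not "python:latest"
FROM python:3.11.9-slim

# Every Python dependency gets an exact version
RUN pip install --no-cache-dir requests==2.31.0 beautifulsoup4==4.12.3

# Common apt packages may stay unpinned, per the criterion above
RUN apt-get update && apt-get install -y curl vim
```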
**7. Typos**
- Carefully review file names, variable names, command syntax
- Pay special attention to names that are easy to misread
- Check consistency across files
**8. Tests or Solution in Image**
- Verify `/tests` folder is NOT copied to container image
- Verify solution file is NOT copied to container image
- Confirm that the harness copies tests into the container after the agent's run completes
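A hypothetical Dockerfile sketch of what criterion 8 allows and forbids (paths are illustrative):

```dockerfile
# Build ONLY the task environment
FROM python:3.11.9-slim
COPY data/ /app/data/

# A reviewer should flag lines like these if present:
#   COPY tests/ /tests/       # criterion 8 violation: tests baked into image
#   COPY solution.sh /app/    # criterion 8 violation: solution baked into image
```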
**9. Test Dependencies in Image**
- Ensure test dependencies are NOT installed during image build
- Verify they're installed in `run-tests.sh` script instead
- Check Dockerfile doesn't include test-only packages
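The intended split for criterion 9 can be sketched as a `run-tests.sh` that installs its own dependencies at test time (package name and version here are illustrative):

```bash
#!/bin/bash
# Hypothetical run-tests.sh -- the harness executes this after the agent
# finishes, so test-only dependencies belong here, not in the Dockerfile.
pip install pytest==8.2.0
pytest /tests
```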
**10. Hardcoded Solution**
- Verify solution demonstrates computational steps, not just output
- PASS: Solution runs data processing, executes code, derives answer
- FAIL: Solution uses `echo`/`cat` to directly print final answer
- ACCEPTABLE: Using `echo`/`cat` to write source files that are then executed
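The three cases in criterion 10, sketched with an invented toy task (compute the sum of 0 through 99, which is 4950):

```shell
# FAIL: prints the final answer directly with no computation
echo "4950" > answer_hardcoded.txt

# PASS: derives the same answer computationally
seq 0 99 | awk '{s+=$1} END {print s}' > answer_derived.txt

# ACCEPTABLE: using cat to write a source file that is then executed
cat > sum.sh <<'EOF'
seq 0 99 | awk '{s+=$1} END {print s}'
EOF
bash sum.sh > answer_from_script.txt
```

All three produce identical output, which is exactly why reviewers must read the solution's steps rather than diff its results.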
3. **Evaluate task quality**:
- Is the task realistic and interesting?
- Is it non-adversarial?
- Does it test meaningful capabilities?
4. **Generate review comment**:
- Start with a concise summary of the task, its implementation, and the testing approach
- List findings for each of the 10 criteria as a structured list
- Use clear PASS/FAIL/ISSUE labels
- Provide specific examples and line numbers when flagging issues
- Be constructive and helpful
- Reference repository's `CLAUDE.md` for style guidance
```markdown
[2-3 sentences describing what the task tests and how]
1. **Behavior in Task Description**: [PASS/ISSUE] - [Details]
2. **Behavior in Tests**: [PASS/ISSUE] - [Details]
3. **Informative Test Docstrings**: [PASS/ISSUE] - [Details]
4. **Anti-Cheating Measures**: [PASS/ISSUE] - [Details]
5. **Structured Data Schema**: [PASS/N/A/ISSUE] - [Details]
6. **Pinned Dependencies**: [PASS/ISSUE] - [Details]
7. **Typos**: [PASS/ISSUE] - [Details]
8. **Tests or Solution in Image**: [PASS/ISSUE] - [Details]
9. **Test Dependencies in Image**: [PASS/ISSUE] - [Details]
10. **Hardcoded Solution**: [PASS/ISSUE] - [Details]
[Comment on whether task is realistic, interesting, and non-adversarial]
[Constructive feedback for improvements]
```
**Scenario**: PR adds new task for testing agent's ability to debug a web scraper
**Review identifies**:
**Output**: Structured review with specific line references and actionable fixes