Reviews Terminal-Bench task PRs against 10 quality criteria including test coverage, anti-cheating measures, dependency pinning, and solution validity
Reviews pull requests that implement Terminal-Bench benchmark tasks, ensuring they meet quality standards for agent evaluation.
When a PR creates files in the `tasks/**` directory, this skill automatically reviews the task implementation against 10 validation criteria, checking test coverage, anti-cheating measures, dependency management, and solution validity.
When reviewing a Terminal-Bench task PR:
1. **Identify the PR scope**: Confirm the PR adds/modifies files in `tasks/**` directory
2. **Generate a concise summary** covering:
- What the task instruction asks the agent to do
- How the solution implements those requirements
- How the tests verify task completion
3. **Evaluate against all 10 criteria**:
**1. Behavior in Task Description**
- Check: Is all behavior tested in test cases also described in the task description?
- Flag: Missing descriptions for tested behaviors
**2. Behavior in Tests**
- Check: Is all behavior from task description validated by unit tests?
- Flag: Untested requirements
**3. Informative Test Docstrings**
- Check: Do test functions have clear docstrings explaining what behavior they verify?
- Flag: Missing or vague docstrings
**4. Anti-Cheating Measures**
- Check: Can the agent cheat by editing data files, searching for solution strings, or exploiting visible artifacts?
- Note: Tests and solutions are hidden from agents; static tests are acceptable
- Flag: Exploitable shortcuts (data file edits, solution string searches)
**5. Structured Data Schema**
- Check: If task produces APIs or structured output, is the exact schema documented in `task.yaml` or linked files?
- Flag: Missing or incomplete schema definitions
**6. Pinned Dependencies**
- Check: Are Docker images, pip packages, and other external dependencies version-pinned?
- Exclude: Common apt packages (curl, vim, etc.) don't require pinning
- Require: All Python dependencies must be pinned
- Flag: Unpinned versions
**7. Typos**
- Check: File names, variable names, instructions, and code comments for spelling errors
- Pay special attention: File paths and identifiers (these are easy to miss)
- Flag: Any typos found
**8. Tests or Solution in Image**
- Check: The `/tests` folder and solution files must NOT be copied into the Docker image
- Note: The harness copies `/tests` automatically after agent runs
- Flag: Tests or solution visible in image build
**9. Test Dependencies in Image**
- Check: Test dependencies should be installed in `run-tests.sh`, NOT during image build
- Flag: Test dependencies in Dockerfile
**10. Hardcoded Solution**
- PASS: Solution demonstrates computation steps (data processing, running scripts, command sequences)
- ACCEPTABLE: Using echo/cat to write source files that are then executed
- FAIL: Solution directly outputs final answer via echo/cat without performing agent-like computation
- Flag: Solutions that print answers without deriving them
4. **Format the response** as a structured list with:
- ✅ for criteria that pass
- ⚠️ for criteria that fail (with explanation)
- Brief notes for each item
5. **Final assessment**: Comment on whether the task is realistic, interesting, and non-adversarial
6. **Reference project conventions**: Consult the repository's `CLAUDE.md` for style guidance
7. **Tone**: Be constructive and helpful in feedback
```markdown
The task asks the agent to [brief description]. The solution implements this by [approach]. Tests verify completion by [testing strategy].
1. ✅ **Behavior in Task Description**: All tested behaviors are documented
2. ⚠️ **Behavior in Tests**: Missing test for error handling mentioned in task.yaml
3. ✅ **Informative Test Docstrings**: All test functions have clear docstrings
4. ✅ **Anti-Cheating Measures**: No exploitable shortcuts identified
5. ⚠️ **Structured Data Schema**: API response schema not fully specified
6. ✅ **Pinned Dependencies**: All dependencies version-pinned
7. ✅ **Typos**: No typos found
8. ✅ **Tests Not in Image**: /tests folder correctly excluded
9. ✅ **Test Dependencies**: Installed in run-tests.sh only
10. ✅ **Hardcoded Solution**: Solution demonstrates proper computation steps
The task is realistic and interesting for benchmarking agent capabilities in [domain]. Suggestion: [constructive feedback].
```
Leave a review
No reviews yet. Be the first to review this skill!
# Download SKILL.md from killerskills.ai/api/skills/terminal-bench-pr-review/raw