Reviews pull requests that implement Terminal-Bench benchmark tasks (files under the `tasks/**` directory), checking them for completeness, anti-cheating measures, and overall quality. Provides a structured code review against the criteria and best practices below.
When reviewing a PR that creates or modifies files in the `tasks/**` directory:
1. **Analyze the task structure**
- Read the task description (usually in `task.yaml` or similar)
- Examine the test files in the `tests/` directory
- Review the solution implementation
- Check the Dockerfile and setup scripts
2. **Provide a concise summary**
- Summarize what the task asks the agent to do
- Explain how the solution implements the required behavior
- Describe how the tests verify task completion
3. **Evaluate against quality criteria**
Review each of the following and report PASS/FAIL with a brief explanation:
**1. Behavior in Task Description**
- Check: All behavior tested in test cases is described in the task description
- Look for: Missing requirements, unclear instructions
**2. Behavior in Tests**
- Check: All behavior from task description has corresponding test coverage
- Look for: Untested requirements, gaps in validation
**3. Informative Test Docstrings**
- Check: Test cases have clear docstrings explaining what they verify
- Look for: Missing or vague test documentation
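A docstring that passes this check states the behavior being verified, not just the test's mechanics. A minimal pytest sketch (the file path and column name are hypothetical):
```python
import csv

def test_report_flags_invalid_rows():
    """Report lists every input row that failed validation, as required
    by the error-handling section of task.yaml."""
    # Hypothetical output location; real tasks define their own paths.
    with open("/app/report.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    assert any(row["status"] == "error" for row in rows), \
        "expected at least one row flagged as an error"
```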
**4. Anti-Cheating Measures**
- Check: It is hard for the agent to cheat (e.g., by editing data files or searching for solution strings)
- Note: Tests and solution are NOT visible to the agent
- Don't worry about: Static/non-randomized tests (the agent can't see them)
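Because the tests run only after the agent finishes and are never visible to it, the main cheat vector to review for is the agent editing the task's input data. One countermeasure a task can take is a test that pins the input's checksum; a hedged sketch, where the path and digest are hypothetical placeholders:
```python
import hashlib

def test_input_data_unmodified():
    """Input data file is byte-identical to the version shipped with the
    task, guarding against answers obtained by editing the inputs."""
    with open("/app/data/input.csv", "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    # Hypothetical digest; a real task would record the true value.
    assert digest == "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b"
```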
**5. Structured Data Schema**
- Check: If the task produces structured data (API responses, JSON files, etc.), the schema is fully specified
- Look for: Schema in `task.yaml` or separate specification file
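When the schema is specified, the tests can enforce it directly. A sketch using the `jsonschema` package, with a made-up schema and output path standing in for the task's real ones:
```python
import json
import jsonschema

def test_output_matches_schema():
    """Output JSON conforms to the schema specified in task.yaml."""
    # Hypothetical schema; a real test would mirror the task's spec.
    schema = {
        "type": "object",
        "required": ["name", "count"],
        "properties": {
            "name": {"type": "string"},
            "count": {"type": "integer"},
        },
    }
    with open("/app/output.json") as f:
        jsonschema.validate(json.load(f), schema)
```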
**6. Pinned Dependencies**
- Check: External dependencies have pinned versions (Docker images, pip packages)
- Exception: Common apt packages (curl, vim) don't need pinning
- Requirement: ALL Python dependencies must be pinned
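A quick mechanical aid when reviewing: flag any requirement that lacks an exact `==` pin. A rough reviewer-side sketch (it assumes a `requirements.txt` at the task root and errs on the side of flagging):
```python
import re
from pathlib import Path

# Report requirements lines that are not pinned to an exact version.
for line in Path("requirements.txt").read_text().splitlines():
    req = line.split("#")[0].strip()  # drop comments and whitespace
    if req and not re.search(r"==[\w.]+", req):
        print(f"unpinned dependency: {req}")
```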
**7. Typos**
- Check: File names, variable names, and instructions for typos
- Pay special attention to: Names that appear in multiple places (inconsistencies there are easy to miss)
**8. Tests or Solution in Image**
- Check: The `/tests` folder and solution file are NOT copied into the Docker image
- Note: The test harness copies tests in automatically after the agent runs
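One way to spot violations quickly is to scan the Dockerfile for `COPY`/`ADD` lines that pull in tests or the solution. A rough sketch (paths assumed; anything it flags still deserves a manual look):
```python
from pathlib import Path

# Flag Dockerfile lines that appear to bake tests or the solution into the image.
for n, line in enumerate(Path("Dockerfile").read_text().splitlines(), start=1):
    stripped = line.strip()
    if stripped.upper().startswith(("COPY", "ADD")) and \
            any(word in stripped for word in ("tests", "solution")):
        print(f"Dockerfile:{n}: suspicious line: {stripped}")
```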
**9. Test Dependencies in Image**
- Check: Test dependencies are NOT installed during the image build
- Should be: Installed in the `run-tests.sh` script instead
**10. Hardcoded Solution**
- PASS: Solution demonstrates a computation sequence (data processing, running code) that derives the answer
- FAIL: Solution simply prints/writes the final answer without performing the computation
- Acceptable: Using echo/cat to write source files/scripts that are then executed
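To make the distinction concrete, here is a hypothetical contrast in Python (file names invented for illustration):
```python
# PASS: the answer is derived from the task's input data at run time.
with open("/app/data/numbers.txt") as f:
    total = sum(int(line) for line in f)
with open("/app/answer.txt", "w") as out:
    out.write(str(total))

# FAIL: the known final answer is written with no computation at all.
# with open("/app/answer.txt", "w") as out:
#     out.write("4711")
```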
4. **Assess task quality**
- Comment on whether the task is realistic, interesting, and non-adversarial
- Note any concerns about task design or agent experience
5. **Format response**
- Use a clear numbered list matching the criteria above
- Be constructive and specific
- Reference CLAUDE.md conventions where relevant
- Provide actionable feedback for any failures
```markdown
[Brief description of what the task asks and how it's solved/tested]
1. ✅ **Behavior in Task Description** - PASS
All tested behaviors are documented in task.yaml
2. ❌ **Behavior in Tests** - FAIL
Task description mentions error handling but no test verifies it
[... continue for all 10 criteria ...]
The task is realistic and interesting. [Additional comments on quality/design]
```