cc-plugin-eval

Overview

A comprehensive 4-stage evaluation framework for testing Claude Code plugin components. Validates that skills, agents, commands, hooks, and MCP servers trigger correctly through programmatic detection and LLM judgment.

What This Skill Does

This skill provides a systematic approach to evaluate Claude Code plugins by:

1. Analyzing plugin structure and extracting trigger patterns

2. Generating test scenarios (LLM-generated + deterministic)

3. Executing tests via Claude Agent SDK with tool capture

4. Evaluating results with programmatic detection and LLM judgment fallback

Prerequisites

- Node.js >= 20.0.0
- `ANTHROPIC_API_KEY` environment variable set
- Target plugin directory path

Instructions

1. Initial Setup

Before running evaluations, ensure the project is properly configured:

```bash
# Install dependencies
npm install

# Create .env file with API key
echo "ANTHROPIC_API_KEY=your_key_here" > .env

# Verify project health
npm run build && npm run lint && npm run typecheck && npm run format:check && npm run knip && npm test
```

2. Running Evaluations

**Full evaluation run:**

```bash

cc-plugin-eval run -p ./path/to/plugin

```

**Dry run (cost estimation):**

```bash

cc-plugin-eval run -p ./path/to/plugin --dry-run

```

**Resume interrupted run:**

```bash

cc-plugin-eval resume -r <run-id>

```

3. Understanding the Pipeline

The framework executes four sequential stages:

**Stage 1 - Analysis** (`src/stages/1-analysis/index.ts`)

- Parses plugin structure (SKILL.md, AGENT.md, COMMAND.md, hooks, MCP config)
- Extracts trigger patterns and component metadata
- Validates component definitions

**Stage 2 - Generation** (`src/stages/2-generation/index.ts`)

- Uses `@anthropic-ai/sdk` to generate test scenarios via LLM
- Combines LLM-generated scenarios with deterministic test cases
- Creates scenarios that should trigger each component
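The combination step could look roughly like the sketch below. The `Scenario` shape, its `source` field, and `mergeScenarios` are hypothetical names chosen for illustration, not the framework's actual types. The sketch also assumes that deterministic cases take precedence when both sources produce the same prompt for a component.

```typescript
// Hypothetical scenario shape; the real framework's types may differ.
interface Scenario {
  component: string; // name of the skill, agent, command, hook, or MCP server
  prompt: string; // user message expected to trigger the component
  source: "llm" | "deterministic";
}

// Merge both sources and drop duplicates that target the same component
// with the same prompt (ignoring case and surrounding whitespace).
// Deterministic cases are visited first, so they win over LLM duplicates.
function mergeScenarios(deterministic: Scenario[], llm: Scenario[]): Scenario[] {
  const seen = new Set<string>();
  const merged: Scenario[] = [];
  for (const s of [...deterministic, ...llm]) {
    const key = `${s.component}::${s.prompt.trim().toLowerCase()}`;
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(s);
    }
  }
  return merged;
}

// Illustrative data only: the first LLM scenario duplicates the deterministic one.
const exampleMerged = mergeScenarios(
  [{ component: "pdf-skill", prompt: "Extract text from report.pdf", source: "deterministic" }],
  [
    { component: "pdf-skill", prompt: "extract text from report.pdf ", source: "llm" },
    { component: "pdf-skill", prompt: "Can you pull the tables out of this PDF?", source: "llm" },
  ],
);
```

Here `exampleMerged` keeps two scenarios: the deterministic one and the second LLM-generated one.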

**Stage 3 - Execution** (`src/stages/3-execution/index.ts`)

- Runs scenarios using `@anthropic-ai/claude-agent-sdk`
- Captures all tool calls and agent interactions
- Uses session management (default: `batched_by_component`)
- Reuses sessions per component with `/clear` between scenarios
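To make the `batched_by_component` idea concrete, the sketch below turns a flat scenario list into an ordered plan: one session per component, with a `/clear` step between consecutive scenarios. It does not call the Agent SDK. `PlannedScenario`, `SessionStep`, and `planBatchedByComponent` are hypothetical names, and only the strategy name and the `/clear` command come from the description above.

```typescript
// Hypothetical types for illustration; not the framework's own.
interface PlannedScenario {
  component: string;
  prompt: string;
}

type SessionStep =
  | { kind: "start_session"; component: string }
  | { kind: "send"; prompt: string }
  | { kind: "clear" } // stands for sending `/clear` in the reused session
  | { kind: "end_session"; component: string };

function planBatchedByComponent(scenarios: PlannedScenario[]): SessionStep[] {
  // Group prompts by component, preserving first-seen component order.
  const groups = new Map<string, string[]>();
  for (const s of scenarios) {
    const prompts = groups.get(s.component) ?? [];
    prompts.push(s.prompt);
    groups.set(s.component, prompts);
  }

  const steps: SessionStep[] = [];
  groups.forEach((prompts, component) => {
    steps.push({ kind: "start_session", component });
    prompts.forEach((prompt, i) => {
      if (i > 0) steps.push({ kind: "clear" }); // reset context between scenarios
      steps.push({ kind: "send", prompt });
    });
    steps.push({ kind: "end_session", component });
  });
  return steps;
}

// Three scenarios across two components produce two sessions.
const examplePlan = planBatchedByComponent([
  { component: "skill-a", prompt: "first" },
  { component: "agent-b", prompt: "second" },
  { component: "skill-a", prompt: "third" },
]);
```

In this example the two `skill-a` scenarios share one session with a single `clear` step between them, while `agent-b` gets its own session.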

**Stage 4 - Evaluation** (`src/stages/4-evaluation/index.ts`)

- Primary: programmatic detection (100% accuracy)
  - Skills: check whether the `Skill` tool was called with a matching name
  - Agents: verify that the `Task` tool was called with the correct `subagent_type`
  - Commands: detect a `Bash` tool call with a matching command
  - Hooks: inspect hook execution in the logs
  - MCP: validate MCP tool invocations
- Fallback: LLM judge for quality assessment when programmatic detection is inconclusive
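The programmatic checks for skills, agents, and commands can be sketched as a scan over the captured tool calls. This is a minimal illustration, not the framework's detector code. `ToolCall`, `Expectation`, `matches`, and `detect` are hypothetical names, and the `skill` and `command` input keys are assumptions about the tool payloads. The sketch also assumes that "no matching call found" counts as inconclusive and is handed to the LLM judge.

```typescript
// Hypothetical shape of a captured tool call; field names are assumptions.
interface ToolCall {
  name: string; // e.g. "Skill", "Task", "Bash"
  input: Record<string, unknown>;
}

type Expectation =
  | { type: "skill"; name: string }
  | { type: "agent"; subagentType: string }
  | { type: "command"; command: string };

type Verdict = "triggered" | "needs_llm_judge";

// Check a single captured call against the expected component.
function matches(call: ToolCall, expected: Expectation): boolean {
  if (expected.type === "skill") {
    // Assumed input key: `skill`.
    return call.name === "Skill" && call.input["skill"] === expected.name;
  }
  if (expected.type === "agent") {
    return call.name === "Task" && call.input["subagent_type"] === expected.subagentType;
  }
  // Remaining case: command. Assumed input key: `command`.
  const cmd = call.input["command"];
  return call.name === "Bash" && typeof cmd === "string" && cmd.includes(expected.command);
}

// Programmatic detection first; defer to the LLM judge when nothing matches.
function detect(calls: ToolCall[], expected: Expectation): Verdict {
  return calls.some((call) => matches(call, expected)) ? "triggered" : "needs_llm_judge";
}

// Illustrative captured calls from one scenario.
const exampleCalls: ToolCall[] = [
  { name: "Read", input: { file_path: "README.md" } },
  { name: "Task", input: { subagent_type: "code-reviewer", prompt: "Review the diff" } },
];
const agentVerdict = detect(exampleCalls, { type: "agent", subagentType: "code-reviewer" });
const skillVerdict = detect(exampleCalls, { type: "skill", name: "pdf-skill" });
```

With these sample calls, `agentVerdict` is `"triggered"` and `skillVerdict` is `"needs_llm_judge"`.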

4. MCP Tool Cost Optimization (CRITICAL)

Always prefer free tools over paid tools:

**Search operations (prefer in order):**

1. Serena `find_symbol` (FREE) - when you know the symbol name

2. Serena `find_referencing_symbols` (FREE) - find all symbol usages

3. Serena `get_symbols_overview` (FREE) - understand file structure

4. `rg "pattern"` (FREE) - regex/text pattern matching

5. Morph `warpgrep_codebase_search` (PAID ~$0.8-1.2/1M tokens) - semantic search, last resort only

**Edit operations (prefer in order):**

1. Serena `replace_symbol_body` (FREE) - replace entire methods/functions

2. Serena `insert_after_symbol` (FREE) - add new code after a symbol

3. Morph `edit_file` (PAID) - partial edits, non-LSP files

4. Built-in `Edit` (AVOID) - fallback only when no other option
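The preference order amounts to "take the first tool in the list that fits the task". The sketch below encodes the search list that way. The tool names and free/paid labels are taken from the list above, while `ToolOption` and `pickTool` are hypothetical helpers written only for illustration.

```typescript
interface ToolOption {
  name: string;
  paid: boolean;
}

// Search tools in the stated preference order (free tools first).
const searchPreference: ToolOption[] = [
  { name: "find_symbol", paid: false },
  { name: "find_referencing_symbols", paid: false },
  { name: "get_symbols_overview", paid: false },
  { name: "rg", paid: false },
  { name: "warpgrep_codebase_search", paid: true },
];

// Return the first tool in preference order that can handle the task,
// or undefined if none can.
function pickTool(
  preference: ToolOption[],
  canHandle: (toolName: string) => boolean,
): ToolOption | undefined {
  return preference.find((t) => canHandle(t.name));
}

// Example: a text-pattern search where no symbol name is known, so only
// the text-based tools apply. The free `rg` comes before the paid tool.
const textSearchTools = new Set(["rg", "warpgrep_codebase_search"]);
const chosen = pickTool(searchPreference, (name) => textSearchTools.has(name));
```

Here `chosen` is the `rg` entry. The same pattern would apply to the edit list.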

5. Working with Results

Evaluation results include:

- Detection method used (programmatic vs LLM judge)
- Confidence score
- Trigger evidence (tool calls, logs)
- Session reuse efficiency metrics
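When post-processing results, a small typed summary can help show how often the LLM judge was needed. The `EvalResult` shape below is a hypothetical reading of the fields listed above, not the documented schema of the output files, and the 0 to 1 confidence range is an assumption.

```typescript
// Hypothetical result record based on the fields listed above.
interface EvalResult {
  component: string;
  triggered: boolean;
  method: "programmatic" | "llm_judge";
  confidence: number; // assumed range: 0 to 1
  evidence: string[]; // e.g. tool-call names or log excerpts
}

// Compute the overall trigger rate and the share of results that
// needed the LLM judge fallback.
function summarize(results: EvalResult[]) {
  const total = results.length;
  const triggered = results.filter((r) => r.triggered).length;
  const judged = results.filter((r) => r.method === "llm_judge").length;
  return {
    total,
    triggerRate: total === 0 ? 0 : triggered / total,
    llmJudgeShare: total === 0 ? 0 : judged / total,
  };
}

// Illustrative data only.
const exampleSummary = summarize([
  { component: "skill-a", triggered: true, method: "programmatic", confidence: 1, evidence: ["Skill"] },
  { component: "skill-a", triggered: true, method: "programmatic", confidence: 1, evidence: ["Skill"] },
  { component: "agent-b", triggered: true, method: "llm_judge", confidence: 0.7, evidence: [] },
  { component: "agent-b", triggered: false, method: "programmatic", confidence: 1, evidence: [] },
]);
```

For this sample, `exampleSummary.triggerRate` is `0.75` and `exampleSummary.llmJudgeShare` is `0.25`.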

6. Extending the Framework

**Adding new component types:**

- Update state migration in `src/state/operations.ts`
- Add detection logic in Stage 4 evaluation
- Implement generation templates in Stage 2
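A state migration for a new component type usually means filling in a default for data that older saved runs do not have. The sketch below shows that pattern under invented state shapes. `StateV1`, `StateV2`, and the `hooks` field are hypothetical, and the actual definitions in `src/state/operations.ts` may look quite different.

```typescript
// Hypothetical versioned state shapes, for illustration only.
interface StateV1 {
  version: 1;
  components: { skills: string[]; agents: string[] };
}

interface StateV2 {
  version: 2;
  components: { skills: string[]; agents: string[]; hooks: string[] };
}

// Upgrade any known state version to the latest one.
function migrateState(state: StateV1 | StateV2): StateV2 {
  if (state.version === 2) return state;
  // Older saved runs predate the new component type: default it to empty.
  return { version: 2, components: { ...state.components, hooks: [] } };
}

const oldState: StateV1 = { version: 1, components: { skills: ["a"], agents: [] } };
const migrated = migrateState(oldState);
```

After migration, `migrated.version` is `2`, the existing `skills` entry is preserved, and `hooks` is an empty array.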

**Customizing session management:**

- Modify session strategy in Stage 3 execution
- Options: `batched_by_component`, `isolated`, `shared`

**Adding detection patterns:**

- Extend programmatic detectors in `src/stages/4-evaluation/detectors/`
- Add LLM judge prompts in `src/stages/4-evaluation/judge/`

7. Documentation References

- **Task guides**: `docs/*.md` - API usage, component guides, CI/CD integration
- **Serena memories**: `.serena/memories/` - architecture decisions, testing patterns, code style
- **State migration**: `src/state/operations.ts` - state versioning for new component types

Key Patterns

- **Detection hierarchy**: Programmatic detection (100% accuracy) → LLM judge (quality assessment fallback)
- **Session efficiency**: Default `batched_by_component` reuses sessions per component with `/clear` between scenarios
- **Cost optimization**: Always use free Serena/rg tools before paid Morph tools
- **State versioning**: Update `migrateState()` when adding new component types

Example Workflow

```bash
# 1. Estimate costs
cc-plugin-eval run -p ./my-plugin --dry-run

# 2. Run full evaluation
cc-plugin-eval run -p ./my-plugin

# 3. Review results in output directory
cat ./eval-results/<run-id>/summary.json

# 4. If interrupted, resume
cc-plugin-eval resume -r <run-id>
```

Constraints

- Requires valid `ANTHROPIC_API_KEY` with sufficient credits
- Node.js version must be >= 20.0.0
- Target plugin must follow Claude Code plugin conventions (SKILL.md, AGENT.md, etc.)
- LLM judge fallback incurs API costs; use programmatic detection when possible
- Session reuse only works with compatible component groupings
