Research Agent Development Guide

You are helping develop a claim-level research agent with LLM-assisted evidence extraction, canonicalization, grouping, and adjudication. It supports live search (Google PSE, Brave, Tavily, Serper, PubMed E-utilities) and offline operation with local HTML/PDF sources plus a PubMed baseline subset index.

**Core pipeline**: Search → Fetch/Parse → Extract Propositions → Canonicalize → Group Claims → Adjudicate Evidence → Reduce & Report

Type Checking Requirements

The project uses strict type checking with `ty`. Before making any commits:

1. **Check core modules**:

```bash

uv run ty check src/research_agent/

```

2. **Check tests**:

```bash

uv run ty check tests/

```

3. **Requirements for all new code**:

- Include type annotations on all functions and variables

- All modules must pass `ty check` before committing

- Use `from typing import ...` for proper type hints

- Add `ty` to the dev section of `[project.optional-dependencies]` in `pyproject.toml`
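As a rough illustration of the annotation requirement above, the sketch below annotates every parameter, return value, and local variable. The `Proposition` record and `dedupe_propositions` helper are hypothetical names invented for this example; they are not taken from the project's actual modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """Hypothetical record type; the project's real model may differ."""

    text: str
    source_url: str


def dedupe_propositions(items: list[Proposition]) -> list[Proposition]:
    """Drop propositions whose text repeats (case-insensitive), keeping order."""
    seen: set[str] = set()
    unique: list[Proposition] = []
    for item in items:
        key: str = item.text.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Built-in generics such as `list[str]` and `set[str]` need no import on recent Python versions; `typing` is only required for constructs like `Literal` or `Any`.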

Testing Strategy

Use **fixture-based offline testing** to avoid live API dependencies:

1. **Run tests**:

```bash

# All tests

uv run pytest tests/

# Specific test with verbose output

uv run pytest tests/test_extract.py -v

```

2. **Test approach**:

- Use local HTML/PDF fixtures in `tests/fixtures/` and `evals/fixtures/`

- Mock external API calls (search providers, model endpoints)

- Focus on deterministic unit tests for core logic

- Use the eval harness (`evals/`) for probabilistic LLM behavior testing

3. **Eval harness** (separate from unit tests):

```bash

uv run research-agent eval --config agent.yaml --suite evals/suites/smoke.yaml --trials 10

```
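A minimal sketch of the fixture-plus-mock approach described above. The `extract_title` and `fetch_titles` functions are toy stand-ins invented for this example, not the project's parser or search broker. The fixture is written to a temporary directory only so the sketch is self-contained; a real test would read from `tests/fixtures/`.

```python
import tempfile
from pathlib import Path
from typing import Callable
from unittest.mock import Mock


def extract_title(html: str) -> str:
    """Toy stand-in for a parser under test: return the <title> text, or ''."""
    start: int = html.find("<title>")
    end: int = html.find("</title>")
    if start == -1 or end == -1:
        return ""
    return html[start + len("<title>"):end].strip()


def fetch_titles(
    query: str,
    search: Callable[[str], list[str]],
    read: Callable[[str], str],
) -> list[str]:
    """Run a search, read each hit, and extract its title."""
    return [extract_title(read(location)) for location in search(query)]


def test_fetch_titles_offline() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        fixture: Path = Path(tmp) / "coffee.html"
        fixture.write_text(
            "<html><title> Coffee and health </title></html>", encoding="utf-8"
        )
        # The mock replaces a live search provider, so no network call is made.
        search = Mock(return_value=[str(fixture)])
        titles = fetch_titles(
            "coffee", search, lambda p: Path(p).read_text(encoding="utf-8")
        )
    search.assert_called_once_with("coffee")
    assert titles == ["Coffee and health"]
```

Passing the search provider in as a callable is one way to keep the logic deterministic; patching a module attribute with `unittest.mock.patch` would work equally well.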

Documentation Maintenance

Keep these documents current and concise:

1. README.md (50-100 lines)

- Project overview and purpose

- Installation (uv)

- Basic usage examples (db-init, run, eval, llm-test)

- Configuration overview

- Key features (online/offline modes)

- Link to `docs/` for details

2. system_plan_architecture.md (200-400 lines)

- High-level mermaid diagram showing system components and data flow

- Brief descriptions of each major subsystem (Agent Orchestrator, Search Broker, Evidence Layer, etc.)

- Key design decisions and rationale

- Update when adding new subsystems or changing pipeline structure

3. docs/*.md

- `evals.md`: Eval harness schema and usage

- `mvp_runbook.md`: Offline operation guide

- Keep focused on specific workflows, not implementation details

**Update cadence**:

- README: Update on any user-facing CLI changes or new features

- system_plan_architecture.md: Update when adding/removing major components or changing pipeline flow

- docs/*.md: Update when changing eval schema or offline workflows

Development Workflow

Setup

```bash

uv venv

uv pip install -e .

uv pip install ty pytest

```

Before Committing (Required)

```bash

uv run ty check src/research_agent/

uv run pytest tests/

```

Running the Agent

```bash
# Initialize database
uv run research-agent db-init --config agent.yaml

# Run with live search
uv run research-agent run --config agent.yaml "Your query"

# Offline mode (no search APIs)
uv run research-agent run --config agent.yaml --input-dir offline_sources "Your query"

# Offline with verbose logging (see chunk processing, LLM calls)
uv run research-agent run -v --config agent.yaml --input-dir offline_sources "Your query"

# Run with PubMed baseline provider
uv run research-agent run --config agent.yaml --pubmed-baseline-db ./cache/pubmed_baseline.db "global burden of diabetes"
```

Additional Commands

```bash
# Build PubMed baseline index (offline subset)
python scripts/index_pubmed_baseline.py --input-dir offline_sources/pubmed_baseline --db-path ./cache/pubmed_baseline.db

# Test model connectivity
uv run research-agent llm-test --config agent.yaml --model local
uv run research-agent llm-test --config agent.yaml --model openrouter

# Watch run logs (in another terminal)
tail -f runs/<run_id>/run.log
```

Project-Specific Conventions

LLM Code (Pragmatic Approach)

**Prompts**: LLM prompts in `llm/` and `evidence/` are experimental. Iterate quickly based on eval results.

**Model routing**: Use the local model (FlashResearch-4B) by default; escalate to OpenRouter (Tongyi-30B) for heavy tasks.

**Prompt engineering**: Track prompt changes in git but don't over-engineer. Let eval results drive improvements.
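The routing rule above could be sketched roughly as follows. The guide does not define what counts as a "heavy" task, so the token threshold, the `force_heavy` flag, and the `choose_model` name are all assumptions made for illustration; only the `local` and `openrouter` labels come from the `--model` values used elsewhere in this guide.

```python
from typing import Literal

ModelName = Literal["local", "openrouter"]

# Hypothetical cutoff: the guide does not say what makes a task "heavy".
HEAVY_TASK_TOKEN_THRESHOLD: int = 8_000


def choose_model(estimated_tokens: int, force_heavy: bool = False) -> ModelName:
    """Default to the local model; escalate when the task looks heavy."""
    if force_heavy or estimated_tokens > HEAVY_TASK_TOKEN_THRESHOLD:
        return "openrouter"
    return "local"
```

Keeping the decision in one small function makes it easy to change the escalation rule when eval results suggest a different cutoff.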

Evidence Pipeline (Pragmatic Iteration)

**Core modules**: `evidence/extract.py`, `evidence/canonicalize.py`, `evidence/adjudicate.py`, `evidence/reduce.py`

**Iteration**: Move fast on pipeline logic improvements. Use offline fixtures to validate changes.

**Provenance**: W3C annotation selectors and WARC snapshots are best-effort. Don't block on perfect provenance.

Configuration

- Use `agent.yaml` for all runtime config (search providers, model endpoints, thinking extent)

- Read secrets from environment variables (GOOGLE_PSE_API_KEY, MODEL_API_BASE, OPENROUTER_API_KEY)

- Keep `agent.example.yaml` updated as a template
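One possible way to read the secrets named above from the environment is sketched here. The `load_secrets` helper is hypothetical, and treating all three variables as required is an assumption: an offline run may legitimately need none of them.

```python
import os
from collections.abc import Mapping

# Variable names come from this guide; requiring all of them is an assumption.
REQUIRED_SECRETS: tuple[str, ...] = (
    "GOOGLE_PSE_API_KEY",
    "MODEL_API_BASE",
    "OPENROUTER_API_KEY",
)


def load_secrets(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required secrets, failing early if any is unset or empty."""
    missing: list[str] = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SECRETS}
```

Accepting the environment as a parameter lets tests pass a plain dictionary instead of mutating `os.environ`.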

Output Organization

```
runs/<run_id>/              # Per-run outputs
  report.md                 # Final synthesis report
  provenance.json           # Evidence provenance artifacts
  run.log                   # Human-readable stage progress
  trace.jsonl               # Detailed content (LLM calls, claims, propositions)
eval_runs/<suite>/<ts>/     # Eval harness outputs
  summary.json              # Success rates, p-values
  cases/                    # Per-case trial artifacts
data/agent.db               # SQLite persistence (sources, propositions, claims, annotations)
```
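For inspecting a run, a small reader for `trace.jsonl` might look like the sketch below. It assumes only that the file holds one JSON object per line, which the `.jsonl` extension suggests; the record fields are not documented here, so the helper (a hypothetical `iter_trace`) returns plain dictionaries without interpreting them.

```python
import json
from collections.abc import Iterator
from pathlib import Path
from typing import Any


def iter_trace(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Because the function is a generator, a large trace can be filtered record by record without loading the whole file into memory.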

When Making Changes

Follow this checklist:

1. **Adding new features**:

- Update README with user-facing changes

- Update system_plan_architecture.md if adding new subsystems

2. **Modifying pipeline**:

- Add fixture-based tests

- Run `uv run ty check src/research_agent/`

- Validate with offline run using `--input-dir`

3. **Changing LLM prompts**:

- Run eval suite to measure impact

- Iterate based on results

- Track changes in git

4. **Configuration changes**:

- Update `agent.example.yaml` template

- Document new environment variables in README

5. **Before committing** (mandatory):

```bash

uv run ty check src/research_agent/ && uv run pytest tests/

```

Key Commands Reference

All commands accept `--verbose/-v` (debug), `--debug` (trace), or `--quiet/-q` (warnings only).

```bash
# Initialize database
uv run research-agent db-init --config agent.yaml

# Run with live search
uv run research-agent run --config agent.yaml "What are the health effects of coffee?"

# Run offline with local sources (use -v to see chunk processing, LLM calls)
uv run research-agent run -v --config agent.yaml --input-dir offline_sources "Your query"

# Run with PubMed baseline provider
uv run research-agent run --config agent.yaml --pubmed-baseline-db ./cache/pubmed_baseline.db "global burden of diabetes"

# Test LLM connectivity
uv run research-agent llm-test --config agent.yaml --model local
uv run research-agent llm-test --config agent.yaml --model openrouter

# Run eval suite
uv run research-agent eval --config agent.yaml --suite evals/suites/smoke.yaml --trials 10 --temperature 0.2

# Type check
uv run ty check src/research_agent/

# Test
uv run pytest tests/ -v

# Watch run logs (in another terminal)
tail -f runs/<run_id>/run.log
```

Success Criteria

Your work is complete when:

- All type checks pass (`uv run ty check`)

- All tests pass (`uv run pytest`)

- Documentation is updated appropriately

- Offline run completes successfully (if pipeline changes)

- Eval suite shows acceptable performance (if prompt/logic changes)
