Expert assistant for CURATE - a German municipal strategy document processor. Helps with LLM-based extraction pipelines, entity resolution, operations-based parsing, and knowledge graph generation from PDFs.
Expert assistant for working with CURATE, a PDF Strategy Extractor that processes German municipal strategy documents using advanced LLMs. The system extracts structured data from PDFs through a multi-stage extraction pipeline with intelligent text extraction, OCR fallback, structure-aware chunking, and progressive LLM-based extraction with Pydantic schemas.
This skill provides expert guidance for developing, debugging, and extending the CURATE system. It understands the complete extraction pipeline architecture, entity relationship models, operations-based extraction approach, and best practices for LLM-based knowledge graph generation from municipal strategy documents.
Before making changes, always consult `docs/entity_structure.md`, `docs/extraction_pipeline.mmd`, and `OPERATIONS_BASED_EXTRACTION.md`.
The system extracts four core entity types with specific relationships:
1. **Dimensions** (Action Fields) - Strategic areas with hierarchical support
2. **Measures** - Implementation initiatives (Projects are parent Measures with `isParent: true`)
3. **Indicators** - Quantitative metrics
4. **Connections** - Links between entities via the `Measure2indicator` junction and `MeasuresExtended` relationships
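The four entity types above can be sketched as plain data structures. This is an illustrative sketch using dataclasses; the real definitions are Pydantic models in `src/core/schemas.py`, and all field names except `isParent` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dimension:
    """Strategic action field; parent_id enables hierarchy."""
    id: str
    title: str
    parent_id: Optional[str] = None

@dataclass
class Measure:
    """Implementation initiative; parent Measures represent Projects."""
    id: str
    title: str
    isParent: bool = False

@dataclass
class Indicator:
    """Quantitative metric."""
    id: str
    title: str
    unit: Optional[str] = None

@dataclass
class Measure2Indicator:
    """Junction connecting a Measure to an Indicator."""
    measure_id: str
    indicator_id: str

# A Project is modeled as a parent Measure:
project = Measure(id="m1", title="Klimaschutzkonzept", isParent=True)
```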
**Phase 1: Document Ingestion** (`/upload`) - PDF upload with intelligent text extraction and OCR fallback
**Phase 2: Enhanced Extraction** - recommended endpoints such as `/extract_enhanced_operations` for operations-based extraction
1. **Create and activate virtual environment:**
```bash
python3 -m venv venv
source venv/bin/activate
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
pip install -r requirements-dev.txt
```
3. **Install system dependencies (macOS):**
```bash
brew install tesseract poppler
python -m spacy download de_core_news_lg
```
4. **Configure environment:**
```bash
cp example.env .env
# Edit .env and set OPENROUTER_API_KEY
```
**Start the application:**
```bash
uvicorn main:app --reload
```
**Run tests:**
```bash
python -m pytest tests/ -v
python -m pytest tests/unit/ -v
python -m pytest tests/integration/ -v
python -m pytest tests/functional/ -v
```
**Code quality:**
```bash
black . && ruff check . && mypy .
ruff check --fix .
```
**Dead code analysis:**
```bash
vulture src/ --min-confidence 60
unimport --check src/
```
**JSON quality analysis:**
```bash
python -m json_analyzer analyze output.json
python -m json_analyzer analyze output.json --format html --output report.html --verbose
python -m json_analyzer compare before.json after.json
```
1. **Always read documentation first:**
- Check entity structure in `docs/entity_structure.md`
- Review extraction pipeline in `docs/extraction_pipeline.mmd`
- Consult operations guide in `OPERATIONS_BASED_EXTRACTION.md`
2. **Prefer operations-based extraction:**
- Use `/extract_enhanced_operations` endpoint
- Leverage CREATE/UPDATE/CONNECT schema
- Maintain global entity registry for consistency
3. **Understand UPDATE operation merge strategy:**
- String fields: Append new text with smart punctuation
- Lists: Extend with unique items only
- Dicts: Update with new key-value pairs
- **Never replace titles or remove existing content**
4. **Test changes thoroughly:**
- Run pytest suite before committing
- Test with sample PDFs via API
- Analyze extraction quality with json_analyzer
5. **Follow code quality standards:**
- Format with Black (88 char line limit)
- Lint with Ruff (160 char line limit)
- Type check with mypy
- Remove dead code
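The UPDATE merge strategy described in point 3 can be sketched as a single merge function. The helper name `merge_update` and the punctuation heuristic are assumptions for illustration; the actual implementation lives in the CURATE codebase.

```python
def merge_update(existing: dict, update: dict) -> dict:
    """Merge an UPDATE operation into an existing entity without
    replacing titles or removing existing content."""
    merged = dict(existing)
    for key, new in update.items():
        old = merged.get(key)
        if key == "title" or old is None:
            # Never replace an existing title; only fill missing fields.
            merged.setdefault(key, new)
        elif isinstance(old, str) and isinstance(new, str):
            # String fields: append new text with smart punctuation.
            if new not in old:
                sep = " " if old.endswith((".", "!", "?")) else ". "
                merged[key] = old + sep + new
        elif isinstance(old, list) and isinstance(new, list):
            # Lists: extend with unique items only.
            merged[key] = old + [x for x in new if x not in old]
        elif isinstance(old, dict) and isinstance(new, dict):
            # Dicts: update with new key-value pairs.
            merged[key] = {**old, **new}
    return merged
```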
**Common commands:**
```bash
curl -X POST "http://127.0.0.1:8000/upload" -F "[email protected]"
curl -X GET "http://127.0.0.1:8000/extract_enhanced_operations?source_id=<source-id>"
python -m pytest tests/ -v
python -m json_analyzer analyze output.json --verbose
```
**Adding new entity fields:**
1. Update Pydantic schemas in `src/core/schemas.py`
2. Modify prompts in `prompts/` directory
3. Update `docs/entity_structure.md`
4. Run tests and validate with json_analyzer
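Step 1 above might look like the following. The `budget` field is a hypothetical example, and the sketch uses a dataclass for illustration; the real schema is a Pydantic model in `src/core/schemas.py`, where the same principle applies: give a new field a default so existing extractions remain valid.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Measure:
    title: str
    # New optional field (hypothetical): defaults to None so that
    # documents extracted before the change still parse cleanly.
    budget: Optional[float] = None

# Old-style data still works; new data can carry the field.
m_old = Measure(title="Radwegeausbau")
m_new = Measure(title="Radwegeausbau", budget=250000.0)
```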
**Improving extraction quality:**
1. Analyze current results with json_analyzer
2. Identify quality issues in specific metric categories
3. Adjust prompts in `prompts/` YAML files
4. Test with sample documents
5. Compare before/after with json_analyzer compare
**Adding new LLM provider:**
1. Add provider configuration to `src/core/config.py`
2. Implement provider-specific logic in `src/core/llm.py`
3. Update `.env` with required API keys
4. Test structured and unstructured generation
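Steps 1 and 2 above can be sketched as a provider registry plus a lookup helper. The names `PROVIDERS`, `ProviderConfig`, and `resolve_provider` are illustrative assumptions, not the actual CURATE API; only `OPENROUTER_API_KEY` comes from the project's `.env` setup.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class ProviderConfig:
    base_url: str
    api_key_env: str  # name of the env var holding the API key

# Step 1: provider configuration (hypothetical shape of src/core/config.py).
PROVIDERS: Dict[str, ProviderConfig] = {
    "openrouter": ProviderConfig(
        base_url="https://openrouter.ai/api/v1",
        api_key_env="OPENROUTER_API_KEY",
    ),
}

def register_provider(name: str, cfg: ProviderConfig) -> None:
    """Add a new provider's configuration."""
    PROVIDERS[name] = cfg

def resolve_provider(name: str) -> ProviderConfig:
    """Step 2: provider-specific logic looks up its config here."""
    try:
        return PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unknown LLM provider: {name}") from None

# Registering a hypothetical local provider:
register_provider("local", ProviderConfig("http://localhost:11434/v1", "LOCAL_API_KEY"))
```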
**Testing the full pipeline:**
```bash
uvicorn main:app --reload
curl -X POST "http://127.0.0.1:8000/upload" -F "[email protected]"
curl -X GET "http://127.0.0.1:8000/extract_enhanced_operations?source_id=abc123"
python -m json_analyzer analyze output.json --format html --output report.html
```
**Debugging extraction issues:**
```bash
vulture src/ --min-confidence 100
python -c "from src.api import routes; print('OK')"
python -m pytest tests/unit/test_operations_fixes.py -v
```