Expert assistant for CURATE - a German municipal strategy document processor. Helps with LLM-based extraction pipelines, entity resolution, operations-based parsing, and knowledge graph generation from PDFs.
Expert assistant for working with CURATE, a PDF Strategy Extractor that processes German municipal strategy documents using advanced LLMs. The system extracts structured data from PDFs through a multi-stage extraction pipeline with intelligent text extraction, OCR fallback, structure-aware chunking, and progressive LLM-based extraction with Pydantic schemas.
This skill provides expert guidance for developing, debugging, and extending the CURATE system. It understands the complete extraction pipeline architecture, entity relationship models, operations-based extraction approach, and best practices for LLM-based knowledge graph generation from municipal strategy documents.
Before making changes, always consult `docs/entity_structure.md`, `docs/extraction_pipeline.mmd`, and `OPERATIONS_BASED_EXTRACTION.md`.
The system extracts four core entity types with specific relationships:
1. **Dimensions** (Action Fields) - Strategic areas with hierarchical support
2. **Measures** - Implementation initiatives (Projects are parent Measures with `isParent: true`)
3. **Indicators** - Quantitative metrics
4. **Connections** - Links between entities via the `Measure2indicator` junction and `MeasuresExtended` relationships
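The four entity types above can be sketched as plain data structures. This is an illustrative sketch using dataclasses; the real definitions are Pydantic models in `src/core/schemas.py`, and all field names except `isParent` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dimension:
    """Strategic action field; parent_id enables hierarchy."""
    id: str
    title: str
    parent_id: Optional[str] = None

@dataclass
class Measure:
    """Implementation initiative; parent Measures represent Projects."""
    id: str
    title: str
    isParent: bool = False

@dataclass
class Indicator:
    """Quantitative metric."""
    id: str
    title: str
    unit: Optional[str] = None

@dataclass
class Measure2Indicator:
    """Junction connecting a Measure to an Indicator."""
    measure_id: str
    indicator_id: str

# A Project is modeled as a parent Measure:
project = Measure(id="m1", title="Klimaschutzkonzept", isParent=True)
```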
**Phase 1: Document Ingestion** (`/upload`) - PDF upload with intelligent text extraction and OCR fallback
**Phase 2: Enhanced Extraction** - recommended endpoints such as `/extract_enhanced_operations` for operations-based extraction
1. **Create and activate virtual environment:**
```bash
python3 -m venv venv
source venv/bin/activate
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
pip install -r requirements-dev.txt
```
3. **Install system dependencies (macOS):**
```bash
brew install tesseract poppler
python -m spacy download de_core_news_lg
```
4. **Configure environment:**
```bash
cp example.env .env
# Edit .env and set OPENROUTER_API_KEY
```
**Start the application:**
```bash
uvicorn main:app --reload
```
**Run tests:**
```bash
python -m pytest tests/ -v
python -m pytest tests/unit/ -v
python -m pytest tests/integration/ -v
python -m pytest tests/functional/ -v
```
**Code quality:**
```bash
black . && ruff check . && mypy .
ruff check --fix .
```
**Dead code analysis:**
```bash
vulture src/ --min-confidence 60
unimport --check src/
```
**JSON quality analysis:**
```bash
python -m json_analyzer analyze output.json
python -m json_analyzer analyze output.json --format html --output report.html --verbose
python -m json_analyzer compare before.json after.json
```
1. **Always read documentation first:**
- Check entity structure in `docs/entity_structure.md`
- Review extraction pipeline in `docs/extraction_pipeline.mmd`
- Consult operations guide in `OPERATIONS_BASED_EXTRACTION.md`
2. **Prefer operations-based extraction:**
- Use `/extract_enhanced_operations` endpoint
- Leverage CREATE/UPDATE/CONNECT schema
- Maintain global entity registry for consistency
3. **Understand UPDATE operation merge strategy:**
- String fields: Append new text with smart punctuation
- Lists: Extend with unique items only
- Dicts: Update with new key-value pairs
- **Never replace titles or remove existing content**
4. **Test changes thoroughly:**
- Run pytest suite before committing
- Test with sample PDFs via API
- Analyze extraction quality with json_analyzer
5. **Follow code quality standards:**
- Format with Black (88 char line limit)
- Lint with Ruff (160 char line limit)
- Type check with mypy
- Remove dead code
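The UPDATE merge strategy described in point 3 can be sketched as a single merge function. The helper name `merge_update` and the punctuation heuristic are assumptions for illustration; the actual implementation lives in the CURATE codebase.

```python
def merge_update(existing: dict, update: dict) -> dict:
    """Merge an UPDATE operation into an existing entity without
    replacing titles or removing existing content."""
    merged = dict(existing)
    for key, new in update.items():
        old = merged.get(key)
        if key == "title" or old is None:
            # Never replace an existing title; only fill missing fields.
            merged.setdefault(key, new)
        elif isinstance(old, str) and isinstance(new, str):
            # String fields: append new text with smart punctuation.
            if new not in old:
                sep = " " if old.endswith((".", "!", "?")) else ". "
                merged[key] = old + sep + new
        elif isinstance(old, list) and isinstance(new, list):
            # Lists: extend with unique items only.
            merged[key] = old + [x for x in new if x not in old]
        elif isinstance(old, dict) and isinstance(new, dict):
            # Dicts: update with new key-value pairs.
            merged[key] = {**old, **new}
    return merged
```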
**Common commands:**
```bash
curl -X POST "http://127.0.0.1:8000/upload" -F "[email protected]"
curl -X GET "http://127.0.0.1:8000/extract_enhanced_operations?source_id=<source-id>"
python -m pytest tests/ -v
python -m json_analyzer analyze output.json --verbose
```
**Adding new entity fields:**
1. Update Pydantic schemas in `src/core/schemas.py`
2. Modify prompts in `prompts/` directory
3. Update `docs/entity_structure.md`
4. Run tests and validate with json_analyzer
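Step 1 above might look like the following. The `budget` field is a hypothetical example, and the sketch uses a dataclass for illustration; the real schema is a Pydantic model in `src/core/schemas.py`, where the same principle applies: give a new field a default so existing extractions remain valid.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Measure:
    title: str
    # New optional field (hypothetical): defaults to None so that
    # documents extracted before the change still parse cleanly.
    budget: Optional[float] = None

# Old-style data still works; new data can carry the field.
m_old = Measure(title="Radwegeausbau")
m_new = Measure(title="Radwegeausbau", budget=250000.0)
```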
**Improving extraction quality:**
1. Analyze current results with json_analyzer
2. Identify quality issues in specific metric categories
3. Adjust prompts in `prompts/` YAML files
4. Test with sample documents
5. Compare before/after with json_analyzer compare
**Adding new LLM provider:**
1. Add provider configuration to `src/core/config.py`
2. Implement provider-specific logic in `src/core/llm.py`
3. Update `.env` with required API keys
4. Test structured and unstructured generation
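Steps 1 and 2 above can be sketched as a provider registry plus a lookup helper. The names `PROVIDERS`, `ProviderConfig`, and `resolve_provider` are illustrative assumptions, not the actual CURATE API; only `OPENROUTER_API_KEY` comes from the project's `.env` setup.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class ProviderConfig:
    base_url: str
    api_key_env: str  # name of the env var holding the API key

# Step 1: provider configuration (hypothetical shape of src/core/config.py).
PROVIDERS: Dict[str, ProviderConfig] = {
    "openrouter": ProviderConfig(
        base_url="https://openrouter.ai/api/v1",
        api_key_env="OPENROUTER_API_KEY",
    ),
}

def register_provider(name: str, cfg: ProviderConfig) -> None:
    """Add a new provider's configuration."""
    PROVIDERS[name] = cfg

def resolve_provider(name: str) -> ProviderConfig:
    """Step 2: provider-specific logic looks up its config here."""
    try:
        return PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unknown LLM provider: {name}") from None

# Registering a hypothetical local provider:
register_provider("local", ProviderConfig("http://localhost:11434/v1", "LOCAL_API_KEY"))
```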
**Testing the full pipeline:**
```bash
uvicorn main:app --reload
curl -X POST "http://127.0.0.1:8000/upload" -F "[email protected]"
curl -X GET "http://127.0.0.1:8000/extract_enhanced_operations?source_id=abc123"
python -m json_analyzer analyze output.json --format html --output report.html
```
**Debugging extraction issues:**
```bash
vulture src/ --min-confidence 100
python -c "from src.api import routes; print('OK')"
python -m pytest tests/unit/test_operations_fixes.py -v
```