Monarch Initiative Koza Ingest

Instructions for working with Monarch Initiative data ingest projects that use the Koza framework to transform biomedical data into standardized knowledge graph formats (Biolink Model).

Project Overview

You are working on a **Monarch Initiative data ingest project** that uses Koza 2.x for ETL transformations. The project processes biological/biomedical data into knowledge graph formats compatible with the Biolink Model.

Project Structure

Key files and directories:

- `src/mmrrc-ingest/download.yaml` - Data source definitions and download configuration
- `src/mmrrc-ingest/transform.yaml` - Koza transformation pipeline configuration (uses Koza 2.x nested format)
- `src/mmrrc-ingest/transform.py` - Python transformation logic implementation
- `src/mmrrc-ingest/metadata.yaml` - Project metadata
- `src/mmrrc-ingest/cli.py` - Command-line interface
- `tests/` - Test directory

Essential Commands

```bash
# Setup project environment
just setup

# Download source data
just download # or: uv run mmrrc-ingest download

# Run transformation pipeline
just transform # or: uv run mmrrc-ingest transform

# Run test suite
just test

# Validate configuration files
just check-config
```

Development Workflow

Follow this sequence when working on the project:

1. **Define Data Sources**: Update `download.yaml` with data source URLs, formats, and metadata

2. **Configure Transformations**:
   - Update `transform.yaml` with the Koza 2.x nested structure (`reader`/`transform`/`writer`)
   - Implement transformation logic in `transform.py`

3. **Add Tests**: Create tests in `tests/` directory

4. **Validate**: Run `just check-config` to validate YAML configurations with Pydantic models

5. **Execute Pipeline**: Run `just transform` and verify output

6. **Update Documentation**: Keep `README.md` synchronized with changes

Critical: Koza 2.x Transform Patterns

Configuration Structure

Always use the nested `reader`/`transform`/`writer` structure in YAML configs:

```yaml
name: "my_transform"
reader:
  format: "csv" # or json, jsonl, tsv, etc.
  files: ["data.csv"]
  delimiter: ","
transform:
  code: "./transform.py"
writer:
  node_properties: [...]
  min_node_count: 100
```
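As a rough illustration of the nested layout, the sketch below checks an already-parsed config mapping for the three sections. The helper name `missing_sections` is hypothetical and is not part of Koza; `just check-config` remains the real validation step.

```python
from typing import Any

# The three top-level sections named in this guide.
REQUIRED_SECTIONS = ("reader", "transform", "writer")


def missing_sections(config: dict[str, Any]) -> list[str]:
    """Return the required sections that are absent or not mappings.

    Hypothetical helper for illustration only; not a Koza API.
    """
    return [
        section
        for section in REQUIRED_SECTIONS
        if not isinstance(config.get(section), dict)
    ]


# A parsed config mirroring the YAML above (as a YAML loader might return it).
nested = {
    "name": "my_transform",
    "reader": {"format": "csv", "files": ["data.csv"], "delimiter": ","},
    "transform": {"code": "./transform.py"},
    "writer": {"min_node_count": 100},
}

# A flat layout, with reader keys at the top level, fails the check.
flat = {"name": "my_transform", "format": "csv", "files": ["data.csv"]}

print(missing_sections(nested))  # []
print(missing_sections(flat))    # ['reader', 'transform', 'writer']
```

A check like this only catches the layout mistake; it says nothing about whether the values inside each section are valid.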

Transform Functions MUST Return Lists

**CRITICAL REQUIREMENT**: All transform functions MUST return a list of entities, never `None` or bare entity objects:

```python
from typing import Any

import koza  # needed so the @koza.transform_record() decorator resolves
from koza import KozaTransform
from koza.model.entity import Entity


@koza.transform_record()
def transform_record(koza_transform: KozaTransform, row: dict[str, Any]) -> list[Entity]:
    # Validation: return empty list for invalid records
    if not row.get('required_field'):
        return []  # CORRECT: Return empty list, NOT None

    # Create entity
    entity = Entity(...)

    # CORRECT: Return list containing entity
    return [entity]

    # INCORRECT: Do NOT return bare entity or None
    # return entity # ❌ Wrong
    # return None # ❌ Wrong
```
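To show the list contract without depending on Koza itself, the sketch below uses stand-in `Node` and `Edge` dataclasses. These classes, the field names (`id`, `name`, `phenotype`), and the `has_phenotype` predicate are all made up for illustration and are not the real Biolink classes. The point is that zero, one, or several entities are all returned the same way.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class Node:
    """Stand-in for a Biolink node class (hypothetical)."""
    id: str
    name: str


@dataclass
class Edge:
    """Stand-in for a Biolink association class (hypothetical)."""
    subject: str
    predicate: str
    object: str


def row_to_entities(row: dict[str, Any]) -> list[Union[Node, Edge]]:
    """Map one input row to zero, one, or several entities."""
    if not row.get("id"):
        return []  # invalid row: empty list, never None

    entities: list[Union[Node, Edge]] = [Node(id=row["id"], name=row.get("name", ""))]
    if row.get("phenotype"):
        # Optional second entity; the list simply grows.
        entities.append(Edge(row["id"], "has_phenotype", row["phenotype"]))
    return entities


print(len(row_to_entities({})))                                  # 0
print(len(row_to_entities({"id": "X:1"})))                       # 1
print(len(row_to_entities({"id": "X:1", "phenotype": "HP:1"})))  # 2
```

Because every branch returns a list, callers never need to special-case `None` or a single object.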

Type Annotations

Always include proper type hints for mypy compatibility:

```python
from typing import Any

import koza  # needed so the @koza.transform_record() decorator resolves
from koza import KozaTransform
from koza.model.entity import Entity


@koza.transform_record()
def transform_record(koza_transform: KozaTransform, row: dict[str, Any]) -> list[Entity]:
    """Transform a single record into knowledge graph entities."""
    # Implementation here
    entity = Entity(...)
    return [entity]
```

Multiple Transforms

For projects with multiple transformation pipelines:

- Create separate YAML config files (e.g., `genotype.yaml`, `phenotype.yaml`)
- Each transform needs its own Python code file
- Update `justfile` to run all transforms in sequence
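One way to keep several transforms consistent is to check that each one has both of its files before running anything. The sketch below assumes a naming scheme of `<name>.yaml` plus `<name>.py` in the same directory; that pairing is an assumption suggested by the example names above, not something Koza requires, and `missing_transform_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_transform_files(base_dir: Path, names: list[str]) -> list[Path]:
    """List expected config/code files that do not exist.

    Assumes each transform `<name>` has `<name>.yaml` and `<name>.py`
    side by side (a hypothetical convention for this sketch).
    """
    expected = [
        base_dir / f"{name}{suffix}"
        for name in names
        for suffix in (".yaml", ".py")
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Example: report gaps for two transforms in the package directory.
    for path in missing_transform_files(Path("src/mmrrc-ingest"), ["genotype", "phenotype"]):
        print(f"missing: {path}")
```

Running such a check before the `justfile` recipe that chains the transforms makes a forgotten code file fail early instead of partway through the sequence.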

File Editing Priorities

When making changes, prioritize files in this order:

1. **High Priority** (core ingest logic):
   - `download.yaml` - data source definitions
   - `transform.yaml` - transformation configuration
   - `transform.py` - transformation implementation

2. **Medium Priority**:
   - `metadata.yaml` - project metadata
   - `cli.py` - CLI commands
   - Test files in `tests/`

3. **Low Priority** (unless specifically requested):
   - Documentation files
   - GitHub workflows
   - Configuration files

Common Tasks

Adding New Data Sources

Update `download.yaml` with new source definitions:

```yaml
url: "https://example.com/data.csv"
format: "csv"
delimiter: ","
```

Changing Transformation Logic

1. Modify `transform.py` with new transformation code

2. Update `transform.yaml` if the reader or writer configuration needs to change

3. Run `just check-config` to validate

4. Run `just transform` to test

5. Add/update tests in `tests/`
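For step 5, one possible test helper runs a transform over sample rows and reports every row whose result breaks the list contract. `find_contract_violations` and the two toy transforms are hypothetical; a real test would call the project's own transform function with whatever arguments it expects.

```python
from typing import Any, Callable


def find_contract_violations(
    transform: Callable[[dict[str, Any]], Any],
    rows: list[dict[str, Any]],
) -> list[int]:
    """Return indexes of rows for which `transform` did not return a list."""
    return [
        index
        for index, row in enumerate(rows)
        if not isinstance(transform(row), list)
    ]


def good_transform(row: dict[str, Any]) -> list[dict[str, Any]]:
    # Follows the contract: empty list for rows without an id.
    return [{"id": row["id"]}] if row.get("id") else []


def bad_transform(row: dict[str, Any]) -> Any:
    # Breaks the contract: None for rows without an id, bare dict otherwise.
    return {"id": row["id"]} if row.get("id") else None


sample_rows = [{"id": "X:1"}, {"id": ""}, {}]

assert find_contract_violations(good_transform, sample_rows) == []
assert find_contract_violations(bad_transform, sample_rows) == [0, 1, 2]
```

Including rows with missing or empty required fields in the sample is what exercises the empty-list branch.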

Adding CLI Commands

Extend `cli.py` with new command functions following existing patterns.

Debugging Pipeline Issues

1. Check Koza logs for error messages

2. Validate YAML syntax with `just check-config`

3. Run transform with verbose logging

4. Verify input data format matches reader configuration

5. Ensure transform functions return lists
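Step 4 can be checked by reading just the header row with the same delimiter the reader is configured to use. This is a sketch using only the standard library; `missing_columns` and the column names are hypothetical.

```python
import csv
import io
from typing import TextIO


def missing_columns(handle: TextIO, delimiter: str, expected: list[str]) -> list[str]:
    """Return expected column names that are absent from the header row."""
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return [name for name in expected if name not in header]


# In-memory sample; a real check would open the downloaded file instead.
sample = io.StringIO("strain_id\tphenotype\nMMRRC:1\tHP:0000001\n")

# Correct delimiter: every expected column is found.
print(missing_columns(sample, "\t", ["strain_id", "phenotype"]))  # []

# Wrong delimiter: the header parses as one wide column, so both are missing.
sample.seek(0)
print(missing_columns(sample, ",", ["strain_id", "phenotype"]))  # ['strain_id', 'phenotype']
```

A result listing every expected column as missing usually points to a delimiter mismatch rather than to genuinely absent fields.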

Dependencies and Package Management

- Use **uv** for package management
- Maintain compatibility with existing Koza/KGX ecosystem
- Add dependencies via `pyproject.toml`
- Run `just setup` after dependency changes

Important Reminders

- **Context**: This processes biological/biomedical data into knowledge graphs using Biolink Model
- **Koza version**: Use Koza 2.x patterns (nested config structure, list returns)
- **Type safety**: Always include type annotations for mypy compatibility
- **Validation**: Run `just check-config` before committing changes
- **Testing**: Run `just test` after modifications
- **Return types**: Transform functions MUST return lists, never None or bare entities

Resources

- Koza documentation: Refer to Koza 2.x documentation for framework details
- Biolink Model: Output should conform to Biolink Model standards
- Monarch Initiative: Part of the broader Monarch knowledge graph ecosystem
