Transform biological and biomedical data into standardized knowledge graphs using the Koza framework. This skill guides you through configuring data sources, implementing transformation logic, and validating outputs for Monarch Initiative ingest projects.
This skill helps you work with Monarch Initiative data ingest projects that use Koza for ETL (Extract, Transform, Load) pipelines. You'll learn how to:

- Configure data sources for download
- Implement transformation logic in Python
- Validate transformed outputs

The output is standardized knowledge graph data in Biolink Model format.
```
src/mmrrc-ingest/
├── download.yaml # Data source definitions
├── transform.yaml # Koza transformation config (nested format)
├── metadata.yaml # Project metadata
├── transform.py # Python transformation logic
└── cli.py # Command-line interface
tests/ # Test suite
```
First, set up the project environment and download source data:
```bash
just setup
just download
```
**When to update `download.yaml`**: Add new data sources by specifying URLs, formats, and metadata. Each source needs a name, URL, file format, and optional compression settings.
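As a sketch of what a multi-source `download.yaml` might look like (the `name`/`url`/`format` keys mirror the example later in this document; the `compression` key is an assumption — check the schema of the downloader your project uses):

```yaml
sources:
  - name: "mouse_models"
    url: "https://example.org/mmrrc_catalog.csv"
    format: "csv"
  - name: "annotations"
    url: "https://example.org/annotations.json.gz"
    format: "json"
    compression: "gzip"  # hypothetical key; verify against your downloader's schema
```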
Edit `transform.yaml` using the Koza 2.x nested structure. This is **critical** for proper pipeline execution:
```yaml
name: "my_transform"
reader:
  format: "csv"  # or "json", "jsonl", "tsv"
  files: ["data/my_file.csv"]
  delimiter: ","
  # Add header, skip_rows, etc. as needed
transform:
  code: "./transform.py"
  # Optional: filters, map_cache settings
writer:
  format: "jsonl"  # Output format
  node_properties: ["id", "name", "category"]
  min_node_count: 100  # Validation threshold
```
**Key configuration sections**:
- `reader`: input format, file paths, and parsing options (delimiter, header, skipped rows)
- `transform`: path to the Python transformation code, plus optional filters and map cache settings
- `writer`: output format, node properties, and validation thresholds such as `min_node_count`
Edit `transform.py` with these **mandatory patterns**:
#### Pattern 1: Always Return Lists
```python
import koza  # provides the @koza.transform_record decorator
from typing import Any

from koza import KozaTransform
from biolink_model.datamodel.pydanticmodel_v2 import Entity


@koza.transform_record()
def transform_record(koza_transform: KozaTransform, row: dict[str, Any]) -> list[Entity]:
    # CRITICAL: Return empty list for filtered records, NOT None
    if not row.get('required_field'):
        return []

    # Create your entity
    entity = Entity(
        id=row['id'],
        name=row['name'],
        category=['biolink:YourCategory']
    )

    # CRITICAL: Return a list, NOT the bare entity
    return [entity]
```
**Why lists?** Koza expects all transform functions to return lists. Returning `None` or bare entities will cause pipeline failures.
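To see why, here is a minimal plain-Python illustration of the contract (no Koza dependency; `fake_pipeline` is a hypothetical stand-in for Koza's consumption loop, not its actual implementation):

```python
# The pipeline extends its output buffer with whatever the transform
# function returns, so the return value must always be iterable -> a list.
def fake_pipeline(transform, rows):
    entities = []
    for row in rows:
        entities.extend(transform(row))  # None or a bare object breaks this
    return entities

def transform(row):
    if not row.get("id"):
        return []                   # filtered record: empty list, not None
    return [{"id": row["id"]}]      # always a list, even for one entity

print(fake_pipeline(transform, [{"id": "HGNC:5"}, {}, {"id": "HGNC:6"}]))
# -> [{'id': 'HGNC:5'}, {'id': 'HGNC:6'}]
```

The empty-list convention lets filtering and normal emission share one code path: the pipeline never has to special-case skipped records.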
#### Pattern 2: Type Annotations
Always include proper type hints for mypy compatibility:
```python
import koza
from typing import Any

from koza import KozaTransform
from biolink_model.datamodel.pydanticmodel_v2 import Entity


@koza.transform_record()
def transform_record(
    koza_transform: KozaTransform,
    row: dict[str, Any],
) -> list[Entity]:
    # Your transformation logic
    ...
```
#### Pattern 3: Multiple Transforms
For projects with multiple data types (e.g., genotypes and phenotypes):
1. Create separate YAML files: `genotype.yaml`, `phenotype.yaml`
2. Create corresponding Python files: `genotype_transform.py`, `phenotype_transform.py`
3. Update justfile to run all transforms:
```bash
just transform-genotype
just transform-phenotype
```
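Step 3 above might look like the following justfile fragment. This is a hypothetical sketch: the recipe names come from the commands above, but each recipe body depends on how your project invokes Koza, so adjust the `koza transform` call to match your setup.

```make
# Hypothetical justfile recipes (sketch): one per data type, plus an
# umbrella recipe that runs them all.
transform-genotype:
    uv run koza transform src/my-ingest/genotype.yaml

transform-phenotype:
    uv run koza transform src/my-ingest/phenotype.yaml

# `just transform` runs both
transform: transform-genotype transform-phenotype
```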
After making changes, validate your configuration and run tests:
```bash
just check-config
just transform
just test
```
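A minimal test for the list-return contract might look like the sketch below. This is a hypothetical layout, not the project's actual test suite: real tests would import the project's `transform_record` and assert on Biolink model instances, but a dict-based stand-in is used here to keep the example self-contained.

```python
# Hypothetical pytest sketch: call the transform function directly with a
# sample row and assert on the returned list. The function below is a
# simplified stand-in for the project's real transform_record.
def transform_record(koza_transform, row):
    if not row.get("gene_id"):
        return []
    return [{"id": f"HGNC:{row['gene_id']}", "category": ["biolink:Gene"]}]

def test_skips_rows_without_gene_id():
    assert transform_record(None, {}) == []

def test_emits_single_gene_in_a_list():
    result = transform_record(None, {"gene_id": "5"})
    assert result == [{"id": "HGNC:5", "category": ["biolink:Gene"]}]
```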
**Configuration errors**: Check YAML syntax and nested structure in `transform.yaml`
**Transform failures**: Confirm that transform functions return lists (never `None` or a bare entity) and that type annotations match the expected Koza signatures
**Data validation errors**: Ensure `min_node_count` and required node properties are met
| Task | Action |
|------|--------|
| Add new data source | Update `download.yaml` with source URL and metadata |
| Change transformation logic | Edit `transform.py` and corresponding YAML config |
| Add CLI command | Extend `cli.py` with new Click commands |
| Debug pipeline | Check Koza logs, run `just check-config` |
| Update dependencies | Use `uv` package manager, maintain Koza/KGX compatibility |
**High priority** (core ingest logic): `transform.py`, `transform.yaml`
**Medium priority**: `download.yaml`, `cli.py`, tests
**Low priority**: `metadata.yaml`, documentation
**download.yaml**:
```yaml
sources:
  - name: "gene_data"
    url: "https://example.org/genes.csv"
    format: "csv"
```
**transform.yaml**:
```yaml
name: "gene_transform"
reader:
  format: "csv"
  files: ["data/genes.csv"]
transform:
  code: "./transform.py"
writer:
  format: "jsonl"
  node_properties: ["id", "name", "category"]
  min_node_count: 50
```
**transform.py**:
```python
import koza
from typing import Any

from koza import KozaTransform
from biolink_model.datamodel.pydanticmodel_v2 import Gene


@koza.transform_record()
def transform_record(koza_transform: KozaTransform, row: dict[str, Any]) -> list[Gene]:
    if not row.get('gene_id'):
        return []

    gene = Gene(
        id=f"HGNC:{row['gene_id']}",
        name=row['gene_name'],
        category=['biolink:Gene']
    )
    return [gene]
```
This project transforms raw biological data (genes, phenotypes, mouse models, etc.) into standardized knowledge graph formats compliant with the Biolink Model. The Koza framework handles the ETL pipeline, and outputs are used by the Monarch Initiative for data integration and querying.