Transform biological and biomedical data into standardized knowledge graphs using the Koza framework. This skill guides you through configuring data sources, implementing transformation logic, and validating outputs for Monarch Initiative ingest projects.
This skill helps you work with Monarch Initiative data ingest projects that use Koza for ETL (Extract, Transform, Load) pipelines. You'll learn how to:

- Configure data sources for download
- Implement transformation logic in Python
- Validate transformed outputs

The output is standardized knowledge graph data in Biolink Model format.
```
src/mmrrc-ingest/
├── download.yaml # Data source definitions
├── transform.yaml # Koza transformation config (nested format)
├── metadata.yaml # Project metadata
├── transform.py # Python transformation logic
└── cli.py # Command-line interface
tests/ # Test suite
```
First, set up the project environment and download source data:
```bash
just setup
just download
```
**When to update `download.yaml`**: Add new data sources by specifying URLs, formats, and metadata. Each source needs a name, URL, file format, and optional compression settings.
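As a sketch of what a multi-source `download.yaml` might look like (the `name`/`url`/`format` keys mirror the example later in this document; the `compression` key is an assumption — check the schema of the downloader your project uses):

```yaml
sources:
  - name: "mouse_models"
    url: "https://example.org/mmrrc_catalog.csv"
    format: "csv"
  - name: "annotations"
    url: "https://example.org/annotations.json.gz"
    format: "json"
    compression: "gzip"  # hypothetical key; verify against your downloader's schema
```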
Edit `transform.yaml` using the Koza 2.x nested structure. This is **critical** for proper pipeline execution:
```yaml
name: "my_transform"
reader:
  format: "csv"  # or "json", "jsonl", "tsv"
  files: ["data/my_file.csv"]
  delimiter: ","
  # Add header, skip_rows, etc. as needed
transform:
  code: "./transform.py"
  # Optional: filters, map_cache settings
writer:
  format: "jsonl"  # Output format
  node_properties: ["id", "name", "category"]
  min_node_count: 100  # Validation threshold
```
**Key configuration sections**:
- `reader`: input format, file paths, and parsing options (delimiter, header, skipped rows)
- `transform`: path to the Python transformation code, plus optional filters and map cache settings
- `writer`: output format, node properties, and validation thresholds such as `min_node_count`
Edit `transform.py` with these **mandatory patterns**:
#### Pattern 1: Always Return Lists
```python
import koza  # provides the @koza.transform_record decorator
from typing import Any

from koza import KozaTransform
from biolink_model.datamodel.pydanticmodel_v2 import Entity


@koza.transform_record()
def transform_record(koza_transform: KozaTransform, row: dict[str, Any]) -> list[Entity]:
    # CRITICAL: Return empty list for filtered records, NOT None
    if not row.get('required_field'):
        return []

    # Create your entity
    entity = Entity(
        id=row['id'],
        name=row['name'],
        category=['biolink:YourCategory']
    )

    # CRITICAL: Return a list, NOT the bare entity
    return [entity]
```
**Why lists?** Koza expects all transform functions to return lists. Returning `None` or bare entities will cause pipeline failures.
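To see why, here is a minimal plain-Python illustration of the contract (no Koza dependency; `fake_pipeline` is a hypothetical stand-in for Koza's consumption loop, not its actual implementation):

```python
# The pipeline extends its output buffer with whatever the transform
# function returns, so the return value must always be iterable -> a list.
def fake_pipeline(transform, rows):
    entities = []
    for row in rows:
        entities.extend(transform(row))  # None or a bare object breaks this
    return entities

def transform(row):
    if not row.get("id"):
        return []                   # filtered record: empty list, not None
    return [{"id": row["id"]}]      # always a list, even for one entity

print(fake_pipeline(transform, [{"id": "HGNC:5"}, {}, {"id": "HGNC:6"}]))
# -> [{'id': 'HGNC:5'}, {'id': 'HGNC:6'}]
```

The empty-list convention lets filtering and normal emission share one code path: the pipeline never has to special-case skipped records.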
#### Pattern 2: Type Annotations
Always include proper type hints for mypy compatibility:
```python
import koza
from typing import Any

from koza import KozaTransform
from biolink_model.datamodel.pydanticmodel_v2 import Entity


@koza.transform_record()
def transform_record(
    koza_transform: KozaTransform,
    row: dict[str, Any],
) -> list[Entity]:
    # Your transformation logic
    ...
```
#### Pattern 3: Multiple Transforms
For projects with multiple data types (e.g., genotypes and phenotypes):
1. Create separate YAML files: `genotype.yaml`, `phenotype.yaml`
2. Create corresponding Python files: `genotype_transform.py`, `phenotype_transform.py`
3. Update justfile to run all transforms:
```bash
just transform-genotype
just transform-phenotype
```
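Step 3 above might look like the following justfile fragment. This is a hypothetical sketch: the recipe names come from the commands above, but each recipe body depends on how your project invokes Koza, so adjust the `koza transform` call to match your setup.

```make
# Hypothetical justfile recipes (sketch): one per data type, plus an
# umbrella recipe that runs them all.
transform-genotype:
    uv run koza transform src/my-ingest/genotype.yaml

transform-phenotype:
    uv run koza transform src/my-ingest/phenotype.yaml

# `just transform` runs both
transform: transform-genotype transform-phenotype
```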
After making changes, validate your configuration and run tests:
```bash
just check-config
just transform
just test
```
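A minimal test for the list-return contract might look like the sketch below. This is a hypothetical layout, not the project's actual test suite: real tests would import the project's `transform_record` and assert on Biolink model instances, but a dict-based stand-in is used here to keep the example self-contained.

```python
# Hypothetical pytest sketch: call the transform function directly with a
# sample row and assert on the returned list. The function below is a
# simplified stand-in for the project's real transform_record.
def transform_record(koza_transform, row):
    if not row.get("gene_id"):
        return []
    return [{"id": f"HGNC:{row['gene_id']}", "category": ["biolink:Gene"]}]

def test_skips_rows_without_gene_id():
    assert transform_record(None, {}) == []

def test_emits_single_gene_in_a_list():
    result = transform_record(None, {"gene_id": "5"})
    assert result == [{"id": "HGNC:5", "category": ["biolink:Gene"]}]
```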
**Configuration errors**: Check YAML syntax and nested structure in `transform.yaml`
**Transform failures**: Confirm that transform functions return lists (never `None` or a bare entity) and that type annotations match the expected Koza signatures
**Data validation errors**: Ensure `min_node_count` and required node properties are met
| Task | Action |
|------|--------|
| Add new data source | Update `download.yaml` with source URL and metadata |
| Change transformation logic | Edit `transform.py` and corresponding YAML config |
| Add CLI command | Extend `cli.py` with new Click commands |
| Debug pipeline | Check Koza logs, run `just check-config` |
| Update dependencies | Use `uv` package manager, maintain Koza/KGX compatibility |
**High priority** (core ingest logic): `transform.py`, `transform.yaml`
**Medium priority**: `download.yaml`, `cli.py`, tests
**Low priority**: `metadata.yaml`, documentation
**download.yaml**:
```yaml
sources:
  - name: "gene_data"
    url: "https://example.org/genes.csv"
    format: "csv"
```
**transform.yaml**:
```yaml
name: "gene_transform"
reader:
  format: "csv"
  files: ["data/genes.csv"]
transform:
  code: "./transform.py"
writer:
  format: "jsonl"
  node_properties: ["id", "name", "category"]
  min_node_count: 50
```
**transform.py**:
```python
import koza
from typing import Any

from koza import KozaTransform
from biolink_model.datamodel.pydanticmodel_v2 import Gene


@koza.transform_record()
def transform_record(koza_transform: KozaTransform, row: dict[str, Any]) -> list[Gene]:
    if not row.get('gene_id'):
        return []

    gene = Gene(
        id=f"HGNC:{row['gene_id']}",
        name=row['gene_name'],
        category=['biolink:Gene']
    )
    return [gene]
```
This project transforms raw biological data (genes, phenotypes, mouse models, etc.) into standardized knowledge graph formats compliant with the Biolink Model. The Koza framework handles the ETL pipeline, and outputs are used by the Monarch Initiative for data integration and querying.