Convert raw Nanopore signal data (FAST5/POD5) to nucleotide sequences using the Dorado basecaller. Covers model selection, GPU acceleration, modified base detection, and quality filtering. Use when processing raw Nanopore data before alignment. Note: Guppy is deprecated; use Dorado for all new analyses.
Convert the raw electrical signal from Nanopore sequencing into nucleotide sequences using the Dorado basecaller. This skill covers the complete workflow from raw signal files to quality-filtered FASTQ/BAM output.
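This skill helps you convert raw Nanopore sequencing data (FAST5 or POD5 format) into basecalled nucleotide sequences. It covers model selection, GPU acceleration, modified base detection, and quality filtering.
Use this skill when you need to process raw Nanopore data before alignment.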
First, check whether you have POD5 (modern) or FAST5 (legacy) files:
```bash
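# List the input directory to see which formats are present
ls -lh input_dir/

# Convert legacy FAST5 files to POD5 if any are found
if ls input_dir/*.fast5 1> /dev/null 2>&1; then
    pod5 convert fast5 input_dir/*.fast5 --output pod5_output/
    INPUT_DIR="pod5_output"
else
    INPUT_DIR="input_dir"
fi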
```
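The check above only looks at the top level of `input_dir/`. Sequencing runs often place files in nested subdirectories, so a recursive check can be useful. The following is a minimal sketch; `count_ext` and `detect_format` are hypothetical helper names, not part of Dorado or the pod5 tools.

```bash
# count_ext DIR EXT: print how many *.EXT files exist under DIR (recursive).
count_ext() {
    find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# detect_format DIR: print "pod5", "fast5", "mixed", or "none".
detect_format() {
    local n_pod5 n_fast5
    n_pod5=$(count_ext "$1" pod5)
    n_fast5=$(count_ext "$1" fast5)
    if [ "$n_pod5" -gt 0 ] && [ "$n_fast5" -gt 0 ]; then
        echo mixed
    elif [ "$n_pod5" -gt 0 ]; then
        echo pod5
    elif [ "$n_fast5" -gt 0 ]; then
        echo fast5
    else
        echo none
    fi
}
```

For example, `detect_format input_dir` could decide whether the FAST5 conversion step is needed before basecalling.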
Select a model based on your accuracy vs. speed requirements:
**Model comparison:**
List all available models:
```bash
dorado download --list
```
**Basic basecalling (automatic model selection):**
```bash
dorado basecaller sup $INPUT_DIR > calls.bam
```
**With specific model version:**
```bash
# Set MODEL to a full versioned model name from `dorado download --list`
MODEL="<model_name>@<version>"
dorado download --model "$MODEL"
dorado basecaller "$MODEL" $INPUT_DIR > calls.bam
```
**Output FASTQ instead of BAM:**
```bash
dorado basecaller sup $INPUT_DIR --emit-fastq > calls.fastq
```
**Specify GPU device:**
```bash
# Single GPU
dorado basecaller sup $INPUT_DIR --device cuda:0 > calls.bam
# Multiple GPUs
dorado basecaller sup $INPUT_DIR --device cuda:0,1 > calls.bam
# CPU only (much slower)
dorado basecaller sup $INPUT_DIR --device cpu > calls.bam
```
**Adjust batch size if running out of memory:**
```bash
dorado basecaller sup $INPUT_DIR --batchsize 32 > calls.bam
```
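Finding a batch size that fits in memory is usually trial and error. One way to automate it is to retry with a halved batch size after each failure. This is a sketch only: `run_with_backoff` is a hypothetical helper, and it retries on any non-zero exit status, not just out-of-memory errors.

```bash
# run_with_backoff OUT START MIN CMD...
# Hypothetical helper: runs CMD with "--batchsize N" appended, writing stdout
# to OUT. After each failure N is halved, until CMD succeeds or N < MIN.
run_with_backoff() {
    local out="$1" bs="$2" min="$3"
    shift 3
    while [ "$bs" -ge "$min" ]; do
        echo "Trying --batchsize $bs" >&2
        # Redirect inside the loop so a failed attempt never pollutes OUT
        if "$@" --batchsize "$bs" > "$out"; then
            return 0
        fi
        bs=$((bs / 2))
    done
    echo "No batch size from START down to $min succeeded" >&2
    return 1
}
```

A call might look like `run_with_backoff calls.bam 256 16 dorado basecaller sup "$INPUT_DIR"`. The starting and minimum values here are arbitrary examples, not recommendations.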
**Monitor GPU usage:**
```bash
watch -n 1 nvidia-smi
```
If you need to detect DNA modifications like methylation:
```bash
# 5mC and 5hmC in CpG context
dorado basecaller sup,5mCG_5hmCG $INPUT_DIR > calls_mods.bam
# 5mC in CpG context only
dorado basecaller sup,5mCG $INPUT_DIR > calls_5mc.bam
# 6mA
dorado basecaller sup,6mA $INPUT_DIR > calls_6ma.bam
```
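Modified base calls are stored in the BAM as `MM`/`ML` tags, as defined in the SAM tags specification. To confirm that a run actually produced them, count the reads that carry an `MM` tag. The sketch below assumes SAM text on standard input; `count_mod_reads` is a hypothetical helper name.

```bash
# count_mod_reads: read SAM text on stdin and print
# "<reads with an MM tag> <total reads>".
count_mod_reads() {
    awk -F '\t' '
        /^@/ { next }                      # skip header lines
        {
            total++
            for (i = 12; i <= NF; i++)     # optional tags start at column 12
                if ($i ~ /^MM:Z:/) { tagged++; break }
        }
        END { printf "%d %d\n", tagged, total }
    '
}
```

Typical use would be `samtools view calls_mods.bam | count_mod_reads`. A first number of zero suggests the modification model was not applied.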
For higher accuracy when you have duplex reads:
```bash
dorado duplex sup $INPUT_DIR > duplex.bam
```
If you used barcoded samples:
```bash
# Basecall and classify barcodes in one pass
dorado basecaller sup $INPUT_DIR --kit-name SQK-NBD114-24 > calls.bam
# Split the already-classified BAM into one file per barcode
dorado demux calls.bam --output-dir demuxed/ --no-classify
```
Filter reads by quality score and length:
```bash
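# Convert BAM to compressed FASTQ
samtools fastq calls.bam | gzip > calls.fastq.gz
# Keep reads with mean quality >= 10 and length >= 500 bp (chopper)
gunzip -c calls.fastq.gz | chopper -q 10 -l 500 | gzip > filtered.fastq.gz
# Alternative: the same thresholds with NanoFilt (use one tool or the other)
gunzip -c calls.fastq.gz | NanoFilt -q 10 -l 500 | gzip > filtered.fastq.gz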
```
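After filtering, it helps to know how many reads survived the thresholds. The following sketch compares record counts before and after. It assumes standard four-line FASTQ records, and `count_reads` and `retention` are hypothetical helper names.

```bash
# count_reads FILE.fastq.gz: print the number of records
# (assumes exactly 4 lines per record).
count_reads() {
    gunzip -c "$1" | awk 'END { print int(NR / 4) }'
}

# retention RAW FILTERED: print how many reads were kept and the percentage.
retention() {
    local raw kept
    raw=$(count_reads "$1")
    kept=$(count_reads "$2")
    awk -v k="$kept" -v r="$raw" 'BEGIN {
        pct = (r > 0) ? 100 * k / r : 0
        printf "%d/%d reads kept (%.1f%%)\n", k, r, pct
    }'
}
```

Running `retention calls.fastq.gz filtered.fastq.gz` prints a one-line summary. A very low percentage may indicate that the quality or length threshold is too strict for the run.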
Generate QC reports to assess basecalling quality:
```bash
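# QC plots from the filtered FASTQ
NanoPlot --fastq filtered.fastq.gz -o qc_report/ --plots hex dot
# Alternative: QC directly from the unaligned BAM (separate output directory)
NanoPlot --ubam calls.bam -o qc_report_bam/
# Quick summary statistics
seqkit stats filtered.fastq.gz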
```
Save this as `basecall_pipeline.sh`:
```bash
#!/bin/bash
# Usage: basecall_pipeline.sh <input_dir> <output_dir> [model]
set -euo pipefail   # stop on errors, unset variables, and failed pipes

INPUT=${1:?Usage: $0 <input_dir> <output_dir> [model]}
OUTPUT=${2:?Usage: $0 <input_dir> <output_dir> [model]}
MODEL=${3:-sup}     # default to the sup model

mkdir -p "$OUTPUT"

# Convert legacy FAST5 input to POD5 if a fast5/ subdirectory exists
if [ -d "$INPUT/fast5" ]; then
    echo "Converting FAST5 to POD5..."
    pod5 convert fast5 "$INPUT"/fast5/*.fast5 --output "$OUTPUT/pod5/"
    INPUT_DIR="$OUTPUT/pod5"
else
    INPUT_DIR="$INPUT"
fi

echo "Basecalling with $MODEL model..."
dorado basecaller "$MODEL" "$INPUT_DIR" > "$OUTPUT/calls.bam"

echo "Converting to FASTQ..."
samtools fastq "$OUTPUT/calls.bam" | gzip > "$OUTPUT/calls.fastq.gz"

echo "Filtering (Q≥10, len≥500)..."
gunzip -c "$OUTPUT/calls.fastq.gz" | chopper -q 10 -l 500 | gzip > "$OUTPUT/filtered.fastq.gz"

echo "Generating QC report..."
NanoPlot --fastq "$OUTPUT/filtered.fastq.gz" -o "$OUTPUT/qc/"

echo "Done! Filtered reads: $OUTPUT/filtered.fastq.gz"
```
Run it:
```bash
bash basecall_pipeline.sh /path/to/raw_data /path/to/output sup
```
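The pipeline calls several external tools, and a missing one otherwise only shows up partway through a long run. A small pre-flight check can catch this early. This is a sketch; `require_tools` is a hypothetical helper name.

```bash
# require_tools NAME...: report every missing tool on stderr and
# return non-zero if any of them is absent from PATH.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Placing `require_tools dorado pod5 samtools chopper NanoPlot || exit 1` near the top of `basecall_pipeline.sh` would stop the script before any basecalling starts.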
**R10.4.1 (current):**
**R9.4.1 (legacy):**
| Model | VRAM Required | Typical Speed |
|-------|--------------|---------------|
| fast | 4 GB | ~450 bases/s |
| hac | 8 GB | ~200 bases/s |
| sup | 12 GB | ~50 bases/s |
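The VRAM column above can be turned into a simple selection rule. The sketch below picks the most accurate tier whose listed requirement fits the available memory. `choose_model` is a hypothetical helper, and its thresholds are taken directly from the table, not from Dorado itself.

```bash
# choose_model VRAM_MIB: print the most accurate tier whose VRAM figure in
# the table fits (sup: 12 GB, hac: 8 GB, fast: 4 GB); return 1 if none fits.
choose_model() {
    local gib=$(( $1 / 1024 ))   # integer GiB, rounded down
    if   [ "$gib" -ge 12 ]; then echo sup
    elif [ "$gib" -ge 8 ];  then echo hac
    elif [ "$gib" -ge 4 ];  then echo fast
    else return 1
    fi
}
```

The total memory in MiB can be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1`. Cards that report slightly under a round figure fall into the next lower tier because of the rounding.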
**Out of memory:**
```bash
dorado basecaller sup $INPUT_DIR --batchsize 32 > calls.bam
```
**Slow CPU basecalling:**
```bash
dorado basecaller fast $INPUT_DIR --device cpu > calls.bam
```
**Resume interrupted run:**
```bash
dorado basecaller sup $INPUT_DIR --resume-from calls.bam > calls_complete.bam
```
After basecalling, you typically proceed to:
1. **long-read-alignment** - Align reads to reference genome
2. **long-read-qc** - Additional QC and read statistics
3. **structural-variants** - Detect structural variations
4. **medaka-polishing** - Polish assemblies using basecalled reads