Navigate and manage file-based OCR pipeline on Nibi cluster with OLMoCR batch processing
This skill is an operational guide to the file-based OCR pipeline on the Nibi HPC cluster. Use it to monitor, troubleshoot, and operate the Archive.org document processing pipeline and its OLMoCR batch jobs.
The system uses a **file-based state machine** (no SQLite) with 5 stages:
1. **01_downloaded/** - PDFs + metadata from Archive.org
2. **02_ocr_pending/** - Symlinks ready for OCR
3. **03_ocr_processing/** - Active batch directories
4. **04_ocr_completed/** - Split OCR results (per-page JSON)
5. **05_processed/** - Final processed documents
Three concurrent processes: Downloader → Dispatcher → Cleanup Worker
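Because state lives in the directory tree rather than in a database, a document's stage can be inferred from where its files sit. The following is a minimal sketch, not part of the pipeline: `doc_stage` is a hypothetical helper, and it assumes files are named `<id>.pdf`, `<id>.jsonl`, and `<id>.json` after a document identifier and that batch directories hold PDFs directly. Only the extensions are taken from the counting commands in this guide.

```bash
# Hypothetical helper: print the furthest stage a document has reached,
# judged only by which stage directory contains a file for it.
# Assumption: files are named <id>.<ext>; batch dirs contain <id>.pdf.
doc_stage() {
  local base="$1" id="$2" f
  if [ -e "$base/05_processed/$id.json" ]; then echo "05_processed"; return; fi
  if [ -e "$base/04_ocr_completed/$id.jsonl" ]; then echo "04_ocr_completed"; return; fi
  # An unmatched glob stays literal, so both tests below simply fail.
  for f in "$base"/03_ocr_processing/batch_*/"$id.pdf"; do
    if [ -e "$f" ] || [ -L "$f" ]; then echo "03_ocr_processing"; return; fi
  done
  # Pending entries are symlinks, so test for the link itself.
  if [ -L "$base/02_ocr_pending/$id.pdf" ]; then echo "02_ocr_pending"; return; fi
  if [ -e "$base/01_downloaded/$id.pdf" ]; then echo "01_downloaded"; return; fi
  echo "unknown"
}
```

Calling `doc_stage "$BASE" some_identifier` prints a single stage name, or `unknown` if no matching file is found.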
```bash
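# Connect to the cluster
ssh nibi
# List running SSH processes to spot stale sessions
ps aux | grep '[s]sh'
# Kill a stale session (replace [PID] with a PID from the list above)
kill [PID]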
```
```bash
BASE="/home/jic823/projects/def-jic823/caribbean_pipeline"
# Count the entries in each pipeline stage
echo "Downloaded: $(ls "$BASE"/01_downloaded/*.pdf 2>/dev/null | wc -l)"
echo "Pending OCR: $(ls "$BASE"/02_ocr_pending/*.pdf 2>/dev/null | wc -l)"
echo "Processing: $(ls -d "$BASE"/03_ocr_processing/batch_* 2>/dev/null | wc -l)"
echo "Completed: $(ls "$BASE"/04_ocr_completed/*.jsonl 2>/dev/null | wc -l)"
echo "Final: $(ls "$BASE"/05_processed/*.json 2>/dev/null | wc -l)"
```
```bash
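# List your queued and running Slurm jobs
squeue -u jic823
# Same list in long format
squeue -u jic823 -l
# Show the five most recent Slurm output files
ls -lth ~/projects/def-jic823/slurm-*.out | head -5
# Follow the output of one job (replace [JOBID])
tail -f ~/projects/def-jic823/slurm-[JOBID].out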
```
```bash
# Show download progress
jq '.' "$BASE/_manifests/download_progress.json"
# Summarize each batch
jq '.batches[] | {batch_id, total_pdfs, status}' "$BASE/_manifests/batches.json"
# Count recorded errors
find "$BASE/99_errors" -name "*.error.json" 2>/dev/null | wc -l
```
**CRITICAL:** Before submitting any OLMoCR job, read:
```
docs/OLMOCR_BEST_PRACTICES.md
```
This prevents silent failures where jobs process 1 PDF instead of hundreds.
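One generic guard against this failure mode is to count the inputs before submitting and refuse to continue if the count looks wrong. This is independent of whatever the best-practices document prescribes. The sketch below is an assumption-laden illustration: `check_batch_size` is a hypothetical helper, and it assumes the PDFs for a batch sit somewhere under the batch directory.

```bash
# Hypothetical guard: report how many PDFs sit under a batch directory and
# fail (non-zero exit) if there are fewer than the expected minimum.
# Assumption: a batch's PDFs (or symlinks to them) live under batch_dir.
check_batch_size() {
  local batch_dir="$1" min_pdfs="${2:-2}" count
  # -name matches regular files and symlinks alike; tr strips wc padding.
  count=$(find "$batch_dir" -name '*.pdf' 2>/dev/null | wc -l | tr -d ' ')
  echo "Found $count PDF(s) in $batch_dir"
  [ "$count" -ge "$min_pdfs" ]
}
```

For example, `check_batch_size "$BASE/03_ocr_processing/batch_0001" 100 || echo "Batch looks too small, do not submit"` flags a batch that would process far fewer PDFs than intended.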
```bash
# Check for running OLMoCR jobs
squeue -u jic823 | grep olmocr
# Read the logs of one batch (batch_0001 used as an example)
cat "$BASE"/03_ocr_processing/batch_0001/logs/*.log
```
```bash
# Update the pipeline code
cd ~/projects/def-jic823/archive-olm-pipeline
git pull
# Submit the pipeline job with the Caribbean config
CONFIG_FILE=config/caribbean_filebased.yaml sbatch streaming/run_filebased_pipeline.sh
```
**Dispatcher not creating batches:**
```bash
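# Count the PDFs waiting for OCR
ls "$BASE"/02_ocr_pending/*.pdf 2>/dev/null | wc -l
# Show the dispatcher's latest log lines (replace [JOBID])
grep "Dispatcher" ~/projects/def-jic823/slurm-[JOBID].out | tail -20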
```
**Download stalled:**
```bash
# Check how far the downloader has progressed
jq '.current_index, .total_downloaded' "$BASE/_manifests/download_progress.json"
# Check free disk space on the project filesystem
df -h /home/jic823/projects/def-jic823
```
**Broken symlinks:**
```bash
# Delete symlinks whose targets no longer exist (-xtype is GNU find)
find "$BASE/02_ocr_pending" -xtype l -delete
```
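Since `-delete` is irreversible, it can help to list the dangling links first. A minimal sketch, assuming GNU `find` (which provides `-xtype`); `list_broken_symlinks` is a hypothetical helper name:

```bash
# Hypothetical helper: print dangling symlinks directly inside a directory
# without deleting anything. Requires GNU find for -xtype.
list_broken_symlinks() {
  find "$1" -maxdepth 1 -xtype l -print
}
```

Run `list_broken_symlinks "$BASE/02_ocr_pending"` and review the output before running the delete command.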
**Healthy pipeline shows:**
**Warning signs:**
```bash
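# Cancel all of your Slurm jobs
scancel -u jic823
# Kill lingering SSH ControlMaster sessions to nibi
pkill -f "ssh.*nibi.*ControlMaster"
# DANGER: permanently deletes the whole pipeline directory and all its data
rm -rf ~/projects/def-jic823/caribbean_pipeline
# Re-run the setup script
./setup_filebased_pipeline.sh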
```
**Pipeline config:** `config/caribbean_filebased.yaml`
1. **Always read** `docs/OLMOCR_BEST_PRACTICES.md` before submitting OCR jobs
2. The pipeline is fully resumable, so it is safe to restart at any time
3. All state is tracked in `_manifests/` JSON files, not SQLite
4. The dispatcher triggers on PDF count (200), not page count
5. If enabled, the cleanup worker automatically deletes PDFs to save space
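The count-based trigger in note 4 can be pictured as a simple threshold check on the pending directory. This is a sketch of the idea only, not the dispatcher's actual code: `ready_to_dispatch` is a hypothetical helper, and treating 200 as a minimum number of pending PDFs is an assumption.

```bash
# Hypothetical helper: succeed when the pending directory holds at least
# THRESHOLD entries named *.pdf. It counts files (including symlinks),
# never pages. Assumption: 200 is a minimum count of pending PDFs.
ready_to_dispatch() {
  local pending_dir="$1" threshold="${2:-200}" count
  count=$(find "$pending_dir" -maxdepth 1 -name '*.pdf' 2>/dev/null | wc -l | tr -d ' ')
  [ "$count" -ge "$threshold" ]
}
```

For example, `ready_to_dispatch "$BASE/02_ocr_pending" && echo "enough PDFs for a batch"` prints the message only once the threshold is reached.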