Nibi Cluster Archive Pipeline

A comprehensive guide for navigating and managing the file-based OCR pipeline on the Nibi HPC cluster. This skill helps you monitor, troubleshoot, and operate the Archive.org document processing pipeline with OLMoCR batch jobs.

What This Skill Does

This skill provides operational guidance for:

SSH access and cluster navigation

Job monitoring and management via SLURM

File-based pipeline architecture understanding

Progress tracking across 5-stage processing flow

Troubleshooting common issues

Emergency recovery procedures

Pipeline Architecture

The system uses a **file-based state machine** (no SQLite) with 5 stages:

1. **01_downloaded/** - PDFs + metadata from Archive.org

2. **02_ocr_pending/** - Symlinks ready for OCR

3. **03_ocr_processing/** - Active batch directories

4. **04_ocr_completed/** - Split OCR results (per-page JSON)

5. **05_processed/** - Final processed documents

Three concurrent processes: Downloader → Dispatcher → Cleanup Worker

Instructions

1. Initial Connection

```bash
# Connect from WSL
ssh nibi

# If SSH hangs, check for a stale multiplexer
ps aux | grep ssh
# Replace [PID] with the process ID of the stale ssh process
kill [PID]
```

2. Check Pipeline Status

```bash
BASE="/home/jic823/projects/def-jic823/caribbean_pipeline"

# Quick counts
echo "Downloaded: $(ls "$BASE"/01_downloaded/*.pdf 2>/dev/null | wc -l)"
echo "Pending OCR: $(ls "$BASE"/02_ocr_pending/*.pdf 2>/dev/null | wc -l)"
echo "Processing: $(ls -d "$BASE"/03_ocr_processing/batch_* 2>/dev/null | wc -l)"
echo "Completed: $(ls "$BASE"/04_ocr_completed/*.jsonl 2>/dev/null | wc -l)"
echo "Final: $(ls "$BASE"/05_processed/*.json 2>/dev/null | wc -l)"
```
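
The quick counts above rely on shell globs, which may fail with "Argument list too long" once a stage holds tens of thousands of files. The following is a minimal sketch of a `find`-based alternative. The function names `count_files` and `stage_report` are hypothetical helpers, not part of the pipeline repository; the directory names and file patterns are copied from the quick counts.

```bash
# count_files DIR PATTERN
# Prints the number of entries directly inside DIR whose name matches PATTERN.
# Uses find instead of a glob, so the count does not depend on command-line length.
# A missing directory counts as 0.
count_files() {
    find "$1" -maxdepth 1 -name "$2" 2>/dev/null | wc -l | tr -d ' '
}

# stage_report BASE
# Prints one line per pipeline stage, using the same patterns as the quick counts.
stage_report() {
    echo "Downloaded: $(count_files "$1/01_downloaded" '*.pdf')"
    echo "Pending OCR: $(count_files "$1/02_ocr_pending" '*.pdf')"
    echo "Processing: $(count_files "$1/03_ocr_processing" 'batch_*')"
    echo "Completed: $(count_files "$1/04_ocr_completed" '*.jsonl')"
    echo "Final: $(count_files "$1/05_processed" '*.json')"
}

# Usage on the cluster (after setting BASE as above):
#   stage_report "$BASE"
```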

3. Monitor Active Jobs

```bash
# List running jobs
squeue -u jic823

# Detailed job info
squeue -u jic823 -l

# Find and tail SLURM output (replace [JOBID] with the job ID)
ls -lth ~/projects/def-jic823/slurm-*.out | head -5
tail -f ~/projects/def-jic823/slurm-[JOBID].out
```
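
To avoid copying the job ID by hand, the newest log can be selected automatically. This is a small sketch; `latest_slurm_log` is a hypothetical helper name, and it assumes the most recently modified `slurm-*.out` file is the one you want, which may not hold if an OCR batch job wrote to its log more recently than the main pipeline job.

```bash
# latest_slurm_log DIR
# Prints the path of the most recently modified slurm-*.out file in DIR,
# or nothing if there is none.
latest_slurm_log() {
    ls -t "$1"/slurm-*.out 2>/dev/null | head -n 1
}

# Usage on the cluster:
#   tail -f "$(latest_slurm_log ~/projects/def-jic823)"
```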

4. Check Progress Manifests

```bash
# Download progress
jq '.' "$BASE/_manifests/download_progress.json"

# OCR batch status
jq '.batches[] | {batch_id, total_pdfs, status}' "$BASE/_manifests/batches.json"

# Error count
find "$BASE/99_errors" -name "*.error.json" 2>/dev/null | wc -l
```
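
When `batches.json` grows long, a per-status summary is easier to read than the full listing. The sketch below assumes only the fields already used above (`batches`, `status`, `total_pdfs`); the helper name `batch_status_summary` is hypothetical, and the status values that appear depend on what the pipeline actually writes.

```bash
# batch_status_summary FILE
# Prints one tab-separated line per status: status, number of batches, total PDFs.
# Requires jq. Statuses are listed in sorted order (jq's group_by sorts by key).
batch_status_summary() {
    jq -r '.batches
           | group_by(.status)[]
           | "\(.[0].status)\t\(length)\t\(map(.total_pdfs) | add)"' "$1"
}

# Usage on the cluster:
#   batch_status_summary "$BASE/_manifests/batches.json"
```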

5. Verify OLMoCR Jobs

**CRITICAL:** Before submitting any OLMoCR job, read:

```
docs/OLMOCR_BEST_PRACTICES.md
```

This prevents silent failures where jobs process 1 PDF instead of hundreds.

```bash
# Check OCR jobs
squeue -u jic823 | grep olmocr

# Verify batch output (replace batch_0001 with the batch you are checking)
cat "$BASE"/03_ocr_processing/batch_0001/logs/*.log
```

6. Restart Pipeline After Failure

```bash
cd ~/projects/def-jic823/archive-olm-pipeline
git pull

# Fully resumable - picks up from _manifests/download_progress.json
CONFIG_FILE=config/caribbean_filebased.yaml sbatch streaming/run_filebased_pipeline.sh
```

7. Troubleshooting

**Dispatcher not creating batches:**

```bash
# Check pending count (triggers at 200 PDFs by default)
ls "$BASE"/02_ocr_pending/*.pdf | wc -l

# Look for dispatcher log entries (replace [JOBID] with the job ID)
grep "Dispatcher" ~/projects/def-jic823/slurm-[JOBID].out | tail -20
```

**Download stalled:**

```bash
# Check progress
jq '.current_index, .total_downloaded' "$BASE/_manifests/download_progress.json"

# Check disk space (pauses at >90%)
df -h /home/jic823/projects/def-jic823
```
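
The 90% pause threshold can be checked numerically rather than by reading `df -h` output. This is a sketch under the assumption that the usage reported by `df` for the project path is the figure that matters; the helper names `disk_usage_pct` and `disk_over_threshold` are hypothetical.

```bash
# disk_usage_pct PATH
# Prints the used percentage (digits only) of the filesystem containing PATH.
# df -P gives the portable single-line-per-filesystem format; column 5 is Capacity.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_over_threshold PATH [LIMIT]
# Succeeds (exit 0) when usage is strictly above LIMIT percent (default 90).
disk_over_threshold() {
    [ "$(disk_usage_pct "$1")" -gt "${2:-90}" ]
}

# Usage on the cluster:
#   disk_over_threshold /home/jic823/projects/def-jic823 && echo "Disk above 90%"
```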

**Broken symlinks:**

```bash
# Delete symlinks whose target no longer exists
find "$BASE/02_ocr_pending" -xtype l -delete
```
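
Because `-delete` is irreversible, it can help to list the broken symlinks first. A minimal sketch, using the same GNU `find` test (`-xtype l`) as the command above; `list_broken_symlinks` is a hypothetical helper name.

```bash
# list_broken_symlinks DIR
# Prints every symlink under DIR whose target does not exist. Deletes nothing.
list_broken_symlinks() {
    find "$1" -xtype l -print
}

# Usage on the cluster (dry run before deleting):
#   list_broken_symlinks "$BASE/02_ocr_pending"
```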

8. Health Indicators

**Healthy pipeline shows:**

✅ PDFs accumulating in 01_downloaded/

✅ Symlinks created in 02_ocr_pending/

✅ Batches submitted at 200 PDF threshold

✅ OCR jobs in squeue

✅ Results in 04_ocr_completed/

**Warning signs:**

⚠️ No new PDFs for more than 10 minutes

⚠️ 200+ PDFs pending but no batch created for more than 5 minutes

⚠️ Growing error count in 99_errors/

⚠️ Disk >90% full
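
The "no new PDFs" warning sign can be checked from file modification times. The sketch below assumes a PDF's modification time reflects when it was downloaded, which may not hold if the downloader preserves remote timestamps; `recent_file_count` is a hypothetical helper name.

```bash
# recent_file_count DIR PATTERN MINUTES
# Prints how many entries directly inside DIR match PATTERN and were
# modified less than MINUTES minutes ago.
recent_file_count() {
    find "$1" -maxdepth 1 -name "$2" -mmin "-$3" 2>/dev/null | wc -l | tr -d ' '
}

# Usage on the cluster: a result of 0 matches the 10-minute warning sign.
#   recent_file_count "$BASE/01_downloaded" '*.pdf' 10
```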

9. Emergency Commands

```bash
# Cancel all jobs
scancel -u jic823

# Kill hung SSH multiplexer
pkill -f "ssh.*nibi.*ControlMaster"

# Nuclear reset (DELETES ALL PROGRESS)
rm -rf ~/projects/def-jic823/caribbean_pipeline
./setup_filebased_pipeline.sh
```
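
Since the nuclear reset removes the `_manifests/` directory along with everything else, you may want a copy of the manifests for reference first. This is only a sketch: `backup_manifests` and the backup destination are hypothetical, and nothing here implies that restoring an archived manifest into a fresh pipeline directory is supported.

```bash
# backup_manifests BASE DEST
# Archives BASE/_manifests into a timestamped tar.gz under DEST and prints
# the archive path. The body runs in a subshell, so its variables stay local.
backup_manifests() (
    base=$1
    dest=$2
    archive="$dest/manifests_$(date +%Y%m%d_%H%M%S).tar.gz"
    mkdir -p "$dest" &&
        tar -czf "$archive" -C "$base" _manifests &&
        echo "$archive"
)

# Usage on the cluster, before the reset (destination path is an example only):
#   backup_manifests ~/projects/def-jic823/caribbean_pipeline ~/manifest_backups
```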

Key Configuration

**Pipeline config:** `config/caribbean_filebased.yaml`

`pdfs_per_batch: 200` - Trigger batch every 200 PDFs

`auto_delete_pdfs: true` - Save space after OCR

`check_interval: 60` - Poll frequency in seconds
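
To confirm what the cluster copy of the config actually contains, the three keys can be pulled out with `grep`. This sketch assumes each key sits on its own line in the YAML file, at any indentation level; `show_key_settings` is a hypothetical helper name.

```bash
# show_key_settings FILE
# Prints the lines of FILE that set pdfs_per_batch, auto_delete_pdfs,
# or check_interval, regardless of indentation.
show_key_settings() {
    grep -E '^[[:space:]]*(pdfs_per_batch|auto_delete_pdfs|check_interval):' "$1"
}

# Usage from the repository root on the cluster:
#   show_key_settings config/caribbean_filebased.yaml
```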

Repository Locations

**Local (WSL):** `/home/jic823/archive-olm-pipeline`

**Cluster pipeline:** `~/projects/def-jic823/archive-olm-pipeline`

**Pipeline data:** `~/projects/def-jic823/caribbean_pipeline`

**OLMoCR:** `~/projects/def-jic823/olmocr/`

Performance Expectations

**Download:** ~5-10 PDFs/minute

**Batch size:** 200 PDFs per OCR job

**Batch time:** ~1-2 hours (varies by page count)

**100K PDFs:** ~2-3 weeks end-to-end
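
The figures above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers stated in this section; `batches_needed` and `download_days` are hypothetical helper names. Downloading alone comes to roughly 7 to 14 days, so the 2-3 week estimate appears plausible provided OCR batches run while downloads continue.

```bash
# batches_needed TOTAL_PDFS PDFS_PER_BATCH
# Prints the number of OCR batches, rounding up for a partial final batch.
batches_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# download_days TOTAL_PDFS PDFS_PER_MINUTE
# Prints the download time in days (one decimal) at a constant rate.
download_days() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / (r * 60 * 24) }'
}

# batches_needed 100000 200   -> 500
# download_days 100000 10     -> 6.9
# download_days 100000 5      -> 13.9
```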

Constraints

File-based architecture (SQLite abandoned due to NFS corruption at scale)

Fully resumable via JSON manifests

Git workflow: develop locally → push to GitLab → pull on cluster → submit

NFS limitations: no file locking, avoid databases

Space management: auto-delete PDFs after OCR if configured

Important Notes

1. **Always read** `docs/OLMOCR_BEST_PRACTICES.md` before submitting OCR jobs

2. Pipeline is fully resumable - safe to restart at any time

3. All state tracked in `_manifests/` JSON files, not SQLite

4. Dispatcher triggers based on PDF count (200), not page count

5. Cleanup worker auto-deletes PDFs to save space if enabled
