# Dumbo LLM Trainer Assistant
Expert assistant for working with Dumbo, a modular training framework built on top of Transformers for fine-tuning language models.
## What This Skill Does
This skill provides deep knowledge of the Dumbo LLM training framework's architecture, plugin system, configuration patterns, and development workflows. It helps you:
- Configure training runs with YAML files
- Develop custom plugins for models, tokenizers, datasets, and trainers
- Understand the plugin loading pipeline and execution order
- Debug training issues and optimize configurations
- Integrate metrics collection and logging
- Work with the modular plugin architecture

## Instructions
When assisting with Dumbo-related tasks, follow these guidelines:
### 1. Understand the Context
Before making changes, review:
- The plugin-based architecture, where all functionality comes from plugins in `src/dumbo/plugins/`
- The execution pipeline: Model → Tokenizer → Dataset → Training → Output
- The YAML configuration structure that defines the entire training setup
- The metrics system with abstract collectors and a registry pattern

### 2. Configuration Tasks
When creating or modifying training configurations:
- Use the YAML structure with sections: `model`, `datasets`, `trainer`, `plugins`
- Ensure required plugins are listed in the `plugins` array
- Follow the configuration examples in the `examples/` directory
- Remember that tokenizer special tokens must be configured explicitly
- Use plugin-specific config keys (e.g., `liger`, `peft`) under the `model` section

Example configuration pattern:
```yaml
model:
  base_model: org/model-name
  tokenizer:
    pad_token: "<|pad|>"
  plugin_config:
    setting: value

datasets:
  - path: dataset/path
    type: loader_type
    train_format:
      type: formatter_type

trainer:
  arguments:
    batch_size: 16

plugins:
  - required_plugin_1
  - required_plugin_2
```
### 3. Plugin Development
When creating or modifying plugins:
- All plugins inherit from `BasePlugin` in `src/dumbo/plugin_loader.py`
- Implement the appropriate interface: `ModelLoaderPlugin`, `TokenizerLoaderPlugin`, `ModelPatcherPlugin`, or `LoggingPlugin`
- Register plugins by placing them in the `src/dumbo/plugins/` directory
- Understand the loading order: Model → Tokenizer → Patches → Datasets → Trainer
- For metrics collection, implement a `get_metrics_collector()` method that returns a `MetricsCollector` instance

Plugin interface examples:
- `ModelLoaderPlugin`: implement `load_model(config)`
- `TokenizerLoaderPlugin`: implement `load_tokenizer(config, model=None)`; receives the model for embedding resizing
- `ModelPatcherPlugin`: implement `patch_model(model, config)` (see the sketch below)
- Metrics: implement `get_metrics_collector()` returning a collector instance
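As a concrete illustration, here is a minimal patcher-plugin sketch. It assumes `ModelPatcherPlugin` lives in `dumbo.plugin_loader` alongside `BasePlugin` and behaves as described above; the class name and the `gradient_checkpointing` config key are hypothetical, not part of Dumbo:

```python
from dumbo.plugin_loader import BasePlugin, ModelPatcherPlugin


class GradientCheckpointingPlugin(BasePlugin, ModelPatcherPlugin):
    """Hypothetical patcher: enables gradient checkpointing on the loaded model."""

    def patch_model(self, model, config):
        # Read a plugin-specific key from the model section (key name is illustrative).
        if config.get("model", {}).get("gradient_checkpointing", False):
            model.gradient_checkpointing_enable()  # standard Transformers method
        return model
```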
### 4. Running Training

Use these command patterns:
```bash
# Basic training run
uv run dumbo path/to/config.yaml

# Development mode
uv run python -m dumbo path/to/config.yaml

# Install dependencies first
uv sync
```
### 5. Key Files Reference
When investigating issues or making changes:
- `src/dumbo/__init__.py`: main orchestration and entry point
- `src/dumbo/plugin_loader.py`: plugin system base classes and loading logic
- `src/dumbo/metrics.py`: abstract metrics collection system
- `src/dumbo/plugins/transformers.py`: core model/tokenizer loading with embedding resizing
- `src/dumbo/plugins/transformers_trainer.py`: training setup and execution
- `src/dumbo/plugins/liger.py`: Liger kernel optimizations
- `src/dumbo/plugins/wandb.py`: W&B logging with metrics

### 6. Common Patterns
**Embedding Resizing**: Tokenizer loading receives the model reference to support automatic embedding resizing when special tokens are added.
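For reference, the underlying Transformers pattern looks roughly like this (a minimal sketch using plain Transformers APIs, not Dumbo's actual plugin code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Adding a special token grows the vocabulary, so the model's embedding
# matrix must be resized to match the new tokenizer length.
num_added = tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```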
**Plugin Loading Order**: Critical for proper initialization: the model must load before the tokenizer, patches apply after both are loaded, and the trainer is created last.
**Configuration Inheritance**: Plugins read their config from dedicated sections (e.g., `model.liger` for Liger plugin, `model.peft` for PEFT/LoRA).
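As an illustration, plugin-specific sections sit side by side under `model`; the `peft` keys below are common LoRA hyperparameters shown for illustration, not confirmed Dumbo keys:

```yaml
model:
  base_model: org/model-name
  liger:    # read only by the liger plugin
    rope: true
  peft:     # read only by the PEFT/LoRA plugin; keys are illustrative
    r: 16
    lora_alpha: 32
```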
**Metrics Collection**: Abstract system allows multiple collectors to be registered and used throughout training without tight coupling.
### 7. Troubleshooting
When debugging issues:
- Check the plugin loading order if initialization fails
- Verify all required plugins are listed in the config's `plugins` array
- Ensure special tokens are properly configured in the `model.tokenizer` section
- Review that plugin-specific config sections match plugin expectations
- Check that the model and tokenizer are compatible
- Verify the dataset formatter matches the expected data structure

## Examples
### Example 1: Create a new training configuration
```yaml
model:
  base_model: HuggingFaceTB/SmolLM2-135M
  tokenizer:
    pad_token: "<|pad|>"
    eos_token: "<|im_end|>"
  liger:
    rope: true
    cross_entropy: false

datasets:
  - path: tatsu-lab/alpaca
    type: huggingface_polars
    data_format: alpaca
    train_format:
      type: jinja_messages
      template: "{% for message in messages %}{{ message.content }}{% endfor %}"

trainer:
  arguments:
    batch_size: 16
    physical_batch_size: 1
    learning_rate: 1e-4
    num_train_epochs: 3

plugins:
  - transformers
  - transformers_trainer
  - liger
  - polars
  - jinja_formatter
```
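To launch this run after saving the configuration to a file (the filename below is arbitrary):

```bash
uv sync                      # install dependencies first
uv run dumbo my_config.yaml  # start training
```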
### Example 2: Develop a custom metrics collector plugin
```python
from dumbo.plugin_loader import BasePlugin
from dumbo.metrics import MetricsCollector


class CustomMetricsPlugin(BasePlugin):
    """Plugin hook: Dumbo calls get_metrics_collector() to register the collector."""

    def get_metrics_collector(self) -> MetricsCollector:
        return CustomMetricsCollector()


class CustomMetricsCollector(MetricsCollector):
    def log_metrics(self, metrics: dict, step: int):
        # Custom metrics logging logic
        pass
```
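To activate the collector, place this module in `src/dumbo/plugins/` and list its plugin name in the config's `plugins` array, per the registration rules in the Plugin Development section (the name below is hypothetical):

```yaml
plugins:
  - custom_metrics   # hypothetical name for the module above
```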
## Important Notes
- The plugin system is the core architectural pattern: all functionality comes from plugins
- Configuration files use YAML format and define the complete training pipeline
- Plugin loading order matters: Model → Tokenizer → Patches → Datasets → Trainer
- Tokenizer loading receives the model reference for embedding resizing support
- The metrics system uses abstract collectors registered via plugin hooks
- Always list required plugins explicitly in the config's `plugins` array