# LightGBM EMBER2024 Antivirus Model Development
Development guidelines for the golden-scar repository, a testing suite for building LightGBM antivirus detection models using the EMBER2024 dataset with culling, drifting, and dynamic ensemble selection techniques.
## Repository Context
This repository implements an antivirus detection system with three core components:

- **Culling scripts**: Find the most representative subsample of the full EMBER2024 dataset
- **Drifting**: Use real-world data to analyze concept drift and create specialized "expert" models
- **Routing**: Implement dynamic ensemble selection over the expert models using a DSEL validation set and the KNORA-U algorithm

## Architecture Overview
### Core Technology Stack
- **Model Framework**: LightGBM for all model training
- **Hyperparameter Optimization**: Random search with cross-validation
- **Primary Dataset**: EMBER2024 (~48 GB, 4.68 million samples, 2500 features)
- **Clustering Algorithms**: HDBSCAN and DBSCAN for data analysis

### Environment Setup
- **Python Environment**: Virtual environment located at `~/golden-scar/.venv/`
- **EMBER2024 Package**: thrember package located at `~/golden-scar/EMBER2024/`
- **Execution Pattern**: Always activate the virtual environment before running Python:

```bash
source ~/golden-scar/.venv/bin/activate && python3 <script>
```
## Development Workflow
### Pre-Task Planning Requirements
Before starting any task, you MUST complete these steps:
1. **Provide a Full Plan**: Outline all changes you intend to make with clear steps
2. **List Behavioral Changes**: Document what behaviors will change as a result of your modifications
This ensures thoughtful, deliberate development and prevents unnecessary complexity.
### Code Reuse Strategy
Before writing new code, always:
- Check if existing code can be reused or reconfigured
- Look for opportunities to extend existing functions rather than creating new ones
- Consider whether configuration changes can achieve the desired result

## Coding Guidelines
### Core Principles
1. **Simplicity Over Comprehensiveness**: Focus on precise, simple solutions rather than exhaustive implementations
2. **DRY Principle**: Avoid code duplication; extract shared logic into reusable functions
3. **Test-Driven Bug Fixes**: When fixing bugs, always write a failing test first to verify the issue
4. **Shared Constants**: Replace hard-coded numbers with named constants defined in a shared location
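A minimal sketch of the shared-constants principle. The module path, names, and values below are hypothetical, not actual repository code:

```python
# Hypothetical shared constants module (e.g. golden_scar/constants.py).
# Scripts import these names instead of repeating magic numbers.
RANDOM_SEED = 42      # seed for clustering, CV splits, and random search
N_FEATURES = 2500     # EMBER2024 feature-vector dimensionality
DEFAULT_CV_FOLDS = 5  # default k for cross-validation

# Usage elsewhere (illustrative):
#   from golden_scar.constants import RANDOM_SEED
#   model = lgb.LGBMClassifier(random_state=RANDOM_SEED)
```

Centralizing these values means a seed or fold-count change happens in one place and every script picks it up.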
### Documentation Standards
- **No Unsolicited Markdown**: Do not create markdown files explaining changes unless explicitly requested
- **No Emojis**: Keep all code and documentation emoji-free for professional consistency
- **Code Comments**: Use clear, concise comments where logic is non-obvious

### Python-Specific Guidelines
- **Environment Activation**: Always use the virtual environment at `~/golden-scar/.venv/`
- **EMBER2024 Integration**: Use the thrember package for all EMBER2024 dataset operations
- **Import Organization**: Group imports logically (standard library, third-party, local)

## Dataset Handling
### EMBER2024 Characteristics
- **Size**: ~48 GB
- **Samples**: 4.68 million malware/benign samples
- **Features**: 2500-dimensional feature vectors
- **Package**: Use thrember for loading and processing

### Clustering Approaches
- **HDBSCAN**: Hierarchical density-based clustering
- **DBSCAN**: Density-based spatial clustering with noise handling

Use these algorithms for finding representative subsamples and analyzing data distributions.
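As an illustration of the culling idea, here is a minimal sketch that clusters with scikit-learn's DBSCAN and keeps one representative per cluster (the point closest to the cluster mean). The synthetic 2-D data stands in for EMBER2024 feature vectors, and `eps`/`min_samples` are arbitrary, not tuned repository values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

RANDOM_SEED = 42  # hypothetical shared constant

# Synthetic stand-in for a slice of EMBER2024 feature vectors.
X, _ = make_blobs(n_samples=300, centers=4, n_features=2,
                  cluster_std=0.6, random_state=RANDOM_SEED)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Keep one representative per cluster: the member closest to the cluster mean.
# Noise points (label -1) are dropped entirely.
representatives = []
for label in sorted(set(labels) - {-1}):
    members = np.where(labels == label)[0]
    centroid = X[members].mean(axis=0)
    closest = members[np.argmin(np.linalg.norm(X[members] - centroid, axis=1))]
    representatives.append(closest)

subsample = X[representatives]
print(subsample.shape)  # one row per discovered cluster
```

The same keep-the-densest-representative idea scales to picking many points per cluster; HDBSCAN drops the need to choose `eps` at the cost of other hyperparameters.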
## Model Development
### Training Pipeline
1. Load data using thrember package
2. Apply culling to create representative subsample (if needed)
3. Configure LightGBM parameters
4. Run hyperparameter optimization via random search with cross-validation
5. Train final model with optimal parameters
6. Evaluate on holdout set
### Hyperparameter Optimization
- Use random search over predefined parameter spaces
- Apply k-fold cross-validation for robust performance estimates
- Log all trials for reproducibility
- Store optimal parameters in configuration files

### Expert Models and Routing
- Train specialized "expert" models on drifted real-world data
- Implement dynamic selection using a DSEL validation set and the KNORA-U algorithm
- Create routing logic to dynamically select the best expert for each prediction
- Evaluate ensemble performance against single-model baselines

## Example Workflow
```bash
# Activate environment and run culling script
source ~/golden-scar/.venv/bin/activate && python3 scripts/cull_dataset.py --method hdbscan --sample-size 100000
# Train base model with hyperparameter search
source ~/golden-scar/.venv/bin/activate && python3 scripts/train_model.py --optimize --cv-folds 5
# Analyze concept drift on real-world samples
source ~/golden-scar/.venv/bin/activate && python3 scripts/analyze_drift.py --input real_world_samples.csv
# Train expert models for drifted regions
source ~/golden-scar/.venv/bin/activate && python3 scripts/train_experts.py --drift-clusters drift_output/
# Create dynamic ensemble with routing
source ~/golden-scar/.venv/bin/activate && python3 scripts/create_ensemble.py --method knora-u --experts models/experts/
```
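The routing step above can be sketched as a minimal hand-rolled KNORA-U: each classifier in the pool votes once for every one of a query's k nearest DSEL neighbors it classifies correctly, so competent-in-the-region classifiers dominate the vote. This is an illustrative stand-in on synthetic data with a bagged-tree pool; the repository's actual experts, DSEL split, and k will differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

RANDOM_SEED = 42  # hypothetical shared constant

X, y = make_classification(n_samples=500, n_features=10, random_state=RANDOM_SEED)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=RANDOM_SEED)
X_dsel, X_test, y_dsel, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=RANDOM_SEED)

# Stand-in "expert" pool: shallow trees on bootstrap samples of the train split.
rng = np.random.default_rng(RANDOM_SEED)
pool = []
for _ in range(10):
    idx = rng.integers(0, len(X_train), len(X_train))
    pool.append(DecisionTreeClassifier(max_depth=3, random_state=RANDOM_SEED)
                .fit(X_train[idx], y_train[idx]))

knn = NearestNeighbors(n_neighbors=7).fit(X_dsel)
dsel_preds = np.array([clf.predict(X_dsel) for clf in pool])  # (n_experts, n_dsel)

def knora_u_predict(x):
    """KNORA-U: weight each expert by its correct count on the query's DSEL neighbors."""
    neighbors = knn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
    votes = np.zeros(2)
    for clf_idx, clf in enumerate(pool):
        weight = np.sum(dsel_preds[clf_idx, neighbors] == y_dsel[neighbors])
        votes[clf.predict(x.reshape(1, -1))[0]] += weight
    if votes.sum() == 0:  # no expert competent in the region: plain majority vote
        for clf in pool:
            votes[clf.predict(x.reshape(1, -1))[0]] += 1
    return int(np.argmax(votes))

accuracy = np.mean([knora_u_predict(x) == t for x, t in zip(X_test, y_test)])
print(round(accuracy, 3))
```

Libraries such as DESlib package this same algorithm (`KNORAU`) behind a fit/predict interface, which may be preferable to a hand-rolled version in production.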
## Key Constraints
- **Environment Isolation**: Never run Python scripts outside the virtual environment
- **Dataset Location**: Always reference EMBER2024 data through the thrember package
- **Resource Awareness**: Be mindful of the ~48 GB dataset size when loading into memory
- **Reproducibility**: Set random seeds for all stochastic operations (clustering, cross-validation, random search)

## Testing Strategy
- Write unit tests for data processing functions
- Create integration tests for end-to-end pipelines
- Always write failing tests before fixing bugs
- Maintain test coverage for critical paths (culling, training, routing)
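A minimal sketch of unit-testing a data-processing function. `cull` and its signature are hypothetical, not an actual repository function; the point is the shape of the assertions (size cap, bounds, seeded reproducibility):

```python
import numpy as np

def cull(X, max_samples, seed=42):
    """Hypothetical culling helper: seeded uniform subsample, capped at max_samples."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(max_samples, len(X)), replace=False)
    return X[idx]

# Unit tests in the test-first spirit: written before the fix or feature.
X = np.arange(20).reshape(10, 2)
subsample = cull(X, max_samples=4)
assert subsample.shape == (4, 2)
assert len(cull(X, max_samples=50)) == 10      # never exceeds available rows
assert np.array_equal(cull(X, 4), cull(X, 4))  # same seed -> reproducible
print("ok")
```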