# LightGBM EMBER2024 Antivirus Model Development
Development guidelines for the golden-scar repository, a testing suite for building LightGBM antivirus detection models using the EMBER2024 dataset with culling, drifting, and dynamic ensemble selection techniques.
## Repository Context
This repository implements an antivirus detection system with three core components:

- **Culling scripts**: Find the most representative subsample of the full EMBER2024 dataset
- **Drifting**: Use real-world data to analyze concept drift and create specialized "expert" models
- **Routing**: Implement dynamic ensemble selection over the expert models using a DSEL validation set and the KNORA-U algorithm

## Architecture Overview
### Core Technology Stack
- **Model Framework**: LightGBM for all model training
- **Hyperparameter Optimization**: Random search with cross-validation
- **Primary Dataset**: EMBER2024 (~48 GB, 4.68 million samples, 2500 features)
- **Clustering Algorithms**: HDBSCAN and DBSCAN for data analysis

### Environment Setup
- **Python Environment**: Virtual environment located at `~/golden-scar/.venv/`
- **EMBER2024 Package**: thrember package located at `~/golden-scar/EMBER2024/`
- **Execution Pattern**: Always activate the virtual environment before running Python:

```bash
source ~/golden-scar/.venv/bin/activate && python3 <script>
```
## Development Workflow
### Pre-Task Planning Requirements
Before starting any task, you MUST complete these steps:
1. **Provide a Full Plan**: Outline all changes you intend to make with clear steps
2. **List Behavioral Changes**: Document what behaviors will change as a result of your modifications
This ensures thoughtful, deliberate development and prevents unnecessary complexity.
### Code Reuse Strategy
Before writing new code, always:
- Check if existing code can be reused or reconfigured
- Look for opportunities to extend existing functions rather than creating new ones
- Consider whether configuration changes can achieve the desired result

## Coding Guidelines
### Core Principles
1. **Simplicity Over Comprehensiveness**: Focus on precise, simple solutions rather than exhaustive implementations
2. **DRY Principle**: Avoid code duplication; extract shared logic into reusable functions
3. **Test-Driven Bug Fixes**: When fixing bugs, always write a failing test first to verify the issue
4. **Shared Constants**: Replace hard-coded numbers with named constants defined in a shared location
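A minimal sketch of the shared-constants principle. The module path, names, and values below are hypothetical, not actual repository code:

```python
# Hypothetical shared constants module (e.g. golden_scar/constants.py).
# Scripts import these names instead of repeating magic numbers.
RANDOM_SEED = 42      # seed for clustering, CV splits, and random search
N_FEATURES = 2500     # EMBER2024 feature-vector dimensionality
DEFAULT_CV_FOLDS = 5  # default k for cross-validation

# Usage elsewhere (illustrative):
#   from golden_scar.constants import RANDOM_SEED
#   model = lgb.LGBMClassifier(random_state=RANDOM_SEED)
```

Centralizing these values means a seed or fold-count change happens in one place and every script picks it up.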
### Documentation Standards
- **No Unsolicited Markdown**: Do not create markdown files explaining changes unless explicitly requested
- **No Emojis**: Keep all code and documentation emoji-free for professional consistency
- **Code Comments**: Use clear, concise comments where logic is non-obvious

### Python-Specific Guidelines
- **Environment Activation**: Always use the virtual environment at `~/golden-scar/.venv/`
- **EMBER2024 Integration**: Use the thrember package for all EMBER2024 dataset operations
- **Import Organization**: Group imports logically (standard library, third-party, local)

## Dataset Handling
### EMBER2024 Characteristics
- **Size**: ~48 GB
- **Samples**: 4.68 million malware/benign samples
- **Features**: 2500-dimensional feature vectors
- **Package**: Use thrember for loading and processing

### Clustering Approaches
- **HDBSCAN**: Hierarchical density-based clustering
- **DBSCAN**: Density-based spatial clustering with noise handling

Use these algorithms for finding representative subsamples and analyzing data distributions.
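As an illustration of the culling idea, here is a minimal sketch that clusters with scikit-learn's DBSCAN and keeps one representative per cluster (the point closest to the cluster mean). The synthetic 2-D data stands in for EMBER2024 feature vectors, and `eps`/`min_samples` are arbitrary, not tuned repository values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

RANDOM_SEED = 42  # hypothetical shared constant

# Synthetic stand-in for a slice of EMBER2024 feature vectors.
X, _ = make_blobs(n_samples=300, centers=4, n_features=2,
                  cluster_std=0.6, random_state=RANDOM_SEED)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Keep one representative per cluster: the member closest to the cluster mean.
# Noise points (label -1) are dropped entirely.
representatives = []
for label in sorted(set(labels) - {-1}):
    members = np.where(labels == label)[0]
    centroid = X[members].mean(axis=0)
    closest = members[np.argmin(np.linalg.norm(X[members] - centroid, axis=1))]
    representatives.append(closest)

subsample = X[representatives]
print(subsample.shape)  # one row per discovered cluster
```

The same keep-the-densest-representative idea scales to picking many points per cluster; HDBSCAN drops the need to choose `eps` at the cost of other hyperparameters.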
## Model Development
### Training Pipeline
1. Load data using thrember package
2. Apply culling to create representative subsample (if needed)
3. Configure LightGBM parameters
4. Run hyperparameter optimization via random search with cross-validation
5. Train final model with optimal parameters
6. Evaluate on holdout set
### Hyperparameter Optimization
- Use random search over predefined parameter spaces
- Apply k-fold cross-validation for robust performance estimates
- Log all trials for reproducibility
- Store optimal parameters in configuration files

### Expert Models and Routing
- Train specialized "expert" models on drifted real-world data
- Implement dynamic selection using a DSEL validation set and the KNORA-U algorithm
- Create routing logic to dynamically select the best expert for each prediction
- Evaluate ensemble performance against single-model baselines

## Example Workflow
```bash
# Activate environment and run culling script
source ~/golden-scar/.venv/bin/activate && python3 scripts/cull_dataset.py --method hdbscan --sample-size 100000
# Train base model with hyperparameter search
source ~/golden-scar/.venv/bin/activate && python3 scripts/train_model.py --optimize --cv-folds 5
# Analyze concept drift on real-world samples
source ~/golden-scar/.venv/bin/activate && python3 scripts/analyze_drift.py --input real_world_samples.csv
# Train expert models for drifted regions
source ~/golden-scar/.venv/bin/activate && python3 scripts/train_experts.py --drift-clusters drift_output/
# Create dynamic ensemble with routing
source ~/golden-scar/.venv/bin/activate && python3 scripts/create_ensemble.py --method knora-u --experts models/experts/
```
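The routing step above can be sketched as a minimal hand-rolled KNORA-U: each classifier in the pool votes once for every one of a query's k nearest DSEL neighbors it classifies correctly, so competent-in-the-region classifiers dominate the vote. This is an illustrative stand-in on synthetic data with a bagged-tree pool; the repository's actual experts, DSEL split, and k will differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

RANDOM_SEED = 42  # hypothetical shared constant

X, y = make_classification(n_samples=500, n_features=10, random_state=RANDOM_SEED)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=RANDOM_SEED)
X_dsel, X_test, y_dsel, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=RANDOM_SEED)

# Stand-in "expert" pool: shallow trees on bootstrap samples of the train split.
rng = np.random.default_rng(RANDOM_SEED)
pool = []
for _ in range(10):
    idx = rng.integers(0, len(X_train), len(X_train))
    pool.append(DecisionTreeClassifier(max_depth=3, random_state=RANDOM_SEED)
                .fit(X_train[idx], y_train[idx]))

knn = NearestNeighbors(n_neighbors=7).fit(X_dsel)
dsel_preds = np.array([clf.predict(X_dsel) for clf in pool])  # (n_experts, n_dsel)

def knora_u_predict(x):
    """KNORA-U: weight each expert by its correct count on the query's DSEL neighbors."""
    neighbors = knn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
    votes = np.zeros(2)
    for clf_idx, clf in enumerate(pool):
        weight = np.sum(dsel_preds[clf_idx, neighbors] == y_dsel[neighbors])
        votes[clf.predict(x.reshape(1, -1))[0]] += weight
    if votes.sum() == 0:  # no expert competent in the region: plain majority vote
        for clf in pool:
            votes[clf.predict(x.reshape(1, -1))[0]] += 1
    return int(np.argmax(votes))

accuracy = np.mean([knora_u_predict(x) == t for x, t in zip(X_test, y_test)])
print(round(accuracy, 3))
```

Libraries such as DESlib package this same algorithm (`KNORAU`) behind a fit/predict interface, which may be preferable to a hand-rolled version in production.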
## Key Constraints
- **Environment Isolation**: Never run Python scripts outside the virtual environment
- **Dataset Location**: Always reference EMBER2024 data through the thrember package
- **Resource Awareness**: Be mindful of the ~48 GB dataset size when loading into memory
- **Reproducibility**: Set random seeds for all stochastic operations (clustering, cross-validation, random search)

## Testing Strategy
- Write unit tests for data processing functions
- Create integration tests for end-to-end pipelines
- Always write failing tests before fixing bugs
- Maintain test coverage for critical paths (culling, training, routing)
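A minimal sketch of unit-testing a data-processing function. `cull` and its signature are hypothetical, not an actual repository function; the point is the shape of the assertions (size cap, bounds, seeded reproducibility):

```python
import numpy as np

def cull(X, max_samples, seed=42):
    """Hypothetical culling helper: seeded uniform subsample, capped at max_samples."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(max_samples, len(X)), replace=False)
    return X[idx]

# Unit tests in the test-first spirit: written before the fix or feature.
X = np.arange(20).reshape(10, 2)
subsample = cull(X, max_samples=4)
assert subsample.shape == (4, 2)
assert len(cull(X, max_samples=50)) == 10      # never exceeds available rows
assert np.array_equal(cull(X, 4), cull(X, 4))  # same seed -> reproducible
print("ok")
```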