Mallorn Challenge Time-Series Classification

An AI assistant skill for the Mallorn Challenge project, which classifies astronomical lightcurve time series using LightGBM.

Purpose

This skill helps you work with a time-series classification project that processes astronomical lightcurve data. The project combines preprocessing, feature engineering, and LightGBM model training to classify astronomical objects.

Project Structure

The project follows this organization:

**`src/`**: Production code for feature engineering, preprocessing, and training

**`data/`**: Raw and processed data (not version controlled)

**`notebooks/`**: Exploratory data analysis, feature analysis, and experiment documentation

**`output/`**: Model artifacts, predictions, and submission files

Key Workflows

1. Feature Engineering

**Main script**: `src/make_features.py`

Aggregates features from grouped lightcurve data

Uses feature modules in `src/features/` subdirectory

Outputs versioned processed datasets: `data/processed/train_final_<version>.csv` and `data/processed/test_final_<version>.csv`

**To generate features**:

```bash
python src/make_features.py
```
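
The aggregation step can be pictured as a group-by over observations followed by per-object summary statistics. The sketch below is illustrative only: the `(object_id, flux)` input shape, the function name `aggregate_flux_features`, and the chosen statistics are assumptions, not the contents of `src/make_features.py`. It uses only the standard library so it runs anywhere; the project itself may well use a dataframe library.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def aggregate_flux_features(
    observations: Iterable[Tuple[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Group (object_id, flux) pairs by object and compute summary statistics."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for object_id, flux in observations:
        grouped[object_id].append(flux)

    features: Dict[str, Dict[str, float]] = {}
    for object_id, fluxes in grouped.items():
        features[object_id] = {
            "flux_mean": mean(fluxes),
            "flux_std": pstdev(fluxes),  # population std; 0.0 for a single point
            "flux_min": min(fluxes),
            "flux_max": max(fluxes),
            "n_obs": float(len(fluxes)),
        }
    return features
```

Each object ends up as one row of features, which is the shape a tabular model such as LightGBM expects.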

2. Preprocessing

**Main script**: `src/preprocessing.py`

Merges split-wise data from multiple `split_*/` directories

Applies extinction correction for astrophysical effects

Cleans and prepares data for feature engineering, so it must run before `src/make_features.py` (see Data Flow)

**To preprocess data**:

```bash
python src/preprocessing.py
```
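
Merging the `split_*/` directories amounts to reading the same file from each split and concatenating the rows. The following is a minimal sketch under assumptions: the helper name `merge_splits`, the added `split` column, and the example file name are hypothetical and do not come from `src/preprocessing.py`.

```python
import csv
from pathlib import Path
from typing import Dict, List


def merge_splits(raw_dir: Path, filename: str) -> List[Dict[str, str]]:
    """Concatenate the rows of `filename` from every split_*/ directory under raw_dir."""
    rows: List[Dict[str, str]] = []
    # Sort the directories so the merged order is deterministic across runs.
    for split_dir in sorted(raw_dir.glob("split_*")):
        csv_path = split_dir / filename
        if not csv_path.is_file():
            continue  # skip splits that do not contain this file
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["split"] = split_dir.name  # remember which split the row came from
                rows.append(row)
    return rows
```

A call such as `merge_splits(Path("data/raw"), "lightcurves.csv")` would return all rows as dictionaries; the file name here is a placeholder.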

3. Model Training

**Main script**: `src/train.py`

Loads processed feature datasets

Trains LightGBM classifier with 5-fold cross-validation

Generates out-of-fold (OOF) predictions, feature importances, and submission files

Outputs all artifacts to `output/` directory

**To train model**:

```bash
python src/train.py
```
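
To make the out-of-fold idea concrete, here is a sketch of the bookkeeping only: every row is predicted exactly once, by a model trained on the other folds. The function names, the plain shuffled k-fold split, and the stand-in `prior_model` are assumptions for illustration. In the project, the `fit_predict` callable would wrap LightGBM training, and the actual fold strategy used by `src/train.py` (for example, stratified folds) is not stated here.

```python
import random
from typing import Callable, List, Sequence

# fit_predict(train_X, train_y, valid_X) -> one prediction per row of valid_X
FitPredict = Callable[[Sequence, Sequence, Sequence], List[float]]


def out_of_fold_predictions(
    X: Sequence,
    y: Sequence,
    fit_predict: FitPredict,
    n_folds: int = 5,
    seed: int = 42,
) -> List[float]:
    """Return one held-out prediction per row using k-fold cross-validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # The row at shuffled position p is assigned to fold p % n_folds.
    folds = [indices[k::n_folds] for k in range(n_folds)]

    oof = [0.0] * len(X)
    for valid_idx in folds:
        valid_set = set(valid_idx)
        train_idx = [i for i in indices if i not in valid_set]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in valid_idx],
        )
        # Store each prediction at the position of the row it belongs to.
        for i, p in zip(valid_idx, preds):
            oof[i] = p
    return oof


def prior_model(train_X: Sequence, train_y: Sequence, valid_X: Sequence) -> List[float]:
    """Stand-in model: predicts the training-fold positive rate for every row."""
    rate = sum(train_y) / len(train_y)
    return [rate] * len(valid_X)
```

Because no row is scored by a model that saw its label, the OOF vector gives an honest estimate of validation performance and can be compared across feature versions.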

Data Flow

Follow this pipeline sequence:

1. Raw data in `data/raw/`

2. Preprocessing (`src/preprocessing.py`)

3. Feature engineering (`src/make_features.py`)

4. Model training (`src/train.py`)

5. Output artifacts in `output/`

Data splits (`split_*/` directories) are merged and processed for both training and test datasets.

Configuration

**All paths and directories are centralized in `src/config.py`**

Always import and use configuration from this file

Do not hardcode paths in scripts or notebooks

Respect the configured directory structure
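
As an illustration of what centralized path configuration can look like, the sketch below builds the directory layout described in Project Structure from a single root. It is not the actual content of `src/config.py`; the function name `build_paths` and the dictionary keys are assumptions.

```python
from pathlib import Path
from typing import Dict


def build_paths(project_root: Path) -> Dict[str, Path]:
    """Return the documented directory layout, rooted at project_root."""
    data_dir = project_root / "data"
    return {
        "raw": data_dir / "raw",
        "processed": data_dir / "processed",
        "notebooks": project_root / "notebooks",
        "output": project_root / "output",
    }


# Scripts are run from the project root, so the current directory is used as the root here.
PATHS = build_paths(Path.cwd())
```

Scripts and notebooks would then refer to `PATHS["processed"]` and similar entries instead of spelling out directory strings.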

Code Conventions

1. **Vietnamese comments**: Some code and notebooks contain Vietnamese documentation and logging

2. **Feature selection**: Reference `notebooks/correlation-reduction-strategy.md` for domain-driven feature dropping rationale

3. **Notebooks**: Used for exploratory analysis and diagnostics, not production pipelines

4. **No automated tests**: Validation is manual via notebook outputs and script results

5. **All scripts run from project root**: Execute commands from the top-level directory
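
For the feature-selection convention above, a correlation screen is one common way to surface candidates for dropping. The sketch below only lists highly correlated pairs and leaves the decision to domain judgment. The function names, the 0.95 threshold, and the column names in the usage note are assumptions; the strategy actually documented in `notebooks/correlation-reduction-strategy.md` may differ.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(
    columns: Dict[str, Sequence[float]], threshold: float = 0.95
) -> List[Tuple[str, str, float]]:
    """List feature pairs whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs: List[Tuple[str, str, float]] = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs
```

Given columns such as `flux_mean` and `flux_median` that move together, the pair is reported so a reviewer can decide which one carries more physical meaning.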

Integration Points

**Key dependencies**:

`lightgbm`: Gradient boosting framework

`scikit-learn`: Machine learning utilities

`extinction`: Astrophysics-specific corrections for interstellar dust effects
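
The arithmetic behind an extinction correction is short enough to show directly. Dust dims a source by `A_lambda` magnitudes in a given band, so the observed flux is multiplied by `10**(0.4 * A_lambda)` to recover the unextinguished value. The sketch below takes `A_lambda` as given rather than calling the `extinction` package; the function names and the per-band coefficient are hypothetical.

```python
from typing import List, Sequence


def band_extinction(ebv: float, coefficient: float) -> float:
    """A_lambda = coefficient * E(B-V); the coefficient depends on the band and dust law."""
    return coefficient * ebv


def deredden_flux(flux: Sequence[float], a_lambda: float) -> List[float]:
    """Undo a_lambda magnitudes of extinction: F_corrected = F_observed * 10**(0.4 * a_lambda)."""
    factor = 10.0 ** (0.4 * a_lambda)
    return [value * factor for value in flux]
```

With zero extinction the fluxes are unchanged, and 2.5 magnitudes of extinction correspond to a factor of 10 in flux.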

**Extensibility**: Feature modules in `src/features/` can be extended to add new feature types for lightcurve characterization.
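
A new feature module could be as small as one function that maps a single lightcurve to a dictionary of named values. The example below is a hypothetical sketch: the function name `variability_features`, its inputs, and the three features are assumptions, and the interface that modules in `src/features/` actually expose is not documented here.

```python
from typing import Dict, Sequence


def variability_features(
    times: Sequence[float], fluxes: Sequence[float]
) -> Dict[str, float]:
    """Compute simple variability features for one lightcurve sorted by time."""
    if len(times) != len(fluxes) or len(fluxes) < 2:
        raise ValueError("need at least two matched observations")

    points = list(zip(times, fluxes))
    # Absolute flux change per unit time between consecutive observations.
    slopes = [
        abs(f1 - f0) / (t1 - t0)
        for (t0, f0), (t1, f1) in zip(points, points[1:])
        if t1 > t0
    ]
    return {
        "amplitude": (max(fluxes) - min(fluxes)) / 2.0,
        "max_abs_slope": max(slopes) if slopes else 0.0,
        "duration": times[-1] - times[0],
    }
```

Returning a flat dictionary keeps each module easy to merge into the per-object feature table.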

Instructions for AI Agents

1. **Always check `src/config.py` first** for path conventions and directory structure

2. **Respect the data flow**: Never write directly to `data/raw/` or `output/` except through designated scripts

3. **Reference notebooks**: Check `notebooks/` for feature selection rationale, EDA context, and experimental results

4. **Run from root**: All scripts assume execution from the project root directory

5. **Feature context**: Review `notebooks/correlation-reduction-strategy.md` before modifying feature engineering logic

6. **Data versioning**: Processed datasets use version suffixes (e.g., `train_final_v2.csv`)
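
The version-suffix convention can be handled with two small helpers, sketched below. The names `processed_paths` and `available_versions` are hypothetical; only the `train_final_<version>.csv` and `test_final_<version>.csv` patterns come from this document.

```python
from pathlib import Path
from typing import List, Tuple

TRAIN_PREFIX = "train_final_"
TEST_PREFIX = "test_final_"


def processed_paths(processed_dir: Path, version: str) -> Tuple[Path, Path]:
    """Return (train, test) CSV paths for one feature-set version, e.g. "v2"."""
    return (
        processed_dir / f"{TRAIN_PREFIX}{version}.csv",
        processed_dir / f"{TEST_PREFIX}{version}.csv",
    )


def available_versions(processed_dir: Path) -> List[str]:
    """List versions that have a training file on disk, sorted alphabetically."""
    return sorted(
        path.stem[len(TRAIN_PREFIX):]
        for path in processed_dir.glob(f"{TRAIN_PREFIX}*.csv")
    )
```

Checking `available_versions` before writing a new dataset helps avoid overwriting an existing version by accident.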

Examples

Generate features for a new dataset version:

```bash
python src/make_features.py
```

Run full pipeline from scratch:

```bash
python src/preprocessing.py
python src/make_features.py
python src/train.py
```
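
The same sequence can be driven from Python so that a failing step stops the run. This is a sketch rather than part of the project: the `run_pipeline` helper and its injectable `runner` argument are assumptions, while the three script paths are the ones listed above.

```python
import subprocess
import sys
from typing import Callable, Sequence

PIPELINE_STEPS = ("src/preprocessing.py", "src/make_features.py", "src/train.py")


def run_pipeline(
    steps: Sequence[str] = PIPELINE_STEPS,
    runner: Callable = subprocess.run,
) -> None:
    """Run each script in order from the current directory, stopping at the first failure."""
    for script in steps:
        print(f"Running {script} ...")
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit code.
        runner([sys.executable, script], check=True)
```

Calling `run_pipeline()` from the project root executes the steps with the current interpreter; passing a different `runner` makes the helper easy to dry-run.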

Important Notes

Raw data files are not version controlled and should remain in `data/raw/`

Model outputs (predictions, importances, submissions) go to `output/`

For unclear conventions or context, consult `README.md` and relevant notebooks

Vietnamese comments provide additional context for domain-specific logic
