Mixture of Experts Question Answering System
This skill implements a Mixture of Experts (MOE) question answering system that intelligently routes user questions to specialized AI models based on domain (programming, biology, mathematics). It uses the Qwen2-1.5B model series with dynamic model loading for memory efficiency.
What This Skill Does
Creates an intelligent question-answering system that:
Maintains a "director" model for question classificationDynamically loads expert models only when neededUses keyword matching for fast domain detectionFalls back to AI classification for ambiguous questionsManages GPU/CPU memory efficiently by loading only one expert at a timeProvides a conversational chat interfaceStep-by-Step Instructions
1. Initialize the MOE System Architecture
Set up the core components:
- Define a `MODEL_CONFIG` dictionary mapping expert domains to their HuggingFace model identifiers
- Include these domains: director, programming, biology, mathematics
- Specify the task type (text-generation) for each model
- Use Qwen2-1.5B variants optimized for each domain
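A minimal sketch of `MODEL_CONFIG`. Only the director's model ID appears in this document; the expert IDs below are placeholders you would replace with your own domain-tuned Qwen2-1.5B variants:

```python
# Sketch of MODEL_CONFIG. Only the director ID is given in this document;
# the expert entries are PLACEHOLDERS for your domain-tuned Qwen2-1.5B variants.
MODEL_CONFIG = {
    "director": {
        "model": "Agnuxo/Qwen2-1.5B-Instruct_MOE_Director_16bit",
        "task": "text-generation",
    },
    "programming": {"model": "your-org/qwen2-1.5b-programming", "task": "text-generation"},
    "biology":     {"model": "your-org/qwen2-1.5b-biology",     "task": "text-generation"},
    "mathematics": {"model": "your-org/qwen2-1.5b-mathematics", "task": "text-generation"},
}
```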
2. Create Keyword Dictionaries
Build domain-specific keyword lists:
- For **biology**: cell, DNA, protein, evolution, genetics, ecosystem, organism, metabolism, photosynthesis, microbiology (include Spanish translations)
- For **mathematics**: equation, integral, derivative, function, geometry, algebra, statistics, probability (include Spanish translations)
- For **programming**: python, java, C++, HTML, code, API, framework, debugging, algorithm, database, Git, machine learning, etc.
- Store in a `KEYWORDS` dictionary keyed by domain name
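A sketch of the `KEYWORDS` dictionary. Terms are stored in lowercase because the router lowercases the question before matching; the Spanish entries are illustrative translations of the lists above:

```python
# Keywords are lowercase because routing lowercases the question first.
KEYWORDS = {
    "biology": [
        "cell", "dna", "protein", "evolution", "genetics", "ecosystem",
        "organism", "metabolism", "photosynthesis", "microbiology",
        # Spanish translations
        "célula", "adn", "proteína", "evolución", "genética", "ecosistema",
        "organismo", "metabolismo", "fotosíntesis", "microbiología",
    ],
    "mathematics": [
        "equation", "integral", "derivative", "function", "geometry",
        "algebra", "statistics", "probability",
        # Spanish translations
        "ecuación", "derivada", "función", "geometría",
        "álgebra", "estadística", "probabilidad",
    ],
    "programming": [
        "python", "java", "c++", "html", "code", "api", "framework",
        "debugging", "algorithm", "database", "git", "machine learning",
    ],
}
```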
3. Implement the MOELLM Class
Create the main orchestrator class with these methods:
**Initialization (`__init__`)**:
- Detect CUDA availability and set the device (cuda/cpu)
- Initialize current_expert, current_model, and current_tokenizer to None
- Immediately load the director model

**Director Model Loading (`load_director_model`)**:
- Load the Agnuxo/Qwen2-1.5B-Instruct_MOE_Director_16bit model
- Use torch.float16 for memory efficiency
- Create a text-generation pipeline for the director
- Print confirmation when loaded

**Dynamic Expert Loading (`load_expert_model`)**:
- Check if the requested expert differs from current_expert
- If different, free memory: release the current model/tokenizer and call `torch.cuda.empty_cache()`
- Load the new expert's tokenizer and model (torch.float16)
- Update the current_expert tracker
- Return a pipeline for the loaded expert
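A minimal class skeleton covering the three methods above, assuming the `transformers` pipeline API and the `MODEL_CONFIG` sketched in step 1. Note that `torch.float16` generally assumes a GPU; on CPU you may need `torch.float32`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

class MOELLM:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.current_expert = None
        self.current_model = None
        self.current_tokenizer = None
        self.load_director_model()

    def load_director_model(self):
        """Load the always-resident director used for question classification."""
        name = MODEL_CONFIG["director"]["model"]
        self.director_tokenizer = AutoTokenizer.from_pretrained(name)
        self.director_model = AutoModelForCausalLM.from_pretrained(
            name, torch_dtype=torch.float16
        ).to(self.device)
        self.director_pipeline = pipeline(
            "text-generation",
            model=self.director_model,
            tokenizer=self.director_tokenizer,
        )
        print("Director model loaded.")

    def load_expert_model(self, expert):
        """Load an expert on demand, keeping at most one expert in memory."""
        if expert != self.current_expert:
            # Drop references to the previous expert, then reclaim GPU memory.
            self.current_model = None
            self.current_tokenizer = None
            if self.device == "cuda":
                torch.cuda.empty_cache()
            name = MODEL_CONFIG[expert]["model"]
            self.current_tokenizer = AutoTokenizer.from_pretrained(name)
            self.current_model = AutoModelForCausalLM.from_pretrained(
                name, torch_dtype=torch.float16
            ).to(self.device)
            self.current_expert = expert
        return pipeline(
            "text-generation",
            model=self.current_model,
            tokenizer=self.current_tokenizer,
        )
```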
4. Implement Question Routing Logic
**Keyword-Based Routing (`determine_expert_by_keywords`)**:
- Convert the question to lowercase
- Iterate through the KEYWORDS dictionary
- Return the first domain where any keyword matches
- Return None if no keywords match

**Hybrid Routing (`determine_expert`)**:
- First attempt keyword matching
- If successful, return the expert immediately
- If no keyword match, construct a classification prompt for the director
- Prompt format: "Classify the following question into one of these categories: programming, biology, mathematics. Question: {question}\nCategory:"
- Parse the director's response to extract the category
- Validate that the category exists in MODEL_CONFIG
- Default to "director" if the category is invalid
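A sketch of both routing methods, continuing the `MOELLM` class above. The prompt string comes from the list above; the parse of the director's output is a heuristic assumption:

```python
    def determine_expert_by_keywords(self, question):
        """Fast path: return the first domain with a keyword hit, else None."""
        q = question.lower()
        for domain, words in KEYWORDS.items():
            if any(word in q for word in words):
                return domain
        return None

    def determine_expert(self, question):
        """Hybrid routing: keywords first, director classification as fallback."""
        expert = self.determine_expert_by_keywords(question)
        if expert is not None:
            return expert
        prompt = (
            "Classify the following question into one of these categories: "
            f"programming, biology, mathematics. Question: {question}\nCategory:"
        )
        output = self.director_pipeline(prompt, max_new_tokens=10)[0]["generated_text"]
        # Heuristic parse (assumption): take the first word after "Category:".
        tail = output.split("Category:")[-1].strip().lower()
        category = tail.split()[0].strip(".,") if tail else ""
        return category if category in MODEL_CONFIG else "director"
```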
5. Implement Response Generation
**Response Generation (`generate_response`)**:
- Load the appropriate expert model using `load_expert_model`
- Construct a prompt: "Answer the following question as an expert in {expert}: {question}\nAnswer:"
- Generate the response with max_length=200
- Extract the answer portion after the "Answer:" delimiter
- Implement error handling with user-friendly error messages
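A sketch of `generate_response`, again continuing the class. Note that `max_length=200` counts the prompt plus the completion:

```python
    def generate_response(self, question, expert):
        """Load the right expert, prompt it, and extract the answer text."""
        try:
            expert_pipeline = self.load_expert_model(expert)
            prompt = (
                f"Answer the following question as an expert in {expert}: "
                f"{question}\nAnswer:"
            )
            output = expert_pipeline(prompt, max_length=200)[0]["generated_text"]
            # Keep only the completion after the "Answer:" delimiter.
            return output.split("Answer:")[-1].strip()
        except Exception as exc:
            return f"Sorry, I couldn't generate a response ({exc}). Please try rephrasing."
```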
6. Create the Chat Interface
**Chat Loop (`chat_interface`)**:
- Display a welcome message and exit instructions
- Enter an infinite loop reading user input
- Break on 'exit' or 'quit' commands
- For each question (see the sketch after this list):
  - Determine the expert using `determine_expert`
  - Generate the response using `generate_response`
  - Display the response prefixed with the expert name
  - Handle exceptions gracefully with retry prompts
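A sketch of the chat loop, tying the routing and generation methods together (still inside the class):

```python
    def chat_interface(self):
        """Simple REPL: route each question to an expert and print the answer."""
        print("MOE Q&A system ready. Type 'exit' or 'quit' to leave.")
        while True:
            question = input("\nYou: ").strip()
            if question.lower() in ("exit", "quit"):
                break
            try:
                expert = self.determine_expert(question)
                answer = self.generate_response(question, expert)
                print(f"[{expert}] {answer}")
            except Exception as exc:
                print(f"Something went wrong ({exc}). Please try again.")
```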
7. Set Up the Entry Point
In the main block:
- Instantiate the MOELLM class
- Call `chat_interface()` to start the interactive session
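The entry point is then just:

```python
if __name__ == "__main__":
    moe = MOELLM()
    moe.chat_interface()
```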
Implementation Notes
**Memory Management**: The system keeps only one expert model in memory at a time. When switching experts, explicitly release the previous model and clear the CUDA cache to prevent out-of-memory (OOM) errors.
**Model Precision**: Use `torch.float16` for all models to reduce memory footprint while maintaining acceptable accuracy.
**Fallback Strategy**: Keyword matching provides near-instant routing for clear-cut questions; the director model handles edge cases and ambiguous questions.
**Bilingual Support**: Keyword lists include Spanish translations to support multilingual question matching.
**Error Resilience**: Wrap model loading and generation in try-except blocks. Provide actionable error messages to users.
Example Usage
```text
User asks: "What is photosynthesis?"
System matches keyword "photosynthesis" → routes to biology expert
Biology model generates domain-specific answer
User asks: "How do I sort a list in Python?"
System matches keyword "python" → routes to programming expert
Programming model generates code example
User asks: "Explain quantum mechanics"
No keyword match → director classification yields no valid category
System falls back to the director model for a general response
```
Requirements
- Python 3.8+
- transformers
- torch (with CUDA support recommended)
- 8GB+ GPU memory (for a single expert + the director) or CPU with 16GB+ RAM

Model Information
- **Base Model**: Agnuxo/Qwen2-1.5B-Instruct_MOE_assistant_16bit
- **Quantization**: GGUF 8-bit
- **Training**: Fine-tuned with Unsloth for 2x training speed
- **License**: Apache 2.0

References
- Full implementation: https://huggingface.co/Agnuxo/Qwen2-1.5B-Instruct_MOE_Director_16bit/resolve/main/MOE-LLMs3.py
- GitHub repository: https://github.com/Agnuxo1/NEBULA