A mixture-of-experts (MOE) question-answering system that intelligently routes user questions to domain-specific LLMs. A director model classifies each question (with fast keyword matching as a first pass), and the matching expert model (programming, biology, or mathematics) is loaded on demand, optimizing both response quality and memory usage.
This skill implements a complete MOE question-answering system.
The base model (Qwen2-1.5B-Instruct MOE Code Assistant) was fine-tuned using Unsloth and serves as the programming expert in this architecture.
1. **Initialization**: Load the director model and prepare expert model configurations
2. **Question Classification**: Use keyword matching first, fall back to director LLM if needed
3. **Dynamic Loading**: Load the appropriate expert model, releasing the previous one to free memory
4. **Response Generation**: The expert model generates a domain-specific answer
5. **Iteration**: System remains ready for the next question with minimal memory footprint
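The workflow above can be sketched as a single routing function. Here `classify`, `load_expert`, and `generate` are hypothetical callables standing in for the director/keyword classifier, the dynamic model loader, and the expert's generation call:

```python
def answer_question(question, classify, load_expert, generate):
    # Steps 2-4: classify the question, load the matching expert,
    # and let that expert generate the answer.
    expert_name = classify(question)    # keyword match or director fallback
    expert = load_expert(expert_name)   # releases the previous expert internally
    return expert_name, generate(expert, question)
```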
When a user requests to implement or use this multi-expert question-answering system:
1. **Environment Setup**
- Verify Python 3.8+ is available
- Check for CUDA availability (GPU recommended but not required)
- Install required dependencies: `transformers`, `torch`, `accelerate`
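A minimal setup sequence, assuming `pip` is available (package names taken from the dependency list above):

```shell
# Require Python 3.8+
python3 -c "import sys; assert sys.version_info >= (3, 8)"
# Install the dependencies
pip install transformers torch accelerate
# Report whether a CUDA GPU is visible (optional; CPU also works)
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```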
2. **Create the MOE System Implementation**
- Implement the `MOELLM` class with the following structure:
- `__init__`: Initialize device detection and load director model
- `load_director_model`: Load the Qwen2-based director model
- `load_expert_model`: Dynamically load expert models with memory management
- `determine_expert_by_keywords`: Fast keyword-based routing
- `determine_expert`: Full classification using director model
- `generate_response`: Generate domain-specific answers
- `chat_interface`: Interactive Q&A loop
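A structural sketch of the class is shown below. The loader is injected so the skeleton stays runnable without downloading weights; a real implementation would call `AutoModelForCausalLM.from_pretrained` / `AutoTokenizer.from_pretrained` inside the two load methods:

```python
from typing import Optional


class MOELLM:
    """Sketch of the MOE controller; model loading is injected so the
    structure can be shown without downloading any weights."""

    def __init__(self, model_config, loader=None):
        self.model_config = model_config
        self._loader = loader or (lambda name: name)  # placeholder loader
        self.current_expert_name: Optional[str] = None
        self.current_expert = None
        self.director = self.load_director_model()

    def load_director_model(self):
        return self._loader(self.model_config["director"])

    def load_expert_model(self, expert: str):
        if expert == self.current_expert_name:
            return self.current_expert       # already loaded, reuse it
        self.current_expert = None           # drop the old expert first
        self.current_expert = self._loader(self.model_config[expert])
        self.current_expert_name = expert
        return self.current_expert
```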
3. **Configure Model Mappings**
```python
MODEL_CONFIG = {
    "director": "Agnuxo/Qwen2-1.5B-Instruct_MOE_Director_16bit",
    "programming": "Qwen/Qwen2-1.5B-Instruct",
    "biology": "Agnuxo/Qwen2-1.5B-Instruct_MOE_BIOLOGY_assistant_16bit",
    "mathematics": "Qwen/Qwen2-Math-1.5B-Instruct",
}
```
4. **Define Keyword Dictionaries**
- Create keyword sets for each domain (biology, mathematics, programming)
- Include multilingual keywords where applicable
- Ensure comprehensive coverage of domain-specific terminology
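A sketch with illustrative (not exhaustive) keyword sets; returning `None` signals that the director model should classify the question instead:

```python
from typing import Optional

# Hypothetical keyword sets; extend each domain, including multilingual terms.
KEYWORDS = {
    "biology": {"cell", "dna", "protein", "enzyme", "célula"},
    "mathematics": {"integral", "derivative", "equation", "theorem"},
    "programming": {"python", "function", "compile", "debug"},
}


def determine_expert_by_keywords(question: str) -> Optional[str]:
    # Fast first-pass routing: any keyword overlap decides the domain.
    words = set(question.lower().split())
    for domain, keywords in KEYWORDS.items():
        if words & keywords:
            return domain
    return None  # no match: fall back to the director model
```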
5. **Implement Memory Management**
- Use `del` to remove previous models before loading new ones
- Call `torch.cuda.empty_cache()` after model deletion
- Convert models to float16 to reduce memory footprint
- Only keep the director and one expert in memory at a time
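The swap pattern can be isolated into a helper. As an illustrative assumption, torch is imported lazily here so the pattern remains visible (and testable) even without a GPU stack installed:

```python
import gc


def swap_expert(current_model, load_fn):
    # Release the old expert before loading the new one so that only
    # one expert occupies memory at a time.
    del current_model
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return freed GPU blocks to the driver
    except ImportError:
        pass  # CPU-only environment without torch installed
    return load_fn()
```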
6. **Create the Chat Interface**
- Implement a continuous input loop
- Show which expert is handling each question
- Provide clear error messages
- Allow graceful exit with 'exit' or 'quit'
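A testable sketch of the loop; I/O is injected so the 'exit'/'quit' handling and the error fallback can be exercised without a terminal. `answer_fn` is assumed to return an `(expert_name, answer)` pair:

```python
def chat_interface(answer_fn, input_fn=input, output_fn=print):
    # Continuous Q&A loop; answer_fn(question) -> (expert_name, answer).
    while True:
        question = input_fn("You: ").strip()
        if question.lower() in ("exit", "quit"):
            output_fn("Goodbye!")
            break
        if not question:
            output_fn("Please enter a question.")
            continue
        try:
            expert, answer = answer_fn(question)
            output_fn(f"[{expert}] {answer}")  # show which expert answered
        except Exception as exc:
            output_fn(f"Sorry, something went wrong: {exc}")
```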
7. **Error Handling**
- Wrap model loading in try-except blocks
- Handle unknown expert requests
- Catch generation errors and provide fallback messages
- Log errors while maintaining user-friendly responses
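Model loading can be wrapped so that a failure produces a logged error and a fallback value rather than a crash (the helper and its names are illustrative):

```python
import logging


def safe_load(model_name, load_fn, fallback=None):
    # Try to load the model; on failure, log the error and return a
    # fallback so the chat loop can keep running.
    try:
        return load_fn(model_name)
    except Exception as exc:
        logging.error("Failed to load %s: %s", model_name, exc)
        return fallback
```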
8. **Testing Strategy**
- Test with questions from each domain
- Verify keyword routing works correctly
- Confirm director fallback activates for ambiguous questions
- Monitor memory usage across expert switches
- Test edge cases (empty questions, very long prompts)
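The routing checks above can be captured as a small smoke test that runs one representative question per domain against any `route(question) -> domain` function (the questions here are illustrative):

```python
def run_routing_smoke_tests(route):
    # One representative question per domain; extend with ambiguous and
    # edge-case inputs (empty strings, very long prompts) as needed.
    cases = {
        "Write a Python function to sort a list": "programming",
        "How does a cell produce protein": "biology",
        "Compute the integral of x squared": "mathematics",
    }
    for question, expected in cases.items():
        got = route(question)
        assert got == expected, f"{question!r}: expected {expected}, got {got}"
```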
9. **Optimization Considerations**
- Use quantized models (GGUF, 4-bit) for lower memory usage
- Implement model caching if switching between same experts frequently
- Consider batch processing for multiple questions
- Add prompt templates for better expert performance
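Expert caching (the second bullet) can be sketched as an LRU over loaded models; a capacity of 1 reproduces the one-expert-at-a-time policy, while a larger capacity trades memory for fewer reloads:

```python
from collections import OrderedDict


class ExpertCache:
    # Keep at most `capacity` experts loaded; evict the least recently used.
    def __init__(self, capacity=1):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, name, load_fn):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        model = load_fn(name)
        self._cache[name] = model
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict oldest; GC frees it
        return model
```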
10. **Extension Points**
- Add new expert domains by updating MODEL_CONFIG and KEYWORDS
- Implement confidence scoring for routing decisions
- Add conversation history for context-aware responses
- Create a web API wrapper for remote access
- Integrate with vector databases for retrieval-augmented generation
Basic usage:

```python
moe_llm = MOELLM()
moe_llm.chat_interface()
```
Apache 2.0 - Free for commercial and non-commercial use