Mixture of Experts Question Answering System
This skill implements a Mixture of Experts (MOE) question answering system that intelligently routes user questions to specialized AI models based on domain (programming, biology, mathematics). It uses the Qwen2-1.5B model series with dynamic model loading for memory efficiency.
What This Skill Does
Creates an intelligent question-answering system that:
Maintains a "director" model for question classificationDynamically loads expert models only when neededUses keyword matching for fast domain detectionFalls back to AI classification for ambiguous questionsManages GPU/CPU memory efficiently by loading only one expert at a timeProvides a conversational chat interfaceStep-by-Step Instructions
1. Initialize the MOE System Architecture
Set up the core components:
- Define a `MODEL_CONFIG` dictionary mapping expert domains to their HuggingFace model identifiers
- Include these domains: director, programming, biology, mathematics
- Specify the task type (text-generation) for each model
- Use Qwen2-1.5B variants optimized for each domain
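A minimal sketch of `MODEL_CONFIG`. Only the director's model ID appears in this document; the expert IDs below are placeholders you would replace with your own domain-tuned Qwen2-1.5B variants:

```python
# Sketch of MODEL_CONFIG. Only the director ID is given in this document;
# the expert entries are PLACEHOLDERS for your domain-tuned Qwen2-1.5B variants.
MODEL_CONFIG = {
    "director": {
        "model": "Agnuxo/Qwen2-1.5B-Instruct_MOE_Director_16bit",
        "task": "text-generation",
    },
    "programming": {"model": "your-org/qwen2-1.5b-programming", "task": "text-generation"},
    "biology":     {"model": "your-org/qwen2-1.5b-biology",     "task": "text-generation"},
    "mathematics": {"model": "your-org/qwen2-1.5b-mathematics", "task": "text-generation"},
}
```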
2. Create Keyword Dictionaries
Build domain-specific keyword lists:
- For **biology**: cell, DNA, protein, evolution, genetics, ecosystem, organism, metabolism, photosynthesis, microbiology (include Spanish translations)
- For **mathematics**: equation, integral, derivative, function, geometry, algebra, statistics, probability (include Spanish translations)
- For **programming**: python, java, C++, HTML, code, API, framework, debugging, algorithm, database, Git, machine learning, etc.
- Store in a `KEYWORDS` dictionary keyed by domain name
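A sketch of the `KEYWORDS` dictionary. Terms are stored in lowercase because the router lowercases the question before matching; the Spanish entries are illustrative translations of the lists above:

```python
# Keywords are lowercase because routing lowercases the question first.
KEYWORDS = {
    "biology": [
        "cell", "dna", "protein", "evolution", "genetics", "ecosystem",
        "organism", "metabolism", "photosynthesis", "microbiology",
        # Spanish translations
        "célula", "adn", "proteína", "evolución", "genética", "ecosistema",
        "organismo", "metabolismo", "fotosíntesis", "microbiología",
    ],
    "mathematics": [
        "equation", "integral", "derivative", "function", "geometry",
        "algebra", "statistics", "probability",
        # Spanish translations
        "ecuación", "derivada", "función", "geometría",
        "álgebra", "estadística", "probabilidad",
    ],
    "programming": [
        "python", "java", "c++", "html", "code", "api", "framework",
        "debugging", "algorithm", "database", "git", "machine learning",
    ],
}
```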
3. Implement the MOELLM Class
Create the main orchestrator class with these methods:
**Initialization (`__init__`)**:
- Detect CUDA availability and set the device (cuda/cpu)
- Initialize current_expert, current_model, and current_tokenizer to None
- Immediately load the director model

**Director Model Loading (`load_director_model`)**:
- Load the Agnuxo/Qwen2-1.5B-Instruct_MOE_Director_16bit model
- Use torch.float16 for memory efficiency
- Create a text-generation pipeline for the director
- Print confirmation when loaded

**Dynamic Expert Loading (`load_expert_model`)**:
- Check if the requested expert differs from current_expert
- If different, free memory: release the current model/tokenizer and call `torch.cuda.empty_cache()`
- Load the new expert's tokenizer and model (torch.float16)
- Update the current_expert tracker
- Return a pipeline for the loaded expert
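A minimal class skeleton covering the three methods above, assuming the `transformers` pipeline API and the `MODEL_CONFIG` sketched in step 1. Note that `torch.float16` generally assumes a GPU; on CPU you may need `torch.float32`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

class MOELLM:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.current_expert = None
        self.current_model = None
        self.current_tokenizer = None
        self.load_director_model()

    def load_director_model(self):
        """Load the always-resident director used for question classification."""
        name = MODEL_CONFIG["director"]["model"]
        self.director_tokenizer = AutoTokenizer.from_pretrained(name)
        self.director_model = AutoModelForCausalLM.from_pretrained(
            name, torch_dtype=torch.float16
        ).to(self.device)
        self.director_pipeline = pipeline(
            "text-generation",
            model=self.director_model,
            tokenizer=self.director_tokenizer,
        )
        print("Director model loaded.")

    def load_expert_model(self, expert):
        """Load an expert on demand, keeping at most one expert in memory."""
        if expert != self.current_expert:
            # Drop references to the previous expert, then reclaim GPU memory.
            self.current_model = None
            self.current_tokenizer = None
            if self.device == "cuda":
                torch.cuda.empty_cache()
            name = MODEL_CONFIG[expert]["model"]
            self.current_tokenizer = AutoTokenizer.from_pretrained(name)
            self.current_model = AutoModelForCausalLM.from_pretrained(
                name, torch_dtype=torch.float16
            ).to(self.device)
            self.current_expert = expert
        return pipeline(
            "text-generation",
            model=self.current_model,
            tokenizer=self.current_tokenizer,
        )
```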
4. Implement Question Routing Logic
**Keyword-Based Routing (`determine_expert_by_keywords`)**:
- Convert the question to lowercase
- Iterate through the KEYWORDS dictionary
- Return the first domain where any keyword matches
- Return None if no keywords match

**Hybrid Routing (`determine_expert`)**:
- First attempt keyword matching
- If successful, return the expert immediately
- If no keyword match, construct a classification prompt for the director
- Prompt format: "Classify the following question into one of these categories: programming, biology, mathematics. Question: {question}\nCategory:"
- Parse the director's response to extract the category
- Validate that the category exists in MODEL_CONFIG
- Default to "director" if the category is invalid
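A sketch of both routing methods, continuing the `MOELLM` class above. The prompt string comes from the list above; the parse of the director's output is a heuristic assumption:

```python
    def determine_expert_by_keywords(self, question):
        """Fast path: return the first domain with a keyword hit, else None."""
        q = question.lower()
        for domain, words in KEYWORDS.items():
            if any(word in q for word in words):
                return domain
        return None

    def determine_expert(self, question):
        """Hybrid routing: keywords first, director classification as fallback."""
        expert = self.determine_expert_by_keywords(question)
        if expert is not None:
            return expert
        prompt = (
            "Classify the following question into one of these categories: "
            f"programming, biology, mathematics. Question: {question}\nCategory:"
        )
        output = self.director_pipeline(prompt, max_new_tokens=10)[0]["generated_text"]
        # Heuristic parse (assumption): take the first word after "Category:".
        tail = output.split("Category:")[-1].strip().lower()
        category = tail.split()[0].strip(".,") if tail else ""
        return category if category in MODEL_CONFIG else "director"
```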
5. Implement Response Generation
**Response Generation (`generate_response`)**:
- Load the appropriate expert model using `load_expert_model`
- Construct a prompt: "Answer the following question as an expert in {expert}: {question}\nAnswer:"
- Generate the response with max_length=200
- Extract the answer portion after the "Answer:" delimiter
- Implement error handling with user-friendly error messages
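A sketch of `generate_response`, again continuing the class. Note that `max_length=200` counts the prompt plus the completion:

```python
    def generate_response(self, question, expert):
        """Load the right expert, prompt it, and extract the answer text."""
        try:
            expert_pipeline = self.load_expert_model(expert)
            prompt = (
                f"Answer the following question as an expert in {expert}: "
                f"{question}\nAnswer:"
            )
            output = expert_pipeline(prompt, max_length=200)[0]["generated_text"]
            # Keep only the completion after the "Answer:" delimiter.
            return output.split("Answer:")[-1].strip()
        except Exception as exc:
            return f"Sorry, I couldn't generate a response ({exc}). Please try rephrasing."
```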
6. Create the Chat Interface
**Chat Loop (`chat_interface`)**:
- Display a welcome message and exit instructions
- Enter an infinite loop reading user input
- Break on 'exit' or 'quit' commands
- For each question (see the sketch after this list):
  - Determine the expert using `determine_expert`
  - Generate the response using `generate_response`
  - Display the response prefixed with the expert name
  - Handle exceptions gracefully with retry prompts
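A sketch of the chat loop, tying the routing and generation methods together (still inside the class):

```python
    def chat_interface(self):
        """Simple REPL: route each question to an expert and print the answer."""
        print("MOE Q&A system ready. Type 'exit' or 'quit' to leave.")
        while True:
            question = input("\nYou: ").strip()
            if question.lower() in ("exit", "quit"):
                break
            try:
                expert = self.determine_expert(question)
                answer = self.generate_response(question, expert)
                print(f"[{expert}] {answer}")
            except Exception as exc:
                print(f"Something went wrong ({exc}). Please try again.")
```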
7. Set Up the Entry Point
In the main block:
- Instantiate the MOELLM class
- Call `chat_interface()` to start the interactive session
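The entry point is then just:

```python
if __name__ == "__main__":
    moe = MOELLM()
    moe.chat_interface()
```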
Implementation Notes
**Memory Management**: The system keeps only one expert model in memory at a time. When switching experts, explicitly release the previous model and clear the CUDA cache to prevent out-of-memory (OOM) errors.
**Model Precision**: Use `torch.float16` for all models to reduce memory footprint while maintaining acceptable accuracy.
**Fallback Strategy**: Keyword matching provides near-instant routing for clear-cut questions; the director model handles edge cases and ambiguous questions.
**Bilingual Support**: Keyword lists include Spanish translations to support multilingual question matching.
**Error Resilience**: Wrap model loading and generation in try-except blocks. Provide actionable error messages to users.
Example Usage
```text
User asks: "What is photosynthesis?"
System matches keyword "photosynthesis" → routes to biology expert
Biology model generates domain-specific answer
User asks: "How do I sort a list in Python?"
System matches keyword "python" → routes to programming expert
Programming model generates code example
User asks: "Explain quantum mechanics"
No keyword match → director classification yields no valid category
System falls back to the director model for a general response
```
Requirements
- Python 3.8+
- transformers
- torch (with CUDA support recommended)
- 8GB+ GPU memory (for a single expert + the director) or CPU with 16GB+ RAM

Model Information
- **Base Model**: Agnuxo/Qwen2-1.5B-Instruct_MOE_assistant_16bit
- **Quantization**: GGUF 8-bit
- **Training**: Fine-tuned with Unsloth for 2x training speed
- **License**: Apache 2.0

References
- Full implementation: https://huggingface.co/Agnuxo/Qwen2-1.5B-Instruct_MOE_Director_16bit/resolve/main/MOE-LLMs3.py
- GitHub repository: https://github.com/Agnuxo1/NEBULA