A GGUF-quantized 1.5B-parameter code assistant based on Qwen2-1.5B-Instruct, designed as the programming expert in a Multi-Expert (MOE) question-answering system that routes questions to specialized models for efficient inference.
This model is the **code/programming expert** component of a larger MOE Question Answering System. It has been fine-tuned using Unsloth and Hugging Face's TRL library to provide specialized programming assistance. The model is available in 21 different quantization levels (Q2_K through Q8_0) to balance performance and resource usage.
This model works as part of a dynamic expert routing system:
1. **Director Model** classifies incoming questions by domain (programming, biology, mathematics, etc.)
2. **Expert Models** are loaded on-demand based on question classification
3. **Dynamic Memory Management** releases previous expert models to optimize resource usage
4. **Specialized Responses** are generated by the domain-specific expert
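The four-step flow above can be sketched in a few lines of Python. This is a minimal illustration, not the real system: the names (`EXPERTS`, `classify`, `ExpertRouter`) and the non-code model path are assumptions, and the `classify` function is a naive keyword stand-in for the director model.

```python
# Minimal sketch of the dynamic expert routing described above.
# All names and the "general" model path are illustrative, not the real API.

EXPERTS = {
    "programming": "models/Qwen2-1.5B-Instruct_MOE_CODE_assistant_16bit.Q4_K_M.gguf",
    "general": "models/general_expert.Q4_K_M.gguf",  # hypothetical path
}

def classify(question: str) -> str:
    """Stand-in for the director model: a naive keyword check."""
    code_words = ("python", "code", "function", "api", "debug")
    q = question.lower()
    return "programming" if any(w in q for w in code_words) else "general"

class ExpertRouter:
    """Keeps one expert loaded at a time, releasing the previous one (step 3)."""

    def __init__(self):
        self.domain = None
        self.model_path = None

    def route(self, question: str) -> str:
        domain = classify(question)
        if domain != self.domain:
            # In a real system the previous Llama instance would be freed
            # here before the next GGUF file is loaded from disk.
            self.domain = domain
            self.model_path = EXPERTS[domain]
        return self.domain
```

In practice `model_path` would be passed to `llama_cpp.Llama` on each switch; the sketch only tracks which expert would be active.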
Select a quantization level based on your hardware and accuracy requirements:
Download your chosen quantization from the Hugging Face repository:
```bash
huggingface-cli download RichardErkhov/Agnuxo_-_Qwen2-1.5B-Instruct_MOE_CODE_assistant_16bit-gguf \
Qwen2-1.5B-Instruct_MOE_CODE_assistant_16bit.Q4_K_M.gguf \
--local-dir ./models
```
**llama.cpp:**
```bash
./main -m models/Qwen2-1.5B-Instruct_MOE_CODE_assistant_16bit.Q4_K_M.gguf \
-p "Write a Python function to calculate Fibonacci numbers:" \
-n 256 --temp 0.7
```
**Ollama:**
```bash
echo 'FROM ./models/Qwen2-1.5B-Instruct_MOE_CODE_assistant_16bit.Q4_K_M.gguf' > Modelfile
ollama create qwen2-code -f Modelfile
ollama run qwen2-code "Explain how async/await works in JavaScript"
```
**Python (llama-cpp-python):**
```python
from llama_cpp import Llama
llm = Llama(
    model_path="models/Qwen2-1.5B-Instruct_MOE_CODE_assistant_16bit.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)
response = llm(
    "Answer the following question as a programming expert: How do I reverse a string in Python?\nAnswer:",
    max_tokens=200,
    temperature=0.7,
)
print(response['choices'][0]['text'])
```
To use this model as part of the full MOE architecture, integrate it into the expert routing system. The director model will automatically route programming-related questions to this model based on keyword detection or classification.
**Keywords that trigger this expert:**
`python`, `java`, `C++`, `HTML`, `script`, `code`, `API`, `framework`, `debugging`, `algorithm`, `database`, `CSS`, `JSON`, `encryption`, `Git`, `machine learning`
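A simple trigger check over this keyword list might look like the following. This is an illustrative sketch using substring matching; the actual director may use a trained classifier rather than keywords, and the function name is hypothetical.

```python
# Illustrative keyword trigger using the list above; the real director
# may use a classifier instead of simple substring matching.
CODE_KEYWORDS = (
    "python", "java", "c++", "html", "script", "code", "api", "framework",
    "debugging", "algorithm", "database", "css", "json", "encryption",
    "git", "machine learning",
)

def routes_to_code_expert(question: str) -> bool:
    q = question.lower()
    return any(kw in q for kw in CODE_KEYWORDS)

routes_to_code_expert("How do I merge branches in Git?")  # True
```

Note that bare substring matching can over-trigger (e.g. "api" appears inside unrelated words), which is one reason a classifier-based director is also mentioned above.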
For optimal results, use this prompt template:
```
Answer the following question as an expert in programming: {your_question}
Answer:
```
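The template can be applied with a small helper before each call. The function name is illustrative, not part of the model's API:

```python
def build_prompt(question: str) -> str:
    """Wrap a user question in the recommended expert prompt template."""
    return f"Answer the following question as an expert in programming: {question}\nAnswer:"

prompt = build_prompt("How do I reverse a string in Python?")
```

The resulting string can be passed directly as the prompt argument in the llama-cpp-python example above.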
**Code Generation:**
```
Q: Write a Python function to check if a number is prime.
A: Here's an efficient implementation:
   def is_prime(n):
       if n < 2:
           return False
       if n == 2:
           return True
       if n % 2 == 0:
           return False
       for i in range(3, int(n**0.5) + 1, 2):
           if n % i == 0:
               return False
       return True
```
**Debugging Assistance:**
```
Q: Why am I getting a "list index out of range" error in Python?
A: This error occurs when you try to access an index that doesn't exist in a list. Check that your index is within the valid range (0 to len(list)-1) and ensure your list isn't empty before accessing elements.
```
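The fix described in that answer can be shown concretely with a bounds check before access:

```python
# Demonstrating the advice above: guard the index before accessing the list.
items = [10, 20, 30]
i = 5

# items[i] here would raise IndexError: list index out of range.
value = items[i] if 0 <= i < len(items) else None
print(value)  # None — index 5 is outside the valid range 0..2
```

The same guard pattern (`0 <= i < len(items)`) works for any sequence access where the index comes from untrusted input.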
| Hardware | Recommended Quantization | Expected Speed |
|----------|-------------------------|----------------|
| 4GB RAM | Q2_K - Q3_K | Very fast |
| 8GB RAM | Q4_K - Q5_K | Fast |
| 16GB+ RAM | Q6_K - Q8_0 | Best quality |