DeepScaleR 1.5B GGUF Model

This skill helps you download, configure, and run the DeepScaleR 1.5B model in GGUF format for local inference. The model is specialized in mathematical reasoning and problem-solving, quantized using llama.cpp with imatrix optimization.

Overview

DeepScaleR 1.5B is a 1.5 billion parameter model trained on mathematical datasets (NuminaMath-CoT, Omni-MATH, STILL-3-Preview-RL-Data, competition_math). This skill guides you through selecting the appropriate quantization level, downloading the model, and running it locally with llama.cpp or compatible tools like LM Studio.

Instructions

Step 1: Assess System Requirements

Determine available RAM and VRAM to select the appropriate quantization:

**GPU-only inference (fastest)**: Choose a quant 1-2GB smaller than your GPU VRAM

**CPU + GPU inference (maximum quality)**: Add system RAM and GPU VRAM together, then choose a quant 1-2GB smaller than that total

**CPU-only inference**: Choose based on available system RAM
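
The sizing rule above can be sketched as a small shell helper. The function name `quant_budget_mb` and the 1536MB default headroom (the midpoint of the 1-2GB range) are illustrative assumptions. Applying the same headroom to CPU-only inference is also an assumption, since the rule above only says to choose based on available RAM.

```bash
# Hypothetical helper: print how many MB are left for the model file.
# Arguments: mode (gpu|hybrid|cpu), system RAM in MB, GPU VRAM in MB,
# optional headroom in MB (assumed default: 1536).
quant_budget_mb() {
  local mode="$1" ram_mb="$2" vram_mb="$3" headroom_mb="${4:-1536}"
  local total budget
  case "$mode" in
    gpu)    total="$vram_mb" ;;                 # GPU-only: VRAM is the limit
    hybrid) total=$((ram_mb + vram_mb)) ;;      # CPU + GPU: combined memory
    cpu)    total="$ram_mb" ;;                  # CPU-only: system RAM
    *)      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  budget=$((total - headroom_mb))
  # Never report a negative budget.
  if [ "$budget" -lt 0 ]; then
    budget=0
  fi
  echo "$budget"
}
```

On Linux, `free -m` and `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` are two common ways to obtain the RAM and VRAM figures to pass in, for example `quant_budget_mb gpu 16000 8192`.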

Step 2: Select Quantization Level

Recommend a quantization based on the user's hardware and priorities:

**High quality (recommended for most users):**

`Q6_K_L` (1.58GB) - Very high quality, uses Q8_0 for embed/output weights

`Q6_K` (1.46GB) - Very high quality, near perfect

`Q5_K_M` (1.29GB) - High quality, good balance

**Balanced (good quality, smaller size):**

`Q4_K_M` (1.12GB) - Default recommendation for most use cases

`Q4_K_S` (1.07GB) - Slightly lower quality with more space savings

`IQ4_XS` (1.02GB) - Decent quality, smaller than Q4_K_S

**Low resource (for limited RAM):**

`Q3_K_M` (0.92GB) - Low quality but usable

`IQ3_M` (0.88GB) - Medium-low quality, comparable to Q3_K_M

`Q2_K` (0.75GB) - Very low quality but surprisingly usable

**Note**: For ARM or AVX systems, Q4_0 and IQ4_NL support online weight repacking for better performance.
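
As a rough illustration, the list above can be turned into a lookup that returns the largest quant fitting a memory budget. `pick_quant` is a hypothetical helper. The sizes are the GB figures listed above multiplied by 1000. It compares file size only, so any headroom for context and runtime buffers should already be subtracted from the budget.

```bash
# Hypothetical helper: print the largest listed quant whose file size
# fits within a budget given in MB.
pick_quant() {
  local budget_mb="$1"
  local entry name size_mb
  # Ordered from largest to smallest, so the first fit is the best fit.
  for entry in Q6_K_L:1580 Q6_K:1460 Q5_K_M:1290 Q4_K_M:1120 Q4_K_S:1070 \
               IQ4_XS:1020 Q3_K_M:920 IQ3_M:880 Q2_K:750; do
    name="${entry%%:*}"      # text before the colon
    size_mb="${entry##*:}"   # text after the colon
    if [ "$size_mb" -le "$budget_mb" ]; then
      echo "$name"
      return 0
    fi
  done
  echo "no listed quant fits in ${budget_mb} MB" >&2
  return 1
}
```

For example, `pick_quant 1200` selects `Q4_K_M`, because the three larger quants exceed 1200MB.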

Step 3: Install Prerequisites

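Ensure the user has a way to run the model (Option A or B). The Hugging Face CLI (Option C) is only needed for the command-line download in Step 4: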

**Option A: llama.cpp (recommended)**

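```bash
# Clone and build llama.cpp (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Binaries such as llama-cli are written to build/bin/
```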

**Option B: LM Studio**

Download from https://lmstudio.ai/

No build required, GUI-based

**Option C: Hugging Face CLI**

```bash

pip install -U "huggingface_hub[cli]"

```

Step 4: Download the Model

Provide the appropriate download command based on the selected quantization:

**Single file download (most quants):**

```bash
huggingface-cli download bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF \
  --include "DeepScaleR-1.5B-Preview-Q4_K_M.gguf" \
  --local-dir ./models
```

**Split file download (only for quants stored as a folder of parts, typically files over 50GB; no quant of this 1.5B model is that large):**

```bash
huggingface-cli download bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF \
  --include "DeepScaleR-1.5B-Preview-Q8_0/*" \
  --local-dir ./models
```

Replace `Q4_K_M` with the user's chosen quantization.
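
The substitution can be sketched as a helper that prints the download command for a given quant without running it. `download_cmd` is a hypothetical name. The file naming pattern is copied from the example above and is only assumed to hold for the other quants, so check the repository's file list before relying on it.

```bash
# Hypothetical helper: print (not run) the download command for one quant.
# Arguments: quant name, optional destination directory (default ./models).
download_cmd() {
  local quant="$1" dest="${2:-./models}"
  local repo="bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF"
  # File name pattern assumed from the Q4_K_M example above.
  printf 'huggingface-cli download %s --include "%s" --local-dir %s\n' \
    "$repo" "DeepScaleR-1.5B-Preview-${quant}.gguf" "$dest"
}
```

Running `download_cmd Q5_K_M` prints the command so the user can review it before executing it.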

Step 5: Configure Prompt Format

The model uses a specific prompt format:

```

<｜begin▁of▁sentence｜>{system_prompt}<｜User｜>{prompt}<｜Assistant｜><｜end▁of▁sentence｜><｜Assistant｜>

```

**Example system prompt for math problems:**

```

<｜begin▁of▁sentence｜>You are a helpful AI assistant specialized in solving mathematical problems. Show your reasoning step-by-step.<｜User｜>Solve: What is the integral of x^2 from 0 to 1?<｜Assistant｜><｜end▁of▁sentence｜><｜Assistant｜>

```
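
Typing the special tokens by hand is error-prone, so one option is a small helper that assembles the string. This is a minimal sketch that reproduces the format shown above verbatim; `build_prompt` is a hypothetical name, not part of llama.cpp.

```bash
# Hypothetical helper: assemble a prompt in the format shown above.
# Arguments: system prompt, user prompt.
build_prompt() {
  local system_prompt="$1" user_prompt="$2"
  # printf '%s' keeps the prompts literal (no escape or format expansion).
  printf '%s%s%s%s%s' \
    '<｜begin▁of▁sentence｜>' "$system_prompt" \
    '<｜User｜>' "$user_prompt" \
    '<｜Assistant｜><｜end▁of▁sentence｜><｜Assistant｜>'
}
```

The result can be passed straight to the `-p` flag, for example `-p "$(build_prompt "Show your reasoning step-by-step." "What is 12 * 13?")"`.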

Step 6: Run the Model

**Using llama.cpp:**

```bash
./llama-cli \
  -m ./models/DeepScaleR-1.5B-Preview-Q4_K_M.gguf \
  -p "<｜begin▁of▁sentence｜>You are a helpful assistant.<｜User｜>Hello!<｜Assistant｜><｜end▁of▁sentence｜><｜Assistant｜>" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9
```

**Using LM Studio:**

1. Open LM Studio

2. Search for "bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF"

3. Download your chosen quantization

4. Load the model and configure the prompt format in settings

5. Start chatting

Step 7: Optimize Performance

**For NVIDIA GPUs (cuBLAS):**

```bash

./llama-cli -m model.gguf -ngl 99 -p "prompt"

```

(`-ngl 99` offloads all layers to GPU)

**For AMD GPUs (ROCm):**

Use the ROCm-enabled build of llama.cpp or LM Studio ROCm preview.

**For Apple Silicon:**

Metal acceleration is automatic. Consider IQ4_NL for ARM-optimized performance.

**For CPU (AVX2/AVX512):**

Q4_0 and IQ4_NL support online weight repacking for better performance.

Usage Examples

**Example 1: Math problem solving**

```bash
./llama-cli -m DeepScaleR-1.5B-Preview-Q5_K_M.gguf \
  -p "<｜begin▁of▁sentence｜>Solve this math problem step by step.<｜User｜>A rectangle has a perimeter of 20 cm and an area of 24 cm². What are its dimensions?<｜Assistant｜><｜end▁of▁sentence｜><｜Assistant｜>" \
  -n 1024
```

**Example 2: Interactive mode**

```bash

./llama-cli -m DeepScaleR-1.5B-Preview-Q4_K_M.gguf -i --interactive-first

```

Important Notes

This model is specialized for mathematical reasoning. For general-purpose tasks, consider other models.

The model is released under the MIT license, which permits commercial use.

Quantization levels below Q3 may show significant quality degradation.

For optimal math performance, use Q5_K_M or higher quantization.

Online weight repacking (Q4_0, IQ4_NL) requires llama.cpp build b4282 or later.

The model file size ranges from 0.75GB (Q2_K) to 7.11GB (f32).

Troubleshooting

**Issue: Out of memory errors**

Solution: Use a smaller quantization (Q3_K_M, Q2_K)

**Issue: Slow inference**

Solution: Enable GPU offloading with `-ngl`, or use a smaller quant

**Issue: Poor output quality**

Solution: Use a higher quantization (Q5_K_M or Q6_K), check prompt format

**Issue: Cannot load Q4_0_X_X files**

Solution: Update llama.cpp to b4282+ and use Q4_0 instead (supports online repacking)
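
To check the build requirement before recommending Q4_0 or IQ4_NL, a comparison like the following could be used. `build_at_least` is a hypothetical helper. It assumes the build is available as a tag or number such as `b4282` or `4282`; how that number is obtained (for example from the output of `llama-cli --version`) depends on the installation.

```bash
# Hypothetical helper: succeed if a build tag or number is at least the
# required build (default 4282, the minimum for online repacking noted above).
build_at_least() {
  local build="${1#b}" required="${2:-4282}"   # strip an optional leading "b"
  case "$build" in
    ''|*[!0-9]*) echo "not a build number: $1" >&2; return 2 ;;
  esac
  [ "$build" -ge "$required" ]
}
```

For example, `build_at_least b4282 && echo "online repacking supported"` prints the message, while `build_at_least b4000` returns a non-zero status.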
