GGUF quantized version of Llama 3 70B optimized for tool use and function calling. Includes multiple quantization levels from IQ1_S (15.4GB) to Q6_K (58GB) to balance quality and performance based on available hardware.
This skill provides access to the GGUF quantized version of Meta's Llama 3 70B model, specifically fine-tuned by Groq for tool use and function calling capabilities. The model is available in multiple quantization levels to accommodate different hardware configurations and quality requirements.
The Llama-3-Groq-70B-Tool-Use model is a 70-billion-parameter language model optimized for tool use, function calling, and structured JSON output.
This GGUF quantized version by mradermacher includes weighted/imatrix quantizations for improved quality at lower bit depths.
The model is available in multiple quantization levels, trading file size against output quality, from IQ1_S (15.4GB) at the small end up to Q6_K (58GB); the weighted/imatrix (IQ) quants generally hold up better than classic quants of similar size.
When a user requests to use this model for tool use or function calling:
1. **Assess Hardware Requirements**
- Ask about available RAM/VRAM
- Recommend quantization level based on resources:
- 16GB RAM: IQ1_S or IQ1_M only (~15-17GB; expect noticeably degraded quality at these sizes)
- 32GB RAM: IQ2 or IQ3 quants (roughly 21-31GB)
- 48GB RAM: IQ4_XS, Q4_K_S, or Q4_K_M (~38-43GB)
- 64GB+ RAM: Q5_K_M or Q6_K
- In every case, leave headroom beyond the file size for the KV cache and the OS
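A quick way to check what a machine can hold (Linux; the GPU query applies only to systems with an NVIDIA driver installed):

```shell
# Total physical RAM in GiB (MemTotal is reported in kB)
awk '/MemTotal/ {printf "RAM: %.1f GiB\n", $2 / 1048576}' /proc/meminfo

# Per-GPU VRAM, if an NVIDIA driver is present (skipped otherwise)
if command -v nvidia-smi > /dev/null; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
fi
```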
2. **Download the Model**
- Direct user to HuggingFace repository: `mradermacher/Llama-3-Groq-70B-Tool-Use-i1-GGUF`
- For Q6_K, explain that the file ships in multiple parts that must all be downloaded and then concatenated
- Provide download command examples for their runtime
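With the `huggingface-cli` tool (from the `huggingface_hub` package), the download might look like the following. The Q4_K_M filename matches the Modelfile example below, but the Q6_K part names and part count are assumptions — check the repository's file list before running:

```shell
# Single-file quant (Q4_K_M, ~43GB)
huggingface-cli download mradermacher/Llama-3-Groq-70B-Tool-Use-i1-GGUF \
  Llama-3-Groq-70B-Tool-Use.i1-Q4_K_M.gguf --local-dir .

# Q6_K ships in parts; download them all, then join them with cat
huggingface-cli download mradermacher/Llama-3-Groq-70B-Tool-Use-i1-GGUF \
  Llama-3-Groq-70B-Tool-Use.i1-Q6_K.gguf.part1of2 \
  Llama-3-Groq-70B-Tool-Use.i1-Q6_K.gguf.part2of2 --local-dir .
cat Llama-3-Groq-70B-Tool-Use.i1-Q6_K.gguf.part* \
  > Llama-3-Groq-70B-Tool-Use.i1-Q6_K.gguf
```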
3. **Configure Runtime**
- For llama.cpp: Provide context size, GPU layers, and thread configuration
- For Ollama: Create Modelfile with appropriate parameters
- For LM Studio: Import model and set context/temperature
- For GPT4All: Load model with recommended settings
4. **Set Up for Tool Use**
- Configure system prompt for function calling
- Define function schemas in JSON format
- Set appropriate temperature (0.1-0.3 for structured output)
- Enable JSON mode if supported by runtime
5. **Test Function Calling**
- Provide example function definitions
- Test with sample queries requiring tool use
- Validate structured output parsing
6. **Optimize Performance**
- Adjust context window based on use case
- Configure GPU offloading if available
- Set appropriate batch size for inference
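For llama.cpp, the knobs in steps 3 and 6 map directly onto `llama-server` flags. A sketch — the flag values are illustrative and should be tuned to your hardware:

```shell
# Serve the Q4_K_M quant via llama.cpp's OpenAI-compatible server.
# -c    context window in tokens
# -ngl  transformer layers offloaded to the GPU (0 = CPU only)
# -t    CPU threads
# -b    batch size for prompt processing
llama-server \
  -m ./Llama-3-Groq-70B-Tool-Use.i1-Q4_K_M.gguf \
  -c 8192 \
  -ngl 40 \
  -t 8 \
  -b 512 \
  --port 8080
```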
Example Ollama Modelfile (save as `Modelfile`, then register it with `ollama create llama3-groq-tool-use -f Modelfile`):

```
FROM ./Llama-3-Groq-70B-Tool-Use.i1-Q4_K_M.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM You are a helpful assistant with access to functions.
```
Example function schema, in the OpenAI-style JSON format:

```json
{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {"type": "string"},
      "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
    },
    "required": ["location"]
  }
}
```
Example tool-use exchange (the Groq tool-use models emit calls wrapped in `<tool_call>` tags):

```
User: What's the weather in San Francisco?
Model: <tool_call>{"name": "get_weather", "arguments": {"location": "San Francisco", "unit": "fahrenheit"}}</tool_call>
```
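An exchange like the one above can be driven programmatically against llama.cpp's OpenAI-compatible HTTP server. A sketch, assuming `llama-server` is already running on port 8080; the endpoint path and the tag-based prompting convention are assumptions to verify against your runtime:

```shell
# Assumes llama-server is up on localhost:8080 (see the runtime step).
# The function schema is described in the system prompt; a low
# temperature keeps the structured output stable.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0.2,
    "messages": [
      {"role": "system",
       "content": "You are a helpful assistant with access to functions. Available: get_weather(location, unit). When a function is needed, reply only with <tool_call>{...}</tool_call>."},
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ]
  }'
```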