Evaluates Model Context Protocol (MCP) function calls for correctness using a fine-tuned Qwen2.5-0.5B model. Each call is classified as correct, or as using the wrong tool, incorrect parameter names, or incorrect parameter values.
This skill uses the `quotientai/limbic-tool-use-0.5B-32K` model (or its GGUF quantized version) to evaluate function calls against available MCP tools. The model is a LoRA fine-tuned version of Qwen2.5-0.5B-Instruct specifically trained on MCP server tools data with synthetic augmentation.
When a user requests evaluation of MCP function calls:
1. **Load the Model**
- If not already loaded, load the tokenizer and model with transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
```
- For GGUF format, use llama.cpp with the quantized model from `Mungert/limbic-tool-use-0.5B-32K-GGUF`
2. **Prepare the Evaluation Prompt**
- Use the following system prompt:
```
You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score.
```
- Format the user prompt with this template:
```
# TOOL CALL EVALUATION RUBRIC
## EVALUATION CRITERIA
### 1. TOOL SELECTION
- [ ] Function name exists in available tools
- [ ] Function purpose matches user intent
### 2. PARAMETER STRUCTURE
- [ ] All required and relevant parameters are present
- [ ] No hallucinated parameter names
- [ ] Parameter names match tool schema exactly
### 3. PARAMETER VALUES
- [ ] Data types match expected types
- [ ] Values align with user request
- [ ] No fabricated or incorrect values
## CLASSIFICATION RULES
- All criteria passed → `correct`
- Failed criteria 1 → `incorrect_tool`
- Failed criteria 2 → `incorrect_parameter_names`
- Failed criteria 3 → `incorrect_parameter_values`
---
### AVAILABLE TOOLS
{available_tools}
---
### MESSAGE HISTORY
{message_history}
---
## OUTPUT REQUIREMENT
{
"score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
"reason": < [if incorrect, provide a brief list of reasons] >
}
### EVALUATION:
```
3. **Format Input Data**
- `available_tools`: List of tool schemas (JSON format) containing name, description, and input_schema
- `message_history`: List of messages showing the user request and assistant's function call response
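The formatting step above can be sketched as follows. The tool schema and message history shown are hypothetical placeholders for illustration; `USER_PROMPT_TEMPLATE` stands for the rubric template shown earlier:

```python
import json

# Hypothetical example tool schema and message history (not from the model's data)
available_tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

message_history = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"name": "get_weather", "arguments": {"city": "Paris"}},
    ]},
]

def format_user_prompt(template, tools, history):
    """Substitute tool schemas and message history into the rubric template."""
    return (template
            .replace("{available_tools}", json.dumps(tools, indent=2))
            .replace("{message_history}", json.dumps(history, indent=2)))
```

Using `str.replace` rather than `str.format` avoids conflicts with the literal braces in the template's `OUTPUT REQUIREMENT` section.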
4. **Generate Evaluation**
- Apply the chat template:
```python
import torch

chat_template = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": formatted_user_prompt}
]
text = tokenizer.apply_chat_template(chat_template, tokenize=False, add_generation_prompt=True)

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)

outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)
# Decode only the newly generated tokens
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```
5. **Parse and Return Results**
- The model outputs JSON with:
- `score`: One of `correct`, `incorrect_tool`, `incorrect_parameter_names`, `incorrect_parameter_values`
- `reason`: Array of reasons if incorrect (empty if correct)
- Present the evaluation results to the user in a clear format
Return evaluation results as:
```json
{
"score": "correct|incorrect_tool|incorrect_parameter_names|incorrect_parameter_values",
"reason": ["list of reasons if incorrect"]
}
```
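A minimal parsing sketch for step 5, assuming the model's completion contains a single JSON object (the exact formatting of the completion may vary between generations):

```python
import json
import re

VALID_SCORES = {"correct", "incorrect_tool",
                "incorrect_parameter_names", "incorrect_parameter_values"}

def parse_evaluation(raw_output):
    """Extract and validate the evaluation JSON from the model's completion."""
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    result = json.loads(match.group(0))
    if result.get("score") not in VALID_SCORES:
        raise ValueError(f"unexpected score: {result.get('score')!r}")
    # Normalize: reason should always be a list (empty when the call is correct)
    result.setdefault("reason", [])
    return result
```

Validating the `score` field against the four known labels catches the occasional malformed generation before the result is shown to the user.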
**Example output:**
```json
{
"score": "correct",
"reason": []
}
```