Evaluates Model Context Protocol (MCP) function calls for correctness using a fine-tuned Qwen2.5-0.5B model. Each call is classified as correct, or as using the wrong tool, incorrect parameter names, or incorrect parameter values.
This skill uses the `quotientai/limbic-tool-use-0.5B-32K` model (or its GGUF quantized version) to evaluate function calls against available MCP tools. The model is a LoRA fine-tuned version of Qwen2.5-0.5B-Instruct specifically trained on MCP server tools data with synthetic augmentation.
When a user requests evaluation of MCP function calls:
1. **Load the Model**
- If not already loaded, load the tokenizer and model with transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
```
- For GGUF format, use llama.cpp with the quantized model from `Mungert/limbic-tool-use-0.5B-32K-GGUF`
2. **Prepare the Evaluation Prompt**
- Use the following system prompt:
```
You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score.
```
- Format the user prompt with this template:
```
# TOOL CALL EVALUATION RUBRIC
## EVALUATION CRITERIA
### 1. TOOL SELECTION
- [ ] Function name exists in available tools
- [ ] Function purpose matches user intent
### 2. PARAMETER STRUCTURE
- [ ] All required and relevant parameters are present
- [ ] No hallucinated parameter names
- [ ] Parameter names match tool schema exactly
### 3. PARAMETER VALUES
- [ ] Data types match expected types
- [ ] Values align with user request
- [ ] No fabricated or incorrect values
## CLASSIFICATION RULES
- All criteria passed → `correct`
- Failed criteria 1 → `incorrect_tool`
- Failed criteria 2 → `incorrect_parameter_names`
- Failed criteria 3 → `incorrect_parameter_values`
---
### AVAILABLE TOOLS
{available_tools}
---
### MESSAGE HISTORY
{message_history}
---
## OUTPUT REQUIREMENT
{
"score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
"reason": < [if incorrect, provide a brief list of reasons] >
}
### EVALUATION:
```
3. **Format Input Data**
- `available_tools`: List of tool schemas (JSON format) containing name, description, and input_schema
- `message_history`: List of messages showing the user request and assistant's function call response
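The formatting step above can be sketched as follows. The tool schema and message history shown are hypothetical placeholders for illustration; `USER_PROMPT_TEMPLATE` stands for the rubric template shown earlier:

```python
import json

# Hypothetical example tool schema and message history (not from the model's data)
available_tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

message_history = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"name": "get_weather", "arguments": {"city": "Paris"}},
    ]},
]

def format_user_prompt(template, tools, history):
    """Substitute tool schemas and message history into the rubric template."""
    return (template
            .replace("{available_tools}", json.dumps(tools, indent=2))
            .replace("{message_history}", json.dumps(history, indent=2)))
```

Using `str.replace` rather than `str.format` avoids conflicts with the literal braces in the template's `OUTPUT REQUIREMENT` section.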
4. **Generate Evaluation**
- Apply the chat template:
```python
import torch

chat_template = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": formatted_user_prompt}
]
text = tokenizer.apply_chat_template(chat_template, tokenize=False, add_generation_prompt=True)

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)

outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)
# Decode only the newly generated tokens
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```
5. **Parse and Return Results**
- The model outputs JSON with:
- `score`: One of `correct`, `incorrect_tool`, `incorrect_parameter_names`, `incorrect_parameter_values`
- `reason`: Array of reasons if incorrect (empty if correct)
- Present the evaluation results to the user in a clear format
Return evaluation results as:
```json
{
"score": "correct|incorrect_tool|incorrect_parameter_names|incorrect_parameter_values",
"reason": ["list of reasons if incorrect"]
}
```
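A minimal parsing sketch for step 5, assuming the model's completion contains a single JSON object (the exact formatting of the completion may vary between generations):

```python
import json
import re

VALID_SCORES = {"correct", "incorrect_tool",
                "incorrect_parameter_names", "incorrect_parameter_values"}

def parse_evaluation(raw_output):
    """Extract and validate the evaluation JSON from the model's completion."""
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    result = json.loads(match.group(0))
    if result.get("score") not in VALID_SCORES:
        raise ValueError(f"unexpected score: {result.get('score')!r}")
    # Normalize: reason should always be a list (empty when the call is correct)
    result.setdefault("reason", [])
    return result
```

Validating the `score` field against the four known labels catches the occasional malformed generation before the result is shown to the user.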
**Example output:**
```json
{
"score": "correct",
"reason": []
}
```