Evaluate function calls in Model Context Protocol (MCP) tools using a specialized fine-tuned Qwen2.5-0.5B model. This skill helps you assess whether function calls are correct, use the right tools, have proper parameter names, and contain appropriate parameter values.
This skill uses the `quotientai/limbic-tool-use-0.5B-32K` model to evaluate MCP function calls against four key criteria:
1. **Tool Selection**: Verifies the function name exists and matches user intent
2. **Parameter Structure**: Ensures all required parameters are present with correct names
3. **Parameter Values**: Validates data types and value appropriateness
4. **Classification**: Returns one of four scores: `correct`, `incorrect_tool`, `incorrect_parameter_names`, or `incorrect_parameter_values`
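For example, if a tool schema defines a `devId` parameter but the call passes `developer_id`, the model scores it `incorrect_parameter_names`. An illustrative (not verbatim) evaluation result, shown here as a Python dict:

```python
{
    "score": "incorrect_parameter_names",
    "reason": ["'developer_id' is not a valid parameter name; the schema expects 'devId'"]
}
```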
First, ensure you have access to the Hugging Face model. Install the required dependencies:
```bash
pip install transformers torch
```
Load the model and tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
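
# Optional: move the model to GPU when one is available; at 0.5B parameters
# it also runs acceptably on CPU
import torch
model = model.to("cuda" if torch.cuda.is_available() else "cpu")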
```
Define the system prompt and evaluation prompt template:
```python
SYSTEM_PROMPT = "You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score."
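
# {available_tools} and {message_history} below are filled in via str.format;
# the doubled braces are format-escapes that render as literal braces,
# showing the model the exact JSON shape it should emit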
EVALUATOR_PROMPT = """\
---
{available_tools}
---
{message_history}
---
{{
"score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
"reason": < [if incorrect, provide a brief list of reasons] >
}}
"""
```
Structure your available tools and message history:
```python
import json
available_tools = [
    {
        "name": "google-play-developer",
        "description": "Get apps by a developer on Google Play",
        "input_schema": {
            "type": "object",
            "properties": {
                "devId": {"type": "string", "description": "Developer ID"},
                "num": {"type": "number", "default": 60, "description": "Number of results"},
                "lang": {"type": "string", "default": "en", "description": "Language code"},
                "country": {"type": "string", "default": "us", "description": "Country code"}
            },
            "required": ["devId"]
        }
    }
]

message_history = [
    {
        "role": "user",
        "content": "Get the top 50 apps from 'Example Developer' in English for the US market"
    },
    {
        "role": "assistant",
        "content": {
            "function": "google-play-developer",
            "arguments": {
                "devId": "com.example.developer",
                "num": 50,
                "lang": "en",
                "country": "us"
            }
        }
    }
]

formatted_prompt = EVALUATOR_PROMPT.format(
    available_tools=json.dumps(available_tools, indent=2),
    message_history=json.dumps(message_history, indent=2)
)
```
Create the chat template and generate the evaluation:
```python
chat_template = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": formatted_prompt}
]

text = tokenizer.apply_chat_template(
    chat_template,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
result = model.generate(**inputs, max_new_tokens=128, use_cache=True)
# Decode only the newly generated tokens so the echoed prompt doesn't
# pollute the JSON verdict
evaluation = tokenizer.decode(result[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(evaluation)
```
The model responds with a JSON object. Parse it to extract the score and any reasons:
```python
import json
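# Note: small models occasionally emit malformed JSON; json.loads raises
# json.JSONDecodeError in that case, so consider a try/except in production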
evaluation_json = json.loads(evaluation)
score = evaluation_json["score"] # One of: correct, incorrect_tool, incorrect_parameter_names, incorrect_parameter_values
reason = evaluation_json.get("reason", [])
if score == "correct":
    print("✓ Function call is correct")
else:
    print(f"✗ Function call has issues: {score}")
    for r in reason:
        print(f"  - {r}")
```
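The usage examples below assume a small convenience wrapper around the steps above. Here is a minimal sketch; the helper name `evaluate_function_call` and its dict return shape are illustrative choices for this guide, not part of the model's API:

```python
def evaluate_function_call(tools, message_history):
    """Run the limbic evaluator over one conversation and return its parsed verdict."""
    prompt = EVALUATOR_PROMPT.format(
        available_tools=json.dumps(tools, indent=2),
        message_history=json.dumps(message_history, indent=2),
    )
    chat = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    result = model.generate(**inputs, max_new_tokens=128, use_cache=True)
    output = tokenizer.decode(result[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return json.loads(output)  # e.g. {"score": "correct", "reason": []}
```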
When building MCP servers, use this skill to automatically validate that your function calls match the tool schemas:
```python
# mcp_server_tools follows the same schema format as available_tools above
result = evaluate_function_call(
    tools=mcp_server_tools,
    message_history=[
        {"role": "user", "content": "Fetch weather for San Francisco"},
        {"role": "assistant",
         "content": {"function": "get_weather",
                     "arguments": {"city": "San Francisco"}}},
    ],
)
```
Evaluate function calls made by AI agents to ensure they're using tools correctly:
```python
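# agent_history and log_error are placeholders for your agent framework's
# call log and error reporting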
for agent_call in agent_history:
    evaluation = evaluate_function_call(
        tools=available_tools,
        message_history=agent_call.history
    )
    if evaluation["score"] != "correct":
        log_error(f"Agent error: {evaluation['reason']}")
```
Validate function call examples in training datasets:
```python
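# flag_for_review is a placeholder for your dataset-curation workflow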
for example in training_dataset:
    eval_result = evaluate_function_call(
        tools=example["tools"],
        message_history=example["conversation"]
    )
    if eval_result["score"] != "correct":
        flag_for_review(example, eval_result["reason"])
```