Evaluate function calls in Model Context Protocol (MCP) tools using a specialized fine-tuned Qwen2.5-0.5B model. This skill helps you assess whether function calls are correct, use the right tools, have proper parameter names, and contain appropriate parameter values.
This skill uses the `quotientai/limbic-tool-use-0.5B-32K` model to evaluate MCP function calls against four key criteria:
1. **Tool Selection**: Verifies the function name exists and matches user intent
2. **Parameter Structure**: Ensures all required parameters are present with correct names
3. **Parameter Values**: Validates data types and value appropriateness
4. **Classification**: Returns one of four scores: `correct`, `incorrect_tool`, `incorrect_parameter_names`, or `incorrect_parameter_values`
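For example, if a tool schema defines a `devId` parameter but the call passes `developer_id`, the model scores it `incorrect_parameter_names`. An illustrative (not verbatim) evaluation result, shown here as a Python dict:

```python
{
    "score": "incorrect_parameter_names",
    "reason": ["'developer_id' is not a valid parameter name; the schema expects 'devId'"]
}
```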
First, ensure you have access to the Hugging Face model. Install the required dependencies:
```bash
pip install transformers torch
```
Load the model and tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
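
# Optional: move the model to GPU when one is available; at 0.5B parameters
# it also runs acceptably on CPU
import torch
model = model.to("cuda" if torch.cuda.is_available() else "cpu")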
```
Define the system prompt and evaluation prompt template:
```python
SYSTEM_PROMPT = "You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score."
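
# {available_tools} and {message_history} below are filled in via str.format;
# the doubled braces are format-escapes that render as literal braces,
# showing the model the exact JSON shape it should emit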
EVALUATOR_PROMPT = """\
---
{available_tools}
---
{message_history}
---
{{
"score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
"reason": < [if incorrect, provide a brief list of reasons] >
}}
"""
```
Structure your available tools and message history:
```python
import json
available_tools = [
    {
        "name": "google-play-developer",
        "description": "Get apps by a developer on Google Play",
        "input_schema": {
            "type": "object",
            "properties": {
                "devId": {"type": "string", "description": "Developer ID"},
                "num": {"type": "number", "default": 60, "description": "Number of results"},
                "lang": {"type": "string", "default": "en", "description": "Language code"},
                "country": {"type": "string", "default": "us", "description": "Country code"}
            },
            "required": ["devId"]
        }
    }
]

message_history = [
    {
        "role": "user",
        "content": "Get the top 50 apps from 'Example Developer' in English for the US market"
    },
    {
        "role": "assistant",
        "content": {
            "function": "google-play-developer",
            "arguments": {
                "devId": "com.example.developer",
                "num": 50,
                "lang": "en",
                "country": "us"
            }
        }
    }
]

formatted_prompt = EVALUATOR_PROMPT.format(
    available_tools=json.dumps(available_tools, indent=2),
    message_history=json.dumps(message_history, indent=2)
)
```
Create the chat template and generate the evaluation:
```python
chat_template = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": formatted_prompt}
]

text = tokenizer.apply_chat_template(
    chat_template,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
result = model.generate(**inputs, max_new_tokens=128, use_cache=True)
# Decode only the newly generated tokens so the echoed prompt doesn't
# pollute the JSON verdict
evaluation = tokenizer.decode(result[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(evaluation)
```
The model responds with a JSON object. Parse it to extract the score and any reasons:
```python
import json
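# Note: small models occasionally emit malformed JSON; json.loads raises
# json.JSONDecodeError in that case, so consider a try/except in production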
evaluation_json = json.loads(evaluation)
score = evaluation_json["score"] # One of: correct, incorrect_tool, incorrect_parameter_names, incorrect_parameter_values
reason = evaluation_json.get("reason", [])
if score == "correct":
    print("✓ Function call is correct")
else:
    print(f"✗ Function call has issues: {score}")
    for r in reason:
        print(f"  - {r}")
```
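The usage examples below assume a small convenience wrapper around the steps above. Here is a minimal sketch; the helper name `evaluate_function_call` and its dict return shape are illustrative choices for this guide, not part of the model's API:

```python
def evaluate_function_call(tools, message_history):
    """Run the limbic evaluator over one conversation and return its parsed verdict."""
    prompt = EVALUATOR_PROMPT.format(
        available_tools=json.dumps(tools, indent=2),
        message_history=json.dumps(message_history, indent=2),
    )
    chat = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    result = model.generate(**inputs, max_new_tokens=128, use_cache=True)
    output = tokenizer.decode(result[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return json.loads(output)  # e.g. {"score": "correct", "reason": []}
```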
When building MCP servers, use this skill to automatically validate that your function calls match the tool schemas:
```python
# mcp_server_tools follows the same schema format as available_tools above
result = evaluate_function_call(
    tools=mcp_server_tools,
    message_history=[
        {"role": "user", "content": "Fetch weather for San Francisco"},
        {"role": "assistant",
         "content": {"function": "get_weather",
                     "arguments": {"city": "San Francisco"}}},
    ],
)
```
Evaluate function calls made by AI agents to ensure they're using tools correctly:
```python
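# agent_history and log_error are placeholders for your agent framework's
# call log and error reporting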
for agent_call in agent_history:
    evaluation = evaluate_function_call(
        tools=available_tools,
        message_history=agent_call.history
    )
    if evaluation["score"] != "correct":
        log_error(f"Agent error: {evaluation['reason']}")
```
Validate function call examples in training datasets:
```python
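# flag_for_review is a placeholder for your dataset-curation workflow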
for example in training_dataset:
    eval_result = evaluate_function_call(
        tools=example["tools"],
        message_history=example["conversation"]
    )
    if eval_result["score"] != "correct":
        flag_for_review(example, eval_result["reason"])
```