Run Microsoft's Phi-4-mini-instruct model locally with full function calling and tool use support through llama-server (llama-cpp-python). This provides a ChatGPT-compatible API for integrating local LLMs with external tools and APIs.
This skill sets up and runs a modified version of Phi-4-mini-instruct that supports OpenAI-style tool/function calling through a local, ChatGPT-compatible API.
Before running this skill, ensure you have a working Python installation with `pip`, plus enough disk space and RAM for the quantized model (the Q4_K_M GGUF used below is a few gigabytes).
First, install the required Python package with server capabilities:
```bash
pip install 'llama-cpp-python[server]'
```
**Important:** The `[server]` extra is required for the `llama-server` command.
Download the Phi-4-mini-instruct GGUF model file, for example from a trusted Hugging Face repository (see the security warning below before choosing a source).
Place the model file in a `models/` directory:
```bash
mkdir -p models
```
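Once downloaded, you can quickly verify the file is where the server expects it. A minimal check, using the same filename as the launch command below:

```shell
MODEL="models/Phi-4-mini-instruct-Q4_K_M-function_calling.gguf"
if [ -f "$MODEL" ]; then
  echo "found: $MODEL"
else
  echo "missing: $MODEL - download it before starting the server"
fi
```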
Launch the server with the following command:
```bash
llama-server \
--model models/Phi-4-mini-instruct-Q4_K_M-function_calling.gguf \
--port 8080 \
--jinja
```
**Flags explained:**
- `--model`: path to the GGUF model file to load.
- `--port`: TCP port the HTTP API listens on (8080 here).
- `--jinja`: enables Jinja chat-template processing, which this model's embedded template requires for tool/function calling.
The server will start and display initialization messages. Once the log reports that it is listening on port 8080, it's ready.
#### Example 1: Function Calling with Python Tool
Test tool/function calling capabilities:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
"model": "phi-4-mini-instruct-with-tools",
"tools": [
{
"type": "function",
"function": {
"name": "python",
"description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
"parameters": {
"type": "object",
"properties": {
"code": {
"type": "string",
"description": "The code to run in the ipython interpreter."
}
},
"required": ["code"]
}
}
}
],
"messages": [
{
"role": "user",
"content": "Print a hello world message with python."
}
]
}'
```
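The same request can be built and sent from Python instead of curl. A minimal sketch using only the standard library (it assumes the server from the previous step is running on `localhost:8080`, so the actual send is left commented out):

```python
import json

# Same tool schema as the curl example above.
tools = [{
    "type": "function",
    "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter and returns "
                       "the result of the execution after 60 seconds.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "The code to run in the ipython interpreter.",
                }
            },
            "required": ["code"],
        },
    },
}]

payload = {
    "model": "phi-4-mini-instruct-with-tools",
    "tools": tools,
    "messages": [
        {"role": "user", "content": "Print a hello world message with python."}
    ],
}
body = json.dumps(payload)

# To actually send it (requires the server to be running):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8080/v1/chat/completions",
#       data=body.encode(), headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```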
#### Example 2: Simple Chat Completion
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini-instruct-with-tools",
"messages": [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Explain what function calling means in LLMs"}
]
}'
```
#### Example 3: Code Generation
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini-instruct-with-tools",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant"},
{"role": "user", "content": "give me an html hello world document"}
]
}'
```
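When a request like Example 1 triggers a tool call, the assistant message in the response carries a `tool_calls` array instead of plain `content` (this follows the OpenAI chat-completions response format). A sketch of the client-side round trip, with a mocked response and a hypothetical `execute_code` runner standing in for a real sandboxed interpreter:

```python
import json

def handle_response(message, execute_code):
    """Run any requested tool calls and build the follow-up messages."""
    if not message.get("tool_calls"):
        return message.get("content")  # ordinary text reply, nothing to run
    follow_up = [message]  # echo the assistant turn back first
    for call in message["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = execute_code(args["code"])  # hypothetical sandboxed runner
        follow_up.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": str(result),
        })
    # These messages get appended and sent back in the next
    # /v1/chat/completions request so the model can use the result.
    return follow_up

# Mocked assistant message, shaped like a real tool-calling response:
mock = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {
            "name": "python",
            "arguments": json.dumps({"code": "print('hello world')"}),
        },
    }],
}
messages = handle_response(mock, execute_code=lambda code: "hello world")
```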
The server provides ChatGPT-compatible endpoints:
- `POST /v1/chat/completions`: chat completions (used in the examples above)
- `POST /v1/completions`: plain text completions
- `GET /v1/models`: lists the loaded model
You can configure AI coding assistants to use this local endpoint instead of OpenAI's API. Most tools support custom OpenAI-compatible endpoints.
Example configuration:
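The exact setting names vary by tool, but many tools and the official OpenAI client libraries read these environment variables, so a shell configuration like the following is a common starting point (the key value is a placeholder, since llama-server does not validate API keys unless started with `--api-key`):

```shell
# Point OpenAI-compatible tools at the local server instead of api.openai.com.
export OPENAI_BASE_URL="http://localhost:8080/v1"
# Placeholder value; the local server accepts any key by default.
export OPENAI_API_KEY="sk-local-placeholder"
```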
**Security Warning:** This model demonstrates a proof-of-concept for supply chain attacks leveraging poisoned chat templates. Only use models from trusted sources. See the full context at: https://www.pillar.security/blog/llm-backdoors-at-the-inference-level-the-threat-of-poisoned-templates
**Resource Requirements:** Phi-4-mini-instruct has roughly 3.8B parameters; the Q4_K_M quantization is a file of roughly 2-3 GB. Expect to need around 4 GB of free RAM for CPU-only inference; a GPU is optional but speeds up generation considerably.
**Troubleshooting:**
- `llama-server: command not found`: reinstall with the `[server]` extra and check that your Python scripts directory is on `PATH`.
- Port already in use: pass a different value to `--port`.
- Model fails to load: verify the path and filename passed to `--model`.
- Tool calls come back as plain text: make sure the server was started with `--jinja`.
**Customization:**
- `--host`: bind address (defaults to localhost; change to expose the server on your network).
- `--ctx-size`: context window size in tokens.
- `--n-gpu-layers`: number of layers to offload to the GPU, if one is available.
This setup uses Microsoft's Phi-4-mini-instruct model. Ensure compliance with the model's license terms and with any terms attached to the specific GGUF distribution you download.