Run hardware-accelerated large language models directly in the browser using WebGPU. This skill helps you integrate in-browser LLM inference into web applications with WebLLM, supporting streaming, JSON mode, function calling, and full OpenAI API compatibility with models such as Llama, Phi, Gemma, Mistral, and Qwen.
This skill guides you through implementing WebLLM (`@mlc-ai/web-llm`) in web applications to run large language models entirely client-side with hardware acceleration. It covers installation, engine creation, chat completions, streaming, worker threads, service workers, Chrome extensions, and custom model integration.
Add the WebLLM package to your project:
```bash
npm install @mlc-ai/web-llm
```
For CDN usage in online editors (JSFiddle, CodePen):
```javascript
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
```
Initialize the engine with model loading and progress tracking:
```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const initProgressCallback = (initProgress) => {
  console.log(initProgress);
};

const selectedModel = "Llama-3.1-8B-Instruct-q4f32_1-MLC";

const engine = await CreateMLCEngine(
  selectedModel,
  { initProgressCallback: initProgressCallback }
);
```
**Note:** The first load must download the model weights and may take significant time; subsequent visits load from the browser cache.
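Since WebLLM requires WebGPU, it is worth feature-detecting support before downloading any weights, and the progress callback can drive a loading indicator instead of the console. A minimal sketch, assuming a hypothetical `#status` element; the progress report object carries a human-readable `text` field:
```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Bail out early if the browser has no WebGPU support.
if (!("gpu" in navigator)) {
  throw new Error("WebGPU is not available in this browser.");
}

// Hypothetical status element used to surface download/compile progress.
const statusEl = document.getElementById("status")!;

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (report) => {
    statusEl.textContent = report.text; // human-readable progress message
  },
});
```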
Generate responses using the OpenAI-style API:
```typescript
const messages = [
  { role: "system", content: "You are a helpful AI assistant." },
  { role: "user", content: "Hello!" },
];

const reply = await engine.chat.completions.create({
  messages,
});

console.log(reply.choices[0].message);
console.log(reply.usage);
```
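Each call is stateless, so multi-turn chat works by appending the assistant's reply to the message history before the next request. A minimal sketch building on the variables above (the follow-up question is illustrative):
```typescript
// Carry the conversation forward: append the assistant's reply,
// then the next user turn, and call the API again.
messages.push(reply.choices[0].message);
messages.push({ role: "user", content: "Can you elaborate?" });

const followUp = await engine.chat.completions.create({ messages });
console.log(followUp.choices[0].message.content);
```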
Stream responses in real time as they are generated:
```typescript
const chunks = await engine.chat.completions.create({
  messages,
  temperature: 1,
  stream: true,
  stream_options: { include_usage: true },
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content || "";
  console.log(reply);
  if (chunk.usage) {
    console.log(chunk.usage); // only last chunk has usage
  }
}
```
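After the stream completes, the engine also exposes the full concatenated reply via `getMessage()`, which avoids accumulating deltas yourself:
```typescript
// Retrieve the complete assistant message once streaming has finished.
const fullReply = await engine.getMessage();
console.log(fullReply);
```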
Offload computation to a Web Worker to prevent UI blocking:
**worker.ts:**
```typescript
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};
```
**main.ts:**
```typescript
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateWebWorkerMLCEngine(
  new Worker(
    new URL("./worker.ts", import.meta.url),
    { type: "module" }
  ),
  selectedModel,
  { initProgressCallback }
);
```
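The worker-backed engine implements the same interface as `CreateMLCEngine`, so completion calls from the main thread are unchanged while inference runs off-thread:
```typescript
// Same OpenAI-style API; the computation happens inside the worker.
const reply = await engine.chat.completions.create({ messages });
console.log(reply.choices[0].message.content);
```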
Use a service worker to persist the loaded model across page visits:
**sw.ts:**
```typescript
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
let handler: ServiceWorkerMLCEngineHandler;

self.addEventListener("activate", function (event) {
  handler = new ServiceWorkerMLCEngineHandler();
  console.log("Service Worker is ready");
});
```
**main.ts:**
```typescript
import { CreateServiceWorkerMLCEngine } from "@mlc-ai/web-llm";
if ("serviceWorker" in navigator) {
navigator.serviceWorker.register(
new URL("sw.ts", import.meta.url),
{ type: "module" }
);
}
const engine = await CreateServiceWorkerMLCEngine(
selectedModel,
{ initProgressCallback }
);
```
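Registration is asynchronous, so the engine can be created before the worker activates. One defensive pattern (an assumption, not something the library requires) is to await `navigator.serviceWorker.ready` first:
```typescript
// Wait for an active service worker before asking it to load the model.
await navigator.serviceWorker.ready;

const engine = await CreateServiceWorkerMLCEngine(
  selectedModel,
  { initProgressCallback }
);
```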
Enforce structured JSON output:
```typescript
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant that responds in JSON." },
    { role: "user", content: "Generate user profile data" },
  ],
  response_format: { type: "json_object" },
});
```
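JSON mode constrains the output shape, but the result is still delivered as a string in `message.content`, so parse it before use:
```typescript
// The structured output arrives as a JSON string; parse it into an object.
const profile = JSON.parse(reply.choices[0].message.content ?? "{}");
console.log(profile);
```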
Use seeding for consistent results:
```typescript
const reply = await engine.chat.completions.create({
  messages,
  seed: 42,
  temperature: 0.7,
});
```
WebLLM supports multiple model families, including Llama, Phi, Gemma, Mistral, and Qwen.
Access the complete list at `prebuiltAppConfig.model_list` or visit [MLC Models](https://mlc.ai/models).
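For example, the prebuilt model IDs can be inspected at runtime via the exported config:
```typescript
import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// Log every prebuilt model ID accepted by CreateMLCEngine.
for (const record of prebuiltAppConfig.model_list) {
  console.log(record.model_id);
}
```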
Implement WebLLM in Chrome extensions with persistent background service workers. Check the examples in the repository for full implementation patterns.
Integrate custom models in MLC format by compiling them with the MLC LLM toolchain. Refer to the [MLC LLM documentation](https://llm.mlc.ai/docs/deploy/webllm.html) for compilation instructions.
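Once compiled, a custom model is registered through an `appConfig` passed at engine creation. A sketch with hypothetical placeholder URLs; point them at the weight repository and WebGPU model library (`.wasm`) produced by the toolchain:
```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Hypothetical URLs: replace with your compiled MLC weights and model library.
const appConfig = {
  model_list: [
    {
      model: "https://huggingface.co/my-org/MyModel-q4f16_1-MLC",
      model_id: "MyModel-q4f16_1-MLC",
      model_lib: "https://example.com/MyModel-q4f16_1-webgpu.wasm",
    },
  ],
};

const engine = await CreateMLCEngine("MyModel-q4f16_1-MLC", { appConfig });
```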
Implement function calling with the `tools` and `tool_choice` parameters (support is currently preliminary), as sketched below.
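A sketch of the OpenAI-style request shape; the weather tool here is a hypothetical example, and since support is preliminary, behavior varies by model:
```typescript
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_current_weather", // hypothetical tool
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  tool_choice: "auto",
});

// If the model chose to call the tool, the call appears here.
console.log(reply.choices[0].message.tool_calls);
```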
A basic chatbot implementation is available in the examples directory of the WebLLM repository.