Run hardware-accelerated large language models directly in the browser using WebGPU. This skill helps you integrate in-browser LLM inference into web applications with WebLLM, supporting streaming, JSON mode, function calling, and full OpenAI API compatibility with models such as Llama, Phi, Gemma, Mistral, and Qwen.
This skill guides you through implementing WebLLM (`@mlc-ai/web-llm`) in web applications to run large language models entirely client-side with hardware acceleration. It covers installation, engine creation, chat completions, streaming, worker threads, service workers, Chrome extensions, and custom model integration.
Add the WebLLM package to your project:
```bash
npm install @mlc-ai/web-llm
```
For CDN usage in online editors (JSFiddle, CodePen):
```javascript
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
```
Initialize the engine with model loading and progress tracking:
```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const initProgressCallback = (initProgress) => {
  console.log(initProgress);
};

const selectedModel = "Llama-3.1-8B-Instruct-q4f32_1-MLC";

const engine = await CreateMLCEngine(
  selectedModel,
  { initProgressCallback: initProgressCallback }
);
```
**Note:** The first load must download the model weights and may take significant time; subsequent visits load from the browser cache.
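Since WebLLM requires WebGPU, it is worth feature-detecting support before downloading any weights, and the progress callback can drive a loading indicator instead of the console. A minimal sketch, assuming a hypothetical `#status` element; the progress report object carries a human-readable `text` field:
```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Bail out early if the browser has no WebGPU support.
if (!("gpu" in navigator)) {
  throw new Error("WebGPU is not available in this browser.");
}

// Hypothetical status element used to surface download/compile progress.
const statusEl = document.getElementById("status")!;

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (report) => {
    statusEl.textContent = report.text; // human-readable progress message
  },
});
```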
Generate responses using the OpenAI-style API:
```typescript
const messages = [
  { role: "system", content: "You are a helpful AI assistant." },
  { role: "user", content: "Hello!" },
];

const reply = await engine.chat.completions.create({
  messages,
});

console.log(reply.choices[0].message);
console.log(reply.usage);
```
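Each call is stateless, so multi-turn chat works by appending the assistant's reply to the message history before the next request. A minimal sketch building on the variables above (the follow-up question is illustrative):
```typescript
// Carry the conversation forward: append the assistant's reply,
// then the next user turn, and call the API again.
messages.push(reply.choices[0].message);
messages.push({ role: "user", content: "Can you elaborate?" });

const followUp = await engine.chat.completions.create({ messages });
console.log(followUp.choices[0].message.content);
```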
Stream responses in real time as they are generated:
```typescript
const chunks = await engine.chat.completions.create({
  messages,
  temperature: 1,
  stream: true,
  stream_options: { include_usage: true },
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content || "";
  console.log(reply);
  if (chunk.usage) {
    console.log(chunk.usage); // only last chunk has usage
  }
}
```
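After the stream completes, the engine also exposes the full concatenated reply via `getMessage()`, which avoids accumulating deltas yourself:
```typescript
// Retrieve the complete assistant message once streaming has finished.
const fullReply = await engine.getMessage();
console.log(fullReply);
```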
Offload computation to a Web Worker to prevent UI blocking:
**worker.ts:**
```typescript
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};
```
**main.ts:**
```typescript
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateWebWorkerMLCEngine(
  new Worker(
    new URL("./worker.ts", import.meta.url),
    { type: "module" }
  ),
  selectedModel,
  { initProgressCallback }
);
```
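The worker-backed engine implements the same interface as `CreateMLCEngine`, so completion calls from the main thread are unchanged while inference runs off-thread:
```typescript
// Same OpenAI-style API; the computation happens inside the worker.
const reply = await engine.chat.completions.create({ messages });
console.log(reply.choices[0].message.content);
```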
Use a service worker to persist the loaded model across page visits:
**sw.ts:**
```typescript
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
let handler: ServiceWorkerMLCEngineHandler;

self.addEventListener("activate", function (event) {
  handler = new ServiceWorkerMLCEngineHandler();
  console.log("Service Worker is ready");
});
```
**main.ts:**
```typescript
import { CreateServiceWorkerMLCEngine } from "@mlc-ai/web-llm";
if ("serviceWorker" in navigator) {
navigator.serviceWorker.register(
new URL("sw.ts", import.meta.url),
{ type: "module" }
);
}
const engine = await CreateServiceWorkerMLCEngine(
selectedModel,
{ initProgressCallback }
);
```
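Registration is asynchronous, so the engine can be created before the worker activates. One defensive pattern (an assumption, not something the library requires) is to await `navigator.serviceWorker.ready` first:
```typescript
// Wait for an active service worker before asking it to load the model.
await navigator.serviceWorker.ready;

const engine = await CreateServiceWorkerMLCEngine(
  selectedModel,
  { initProgressCallback }
);
```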
Enforce structured JSON output:
```typescript
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant that responds in JSON." },
    { role: "user", content: "Generate user profile data" },
  ],
  response_format: { type: "json_object" },
});
```
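JSON mode constrains the output shape, but the result is still delivered as a string in `message.content`, so parse it before use:
```typescript
// The structured output arrives as a JSON string; parse it into an object.
const profile = JSON.parse(reply.choices[0].message.content ?? "{}");
console.log(profile);
```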
Use seeding for consistent results:
```typescript
const reply = await engine.chat.completions.create({
  messages,
  seed: 42,
  temperature: 0.7,
});
```
WebLLM supports multiple model families, including Llama, Phi, Gemma, Mistral, and Qwen.
Access the complete list at `prebuiltAppConfig.model_list` or visit [MLC Models](https://mlc.ai/models).
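For example, the prebuilt model IDs can be inspected at runtime via the exported config:
```typescript
import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// Log every prebuilt model ID accepted by CreateMLCEngine.
for (const record of prebuiltAppConfig.model_list) {
  console.log(record.model_id);
}
```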
Implement WebLLM in Chrome extensions with persistent background service workers. Check the examples in the repository for full implementation patterns.
Integrate custom models in MLC format by compiling them with the MLC LLM toolchain. Refer to the [MLC LLM documentation](https://llm.mlc.ai/docs/deploy/webllm.html) for compilation instructions.
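Once compiled, a custom model is registered through an `appConfig` passed at engine creation. A sketch with hypothetical placeholder URLs; point them at the weight repository and WebGPU model library (`.wasm`) produced by the toolchain:
```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Hypothetical URLs: replace with your compiled MLC weights and model library.
const appConfig = {
  model_list: [
    {
      model: "https://huggingface.co/my-org/MyModel-q4f16_1-MLC",
      model_id: "MyModel-q4f16_1-MLC",
      model_lib: "https://example.com/MyModel-q4f16_1-webgpu.wasm",
    },
  ],
};

const engine = await CreateMLCEngine("MyModel-q4f16_1-MLC", { appConfig });
```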
Implement function calling with the `tools` and `tool_choice` parameters (support is currently preliminary), as sketched below.
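A sketch of the OpenAI-style request shape; the weather tool here is a hypothetical example, and since support is preliminary, behavior varies by model:
```typescript
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_current_weather", // hypothetical tool
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  tool_choice: "auto",
});

// If the model chose to call the tool, the call appears here.
console.log(reply.choices[0].message.tool_calls);
```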
A basic chatbot implementation is available in the examples directory of the WebLLM repository.