WebLLM: A High-Performance In-Browser LLM Inference Engine
Charlie F. Ruan, Yucheng Qin, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen
TL;DR
This work tackles the challenge of deploying large language models directly in web browsers to enhance privacy and reduce server dependence. It presents WebLLM, a JavaScript framework that runs LLM inference locally, using WebGPU for GPU acceleration and WebAssembly for CPU workloads, and is organized around an endpoint-like frontend and a background worker runtime. By using MLC-LLM and Apache TVM to compile efficient WebGPU kernels, WebLLM retains up to 80% of native performance on the same device while offering a simple OpenAI-style API with streaming support. The results show that universally accessible, privacy-preserving, browser-based LLM applications are practical, and the work outlines a path toward broader on-device AI capabilities in the web ecosystem.
Abstract
Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices has made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts away the different backends of diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. Using the machine learning compilers MLC-LLM and Apache TVM, WebLLM generates optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% of native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.
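To illustrate what an OpenAI-style streaming interface looks like from the application side, the following is a minimal sketch in TypeScript. It does not use WebLLM itself: `MockEngine`, `streamChat`, and `collectStream` are hypothetical names introduced only for this illustration, and the chunk shape (`choices[0].delta.content`) is assumed to follow the OpenAI chat-completions streaming format that the abstract refers to. The mock replays a canned reply word by word so the example runs without a WebGPU-capable browser.

```typescript
// One streamed chunk, modeled on the OpenAI chat-completions streaming format.
interface ChatChunk {
  choices: { delta: { content?: string } }[];
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical stand-in for an in-browser inference engine (not the WebLLM API).
// It replays a canned reply one word at a time as OpenAI-style chunks.
class MockEngine {
  private readonly cannedReply: string;

  constructor(cannedReply: string) {
    this.cannedReply = cannedReply;
  }

  async *streamChat(messages: ChatMessage[]): AsyncGenerator<ChatChunk> {
    if (messages.length === 0) {
      throw new Error("at least one message is required");
    }
    const words = this.cannedReply.split(" ");
    for (let i = 0; i < words.length; i++) {
      // Re-attach the separating space so the concatenated pieces equal the reply.
      const piece = i === 0 ? words[i] : " " + words[i];
      yield { choices: [{ delta: { content: piece } }] };
    }
    // Trailing chunk with an empty delta, which the consumer must tolerate.
    yield { choices: [{ delta: {} }] };
  }
}

// Consume a stream of chunks, forwarding each non-empty piece to an optional
// callback (e.g. to update the page incrementally) and returning the full text.
async function collectStream(
  stream: AsyncIterable<ChatChunk>,
  onToken?: (piece: string) => void
): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    const piece = chunk.choices[0]?.delta.content ?? "";
    if (piece.length > 0) {
      if (onToken) {
        onToken(piece);
      }
      text += piece;
    }
  }
  return text;
}
```

In an application, the mock would be replaced by a real engine; the WebLLM repository documents an analogous call of the form `engine.chat.completions.create({ messages, stream: true })`, whose result can be consumed with the same `for await` loop. The exact options and model identifiers should be taken from the project's documentation rather than from this sketch.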
