Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang

TL;DR

This paper tackles the challenge of performing efficient deep learning inference directly in web browsers on edge devices, where device heterogeneity and limited Web hardware acceleration create performance gaps. It introduces nnJIT, a first-of-its-kind in-browser inference system that performs just-in-time auto-generation of optimized kernels through two key innovations: Tensor-Web compiling co-design, which unifies tensor-level and Wasm-level compilation to dramatically cut per-candidate cost, and a Web-specific lite kernel optimization space, which prunes the search space from millions to dozens using web- and device-aware heuristics. The system comprises a tensor-web JIT compiler, an inference engine, a microbenchmark suite, and a kernel database, and it supports both Wasm (CPU) and WebGPU (GPU) backends with online kernel evaluation and crowdsourcing to accelerate convergence. Evaluated on modern transformer models (e.g., T5, BART, GPT-2, RoBERTa, Llama 2 7B) across diverse devices and browsers, nnJIT delivers up to $8.2\times$ speedups in model inference and substantial reductions in kernel compilation time, while maintaining modest memory overhead. The work advances practical, privacy-preserving in-browser AI and provides a blueprint for adaptive, device-aware WebDL infrastructure that scales with the growing landscape of edge hardware.

Abstract

The Web is increasingly becoming the primary platform for delivering AI services to edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, prevents current in-browser inference from achieving its full performance potential on target devices. To address this issue, this paper presents nnJIT, a pioneering in-browser inference system that enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built upon two novel techniques that significantly reduce kernel search and compilation overhead while improving performance: Tensor-Web Compiling Co-Design lowers compiling costs by around 100X by eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space reduces kernel tuning costs by focusing on Web programming requirements and efficient device resource utilization, pruning the optimization space from millions to only dozens. nnJIT is evaluated on modern models, e.g., BART, T5, and Llama 2, on a range of edge devices including laptops and smartphones, using different browsers and hardware from ARM, Intel, AMD and Nvidia. The results show that nnJIT achieves up to 8.2X faster inference within 30 seconds compared to the existing baselines.
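
To make the idea of a "lite" optimization space concrete, the following is a minimal sketch, not nnJIT's actual implementation. It enumerates tile sizes for the MatMul shape used in Figure 2, applies two illustrative pruning rules, and caches the fastest candidate in a dictionary that stands in for a kernel database. The pruning rules, the 32 KiB cache budget, and the names `full_space`, `lite_space`, and `pick_best` are all assumptions made for this example.

```python
import itertools

# MatMul shape from Figure 2: [M, K, N] = [640, 768, 2304].
M, K, N = 640, 768, 2304


def divisors(n):
    """All positive divisors of n, used here as legal tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]


def full_space():
    """Every (tile_m, tile_k, tile_n) triple that evenly divides the shape."""
    return list(itertools.product(divisors(M), divisors(K), divisors(N)))


def lite_space(cache_bytes=32 * 1024, elem_bytes=4):
    """Hypothetical pruning rules (assumptions, not the paper's heuristics):
    keep power-of-two tiles between 16 and 64 whose three tile buffers
    fit within an assumed cache budget."""
    def allowed(t):
        return 16 <= t <= 64 and (t & (t - 1)) == 0

    kept = []
    for tm, tk, tn in full_space():
        if not (allowed(tm) and allowed(tk) and allowed(tn)):
            continue
        working_set = (tm * tk + tk * tn + tm * tn) * elem_bytes
        if working_set <= cache_bytes:
            kept.append((tm, tk, tn))
    return kept


def pick_best(candidates, measure, db, key):
    """Return the cached winner for `key`; otherwise evaluate every
    candidate with `measure` (lower is better) and cache the fastest."""
    if key not in db:
        db[key] = min(candidates, key=measure)
    return db[key]
```

In a real system, `measure` would time a compiled kernel on the target device; here it is left as a caller-supplied function so the sketch stays self-contained. The point is only that a small candidate set makes it affordable to evaluate each option online.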

Paper Structure (21 sections, 16 figures, 6 tables, 1 algorithm)

Figures (16)

  • Figure 1: Wasm and WebGPU support in browsers.
  • Figure 2: The normalized kernel latency of handwritten, pre-tuned, and our nnJIT for a MatMul ([M,K,N]=[640,768,2304]).
  • Figure 3: The generated MatMul kernel ([M,K,N]=[640,768,2304]) performance and generation time of TVM on AMD 5800H CPU.
  • Figure 4: A common kernel generation pipeline.
  • Figure 5: Overview of nnJIT.
  • ...and 11 more figures