Anatomizing Deep Learning Inference in Web Browsers

Qipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, Xuanzhe Liu

TL;DR

This work provides the first comprehensive evaluation of in-browser deep learning inference. It introduces QoE-specific metrics (responsiveness, smoothness, and inference accuracy) and measures 9 representative models across 50 PC and 20 mobile devices using TF.js and ORT.js with Wasm and WebGL backends. It reveals large latency gaps relative to native inference (up to 16.9x on CPU and 30.6x on GPU on PCs; 15.8x and 7.8x on mobile), driven by limited SIMD support in Wasm, WebGL GPU abstractions, and browser overhead. It also reports substantial memory footprints (e.g., ESRGAN reaching 6.5 GB, 334.6x the model size). QoE is also degraded by resource competition: responsiveness, smoothness, and inference accuracy are all affected, though discrete GPUs tend to yield the best QoE. The study offers practical implications for browser vendors, framework developers, and web app creators, advocating memory-pool strategies, ahead-of-time WebGL binaries, and adaptive model and backend selection to balance QoE in real-world scenarios. The code and data are publicly released to enable replication and extension.
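The ESRGAN figures quoted above imply a rough model size, which gives a quick sanity check on the 334.6x ratio. The following is only an illustrative sketch: the function name is hypothetical, and it assumes that the 6.5 GB footprint and the 334.6x ratio describe the same measurement. Because the summary does not say whether "GB" is binary or decimal, both conventions are shown.

```python
def implied_model_size_mb(footprint_gb: float, ratio: float,
                          mb_per_gb: float = 1024.0) -> float:
    """Model size in MB implied by a memory footprint and a
    footprint-to-model-size ratio (hypothetical helper)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return footprint_gb * mb_per_gb / ratio


# Figures taken from the summary: 6.5 GB footprint, 334.6x the model size.
binary_mb = implied_model_size_mb(6.5, 334.6)                     # 1 GB = 1024 MB
decimal_mb = implied_model_size_mb(6.5, 334.6, mb_per_gb=1000.0)  # 1 GB = 1000 MB

print(f"implied model size: {decimal_mb:.1f} to {binary_mb:.1f} MB")
```

Under either convention the implied model size is roughly 19 to 20 MB, so a model of that size would account for a footprint of several gigabytes at the reported ratio.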

Abstract

Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference is performed directly within Web browsers. The actual performance of in-browser inference and its impact on the quality of experience (QoE) remain unexplored, and new QoE measurements are urgently required beyond traditional ones, which mainly focus on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to this latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology.
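The abstract reports two kinds of figures: latency gaps expressed as "times slower" and a rendering-time change expressed as a percentage increase. As a minimal sketch of how such figures could be derived from raw timings, the helpers below compute both. The function names are hypothetical, and the timings are placeholders chosen only so that the outputs line up with the reported 16.9 and 67.2 figures; they are not measurements from the paper.

```python
def slowdown_factor(browser_ms: float, native_ms: float) -> float:
    """How many times slower in-browser inference is than native inference."""
    if native_ms <= 0:
        raise ValueError("native_ms must be positive")
    return browser_ms / native_ms


def percent_increase(baseline_ms: float, observed_ms: float) -> float:
    """Relative increase of observed over baseline, in percent."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (observed_ms - baseline_ms) / baseline_ms * 100.0


# Placeholder timings (NOT data from the paper).
gap = slowdown_factor(browser_ms=169.0, native_ms=10.0)
render_delta = percent_increase(baseline_ms=100.0, observed_ms=167.2)

print(f"latency gap: {gap:.1f}x, render time increase: {render_delta:.1f}%")
```

Note that "16.9 times slower" is a ratio of latencies, whereas "a 67.2% increase" is a relative difference, so the two figures are not on the same scale: a 67.2% increase corresponds to a ratio of about 1.67.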

Paper Structure

This paper contains 25 sections, 3 equations, 6 figures, and 21 tables.

Figures (6)

  • Figure 1: Workflow of in-browser inference. "MB" denotes memory block.
  • Figure 2: ResNet50 average latency. "-I/-D" denotes integrated/discrete GPU. "-S/-T" denotes SIMD/multithreading. For ResNet50, native inference consistently has lower prediction latency than in-browser inference on both CPU and GPU: 5.5$\times$ lower on CPU and 10.5$\times$ lower on GPU, on average. For both in-browser and native inference, GPU achieves lower latency than CPU, by 3.0$\times$ and 5.8$\times$, respectively.
  • Figure 3: Prediction latency distribution. "M*" is the model ID, as defined in the experimental setup. Across all models and both frameworks, prediction latency varied by up to 28.4$\times$ on the Wasm backend and 19.4$\times$ on the WebGL backend, primarily due to differences in device hardware performance.
  • Figure 4: Warmup latency distribution. "M*" is the model ID, as defined in the experimental setup. Across all models and both frameworks, warmup latency varied by up to 25.3$\times$ on the Wasm backend and 14.4$\times$ on the WebGL backend.
  • Figure 5: Memory footprint analysis. We found that: (a) Memory-intensive kernels can occupy a significant memory footprint. For example, the memory requirements for Reshape are second only to FusedConv2D, even though their latency proportions are quite low. (b) In most cases, inference memory tends to stabilize as the warmup progresses. However, for super-resolution models, the inference memory footprint continues to grow because the size of the intermediate data generated during inference gradually increases.
  • ...and 1 more figure
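The four ratios in the Figure 2 caption are linked, which allows a quick consistency check. If the browser-to-native gap on CPU is 5.5, the native CPU-to-GPU speedup is 5.8, and the in-browser CPU-to-GPU speedup is 3.0, then the browser-to-native gap on GPU should be about 5.5 × 5.8 / 3.0. The sketch below works this through. The function name is hypothetical, and the identity is exact only for a single device; averages of per-device ratios need not compose exactly, so a small mismatch with the reported 10.5 is expected.

```python
def implied_gpu_gap(cpu_gap: float, native_gpu_speedup: float,
                    browser_gpu_speedup: float) -> float:
    """Browser-vs-native latency gap on GPU implied by the other three ratios.

    With B = in-browser latency and N = native latency:
      B_cpu / N_cpu = cpu_gap
      N_cpu / N_gpu = native_gpu_speedup
      B_cpu / B_gpu = browser_gpu_speedup
    so B_gpu / N_gpu = cpu_gap * native_gpu_speedup / browser_gpu_speedup.
    """
    if browser_gpu_speedup <= 0:
        raise ValueError("browser_gpu_speedup must be positive")
    return cpu_gap * native_gpu_speedup / browser_gpu_speedup


# Ratios quoted in the Figure 2 caption.
implied = implied_gpu_gap(cpu_gap=5.5, native_gpu_speedup=5.8,
                          browser_gpu_speedup=3.0)
reported = 10.5

print(f"implied GPU gap: {implied:.2f}x, reported: {reported}x")
```

The implied value of about 10.6 is within roughly 1.3% of the reported 10.5, so the caption's figures are mutually consistent.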