Janus: Collaborative Vision Transformer Under Dynamic Network Environment

Linyi Jiang, Silvery D. Fu, Yifei Zhu, Bo Li

TL;DR

This paper tackles the challenge of running accurate Vision Transformer (ViT) models under dynamic network conditions by introducing Janus, a cloud-device collaborative inference framework. Janus combines collaboration-aware token pruning with a fine-to-coarse model splitter, guided by an offline latency profiler and a real-time dynamic scheduler to adapt to network SLA and bandwidth. Key contributions include an exponential token pruning policy with $\Delta x_l = \lfloor 2^{\alpha(N-l)} \rfloor$, a forward-leaning candidate set of splitting points, a linear latency profiler, and a lightweight runtime coordinating edge-cloud execution. Experiments on real hardware and network traces show Janus achieves up to $5.15\times$ throughput improvement and up to $98.7\%$ reductions in latency violation ratios, with minimal accuracy loss, demonstrating practical viability for transformer-based vision tasks in edge-cloud environments.
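
To make the exponential pruning policy concrete, here is a minimal sketch, not the authors' implementation. It assumes that $\Delta x_l$ is the number of tokens removed at layer $l$, that $N$ is the number of transformer layers, and that $\alpha$ is a tunable pruning intensity; the summary does not define these symbols, so these readings are assumptions. The function name `pruning_schedule` and the rule of always keeping at least one token are likewise illustrative.

```python
import math


def pruning_schedule(alpha, num_layers, num_tokens):
    """Return (pruned_per_layer, tokens_left_after_each_layer).

    Assumed reading of the policy: layer l (1-indexed) removes
    floor(2 ** (alpha * (num_layers - l))) tokens.
    """
    pruned, remaining = [], []
    tokens = num_tokens
    for layer in range(1, num_layers + 1):
        delta = math.floor(2 ** (alpha * (num_layers - layer)))
        # Assumption: never prune the last remaining token.
        delta = min(delta, tokens - 1)
        tokens -= delta
        pruned.append(delta)
        remaining.append(tokens)
    return pruned, remaining


if __name__ == "__main__":
    pruned, remaining = pruning_schedule(alpha=0.5, num_layers=12, num_tokens=197)
    print(pruned)
    print(remaining[-1])
```

Under these assumptions, with `alpha=0.5`, 12 layers, and 197 input tokens, the schedule removes 45 tokens at the first layer and 1 at the last, leaving 48 tokens. With `alpha=0` every layer removes exactly one token, so larger values of `alpha` shift pruning toward the early layers.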

Abstract

Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Network architectures and achieved state-of-the-art results in various computer vision tasks. Because ViTs are computationally expensive, the models must either be pruned to run solely on resource-limited edge devices or be executed on remote cloud servers after the raw data is transmitted over fluctuating networks. The resulting accuracy degradation or high latency hinders their widespread application. In this paper, we present Janus, the first framework for low-latency cloud-device collaborative Vision Transformer inference over dynamic networks. Janus overcomes the intrinsic model limitations of ViTs and enables collaborative execution of ViT models across cloud and edge devices, achieving low latency, high accuracy, and low communication overhead. Specifically, Janus judiciously combines token pruning techniques with a carefully designed fine-to-coarse model splitting policy and a non-static mixed pruning policy. It balances accuracy and latency by dynamically selecting the optimal pruning level and split point. Experimental results across various tasks demonstrate that Janus enhances throughput by up to 5.15 times and reduces latency violation ratios by up to 98.7% when compared with baseline approaches under various network environments.
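
The idea of dynamically selecting a pruning level and split point can be illustrated with a small sketch. This is a hypothetical simplification, not Janus's scheduler: it assumes per-layer latency is linear in the token count, that layer $l$ removes $\lfloor 2^{\alpha(N-l)} \rfloor$ tokens after processing, that the tokens surviving the split layer are what gets transmitted, and that the preferred plan is the one with the least pruning that still meets the latency target. All names (`LinearProfile`, `token_counts`, `estimate_ms`, `choose_plan`) and default parameters are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class LinearProfile:
    """Hypothetical linear model: layer latency (ms) = slope * tokens + intercept."""
    slope_ms: float
    intercept_ms: float

    def layer_ms(self, tokens):
        return self.slope_ms * tokens + self.intercept_ms


def token_counts(alpha, num_layers, num_tokens):
    """counts[l] = tokens left after layer l; counts[0] = input tokens."""
    counts = [num_tokens]
    for layer in range(1, num_layers + 1):
        delta = math.floor(2 ** (alpha * (num_layers - layer)))
        counts.append(max(1, counts[-1] - delta))  # keep at least one token
    return counts


def estimate_ms(alpha, split, device, cloud, bandwidth_mbps,
                num_layers, num_tokens, bytes_per_token):
    """Estimated end-to-end latency when layers 1..split run on the device."""
    counts = token_counts(alpha, num_layers, num_tokens)
    # Layer l processes the tokens that survived layer l - 1.
    device_ms = sum(device.layer_ms(counts[l - 1]) for l in range(1, split + 1))
    cloud_ms = sum(cloud.layer_ms(counts[l - 1])
                   for l in range(split + 1, num_layers + 1))
    # Nothing is transmitted when the whole model runs on the device.
    bits = 0 if split == num_layers else counts[split] * bytes_per_token * 8
    transmit_ms = bits / (bandwidth_mbps * 1000.0)  # bits / (bits per ms)
    return device_ms + transmit_ms + cloud_ms


def choose_plan(alphas, splits, device, cloud, bandwidth_mbps, sla_ms,
                num_layers=12, num_tokens=197, bytes_per_token=768 * 4):
    """Return (alpha, split, latency_ms) with the least pruning meeting the SLA, or None."""
    for alpha in sorted(alphas):  # smaller alpha means less pruning
        feasible = []
        for split in splits:
            ms = estimate_ms(alpha, split, device, cloud, bandwidth_mbps,
                             num_layers, num_tokens, bytes_per_token)
            if ms <= sla_ms:
                feasible.append((ms, split))
        if feasible:
            ms, split = min(feasible)
            return alpha, split, ms
    return None
```

Calling `choose_plan` again whenever the measured bandwidth changes gives a simple form of adaptation: a drop in bandwidth raises the transmission term, which can push the chosen split deeper into the model or force a higher pruning level.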

Paper Structure

This paper contains 21 sections, 3 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparing the existing architectures of serving Vision Transformer (ViT) and Janus: a device-cloud collaborative system that adapts ViT for dynamic networks.
  • Figure 2: Inference latency breakdown for ViT-B.
  • Figure 3: System overview of Janus.
  • Figure 4: The fine-to-coarse policy for generating candidate splitting points, applied to a ViT with 12 layers when $k$ is set to 3. The crossed-out indices represent split points that are removed after applying the splitting policy.
  • Figure 5: Layer latency of ViTs across different numbers of tokens.
  • ...and 4 more figures