Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics

Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Ren Jin, Yuanqing Xia

TL;DR

Arena tackles real-time edge-assisted video analytics by exploiting Vision Transformer (ViT) strengths through Patch-of-Interest (PoI) pruning, enabling offloading of only informative regions to edge GPUs. It introduces Memory Token Pools (MTPs), Probability-based Patch Sampling (PPS), Memory Feature Reconstruction (MFR), and Adaptive Keyframe Interval Switching (AKIS) to balance accuracy and bandwidth, coupled with a two-phase keyframe/non-keyframe inference scheme. Across MOT17Det and AIC22, Arena delivers up to 1.58×–1.82× end-to-end acceleration while reducing bandwidth to 31–47% and maintaining low accuracy loss (≈1–5%), validated on a Jetson-edge and RTX-based testbed. This work demonstrates practical, scalable ViT-based video analytics at the edge and opens avenues for broader tasks and deployments.

Abstract

The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, built on traditional model architectures (e.g., CNNs and RNNs), employ various strategies to filter out non-region-of-interest content and thereby reduce bandwidth and computation, but they perform poorly in adverse environments. Recently, transformer-based visual foundation models have shown strong performance in adverse environments thanks to their generalization capability. However, they require a large amount of computation, which limits their use in real-time intelligent video analytics. In this paper, we find that visual foundation models such as the Vision Transformer (ViT) also have a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage the fact that ViT can be accelerated through token pruning by offloading and feeding only Patches-of-Interest to the downstream models. Additionally, we design an adaptive keyframe inference switching algorithm that adapts to the current video content to jointly optimize accuracy and bandwidth. Through extensive experiments, we find that Arena can boost inference speeds by up to 1.58× and 1.82× on average while consuming only 47% and 31% of the bandwidth, respectively, all with high inference accuracy.
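To make the idea of offloading only Patches-of-Interest concrete, here is a minimal sketch of selecting the ViT patch indices that overlap a set of bounding boxes. It is not Arena's actual selection procedure: the function name `patches_of_interest`, the 16-pixel patch size, and the assumption that candidate boxes come from an earlier detection result are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; the max coordinates are exclusive.
Box = Tuple[int, int, int, int]


def patches_of_interest(width: int, height: int, boxes: List[Box],
                        patch: int = 16) -> List[int]:
    """Return sorted row-major indices of patches that overlap any box.

    Hypothetical helper for illustration; assumes the frame is split into a
    regular grid of patch x patch tiles, as in a standard ViT.
    """
    cols, rows = width // patch, height // patch
    keep = set()
    for x0, y0, x1, y1 in boxes:
        # Convert pixel extents to inclusive grid ranges, clamped to the frame.
        c0, c1 = max(0, x0 // patch), min(cols - 1, (x1 - 1) // patch)
        r0, r1 = max(0, y0 // patch), min(rows - 1, (y1 - 1) // patch)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                keep.add(r * cols + c)
    return sorted(keep)


def kept_fraction(width: int, height: int, boxes: List[Box],
                  patch: int = 16) -> float:
    """Fraction of the frame's patches that would be offloaded."""
    total = (width // patch) * (height // patch)
    return len(patches_of_interest(width, height, boxes, patch)) / total
```

For a 64×64 frame and a single box `(10, 10, 40, 20)`, the sketch keeps the six patches in grid rows 0 to 1 and columns 0 to 2, so only those tokens would be transmitted and fed to the backbone.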

Paper Structure (28 sections, 12 equations, 15 figures, 4 tables, 2 algorithms)

This paper contains 28 sections, 12 equations, 15 figures, 4 tables, 2 algorithms.

Figures (15)

  • Figure 1: Arena: our patch-of-interest ViT inference acceleration system for edge-assisted video analytics. Due to the limited computing power of the camera, the extracted patches-of-interest are offloaded to an edge server for processing with its more powerful GPUs. MTPs stands for Memory Token Pools.
  • Figure 2: The inference latency for three strategies: Full Frame, Masked Frame, and RoIs separately. Downstream models based on CNNs fail to benefit from the filtered RoIs.
  • Figure 3: Pruning patches can accelerate ViT backbone inference. We evaluate the impact of pruning 25%, 50%, and 75% of tokens on (a) inference latency and (b) GFLOPs using ViT-base across videos of 1080p, 720p, and 480p resolutions. Similar trends were found in ViT-Base.
  • Figure 4: The keyframe interval introduces an accuracy-bandwidth trade-off in different scenes.
  • Figure 5: Overview of Arena. Given $K$ consecutive frames $\{\hat{\mathbf{x}}^1, \mathbf{x}^2, \ldots, \mathbf{x}^K\}$ in an interval, Arena periodically operates in two distinct phases: keyframe inference (left) for the first frame $\hat{\mathbf{x}}^1$ and non-keyframe inference (right) for the remaining frames. AKIS (bottom), deployed on the camera, uses information from historical frames to determine subsequent keyframe intervals. Note that the frame is split into nine patches for illustration only.
  • ...and 10 more figures
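
The trend described in the Figure 3 caption, where pruning tokens lowers backbone cost, can be illustrated with a rough operation count. The sketch below uses a common textbook estimate of multiply-accumulates for one standard transformer encoder layer; it is not the paper's measurement. The function names, the 196-token input, and the hidden size of 768 (the ViT-Base configuration) are assumptions chosen for illustration, and the estimate ignores layer norms, softmax, biases, and the class token.

```python
def encoder_layer_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one standard ViT encoder layer."""
    projections = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim   # QK^T scores plus weighted sum over V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim    # two linear layers of the MLP
    return projections + attention + mlp


def relative_cost(total_tokens: int, keep_ratio: float, dim: int) -> float:
    """Cost of a layer on the kept tokens relative to the full token set."""
    kept = max(1, round(total_tokens * keep_ratio))
    return encoder_layer_macs(kept, dim) / encoder_layer_macs(total_tokens, dim)


if __name__ == "__main__":
    # Illustrative setting: 196 tokens (224x224 input, 16x16 patches), dim 768.
    for keep in (1.0, 0.75, 0.5, 0.25):
        print(f"keep {keep:.0%} of tokens -> relative cost {relative_cost(196, keep, 768):.3f}")
```

Because the attention term grows quadratically with the number of tokens, keeping half of the tokens costs slightly less than half of the full-layer estimate, and the gap widens at higher resolutions where the token count is larger.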