
Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction

Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn

Abstract

Vision Transformers (ViTs) have demonstrated state-of-the-art performance on several benchmarks, yet their high computational cost hinders practical deployment. Patch pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore, we propose a fully differentiable module for temporal mapping that accurately selects the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding the capabilities of conventional image-based patch pruning, which typically operates at around 30% patch sparsity. VPP excels in the high-sparsity regime, sustaining remarkable performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the YouTube-VIS 2021 dataset.
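To make the pruning idea concrete, here is a minimal sketch (our own illustration, not the authors' implementation) of early token reduction: per-patch foreground scores, here assumed to be derived from the temporal prior, select the top-scoring fraction of tokens after an early block, and only those tokens are forwarded through the remaining layers. The function name, tensor shapes, and keep ratio are illustrative assumptions.

```python
import torch

def prune_patches(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.4):
    """Keep only the top-scoring fraction of patch tokens.

    tokens: (B, N, D) patch embeddings after an early ViT block.
    scores: (B, N) per-patch foreground scores (assumed to come from the
            temporal prior; any differentiable scorer would fit here).
    keep_ratio=0.4 mirrors the ~60% patch reduction reported above.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(k, dim=1).indices                           # (B, k)
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))  # (B, k, D)
    return kept, keep_idx  # the remaining ViT layers run on the kept tokens only
```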


Paper Structure

This paper contains 24 sections, 11 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Pruning masks for image-based (SViT) and video-based (VPP) pruning at 60% avg. patch reduction. For instance segmentation, only the highlighted objects must be segmented, while the background can be ignored. VPP uses temporal information for early-stage pruning, effectively preserving the foreground.
  • Figure 2: Foreground Selectivity (FGS) across layers. This plot shows the ability of feature $x$ to identify patches belonging to an object instance. Initial features (idx: 0) and early blocks (idx: 1-3) significantly lack Foreground Selectivity.
  • Figure 3: Mapping-Selective Module (Map-SM). Using the features from the preceding frame ($x^{t-1}_{1}$), Map-SM calculates an association matrix $A$ to establish patch correspondences in order to map high-level features $x^{t-1}_{6}$ onto the current frame. Mask $M_6^{t-1}$ ensures that only the remaining patches from the preceding frame are considered for the mask selection process (a minimal sketch of this mapping follows this list).
  • Figure 4: Performance loss in AP vs. pruned patches on YouTube-VIS 2019 and 2021, using the small model size.
  • Figure 5: Patch density per layer. VPP removes patches after layer 1, while maintaining high patch density within the deepest layers (7-12). In contrast, image-based SViT requires dense initial layers, leaving the deepest layers with insufficient patch density.
  • ...and 5 more figures
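The Map-SM mechanism described in the Figure 3 caption lends itself to a short sketch. The version below is our own reading of that caption, not the paper's code: early features of the current frame attend, via softmax-normalized dot products, over the patches that survived pruning in the preceding frame, and the resulting association matrix $A$ transports the deeper features $x_6^{t-1}$ onto frame $t$. The temperature, function name, and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def map_sm(x1_prev, x6_prev, x1_curr, mask_prev, tau: float = 1.0):
    """Map high-level features from frame t-1 onto frame t via patch association.

    x1_prev:   (B, N, D) early features of the preceding frame (x_1^{t-1}).
    x6_prev:   (B, N, C) deeper features of the preceding frame (x_6^{t-1}).
    x1_curr:   (B, N, D) early features of the current frame.
    mask_prev: (B, N) bool, True where a patch survived pruning in t-1 (M_6^{t-1}).
    """
    # Dot-product similarity between current and previous early features.
    sim = torch.einsum('bnd,bmd->bnm', x1_curr, x1_prev) / tau
    # Restrict correspondences to patches still present in the preceding frame.
    sim = sim.masked_fill(~mask_prev.unsqueeze(1), float('-inf'))
    A = F.softmax(sim, dim=-1)                        # (B, N, N) association matrix
    return torch.einsum('bnm,bmc->bnc', A, x6_prev)   # deep features warped to frame t
```

Because every step is a differentiable tensor operation (similarity, softmax, weighted sum), a module of this shape can be trained end to end, consistent with the fully differentiable mapping the abstract describes.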