A Simple Video Segmenter by Tracking Objects Along Axial Trajectories

Ju He; Qihang Yu; Inkyu Shin; Xueqing Deng; Alan Yuille; Xiaohui Shen; Liang-Chieh Chen

TL;DR

Axial-VS tackles the memory bottleneck of self-attention in video segmentation by processing videos clip by clip and introducing axial-trajectory attention, which tracks object motion along the height and width axes. It adds two tracking modules on top of an existing clip-level segmenter: a within-clip module that enforces temporal consistency inside each clip, and a cross-clip module that enforces it across the entire video. The approach achieves state-of-the-art or competitive results on video panoptic segmentation (VPS) and video instance segmentation (VIS) benchmarks, and ablations support the proposed attention and tracking design. The framework is simple, general, and scales to high-resolution video features.

Abstract

Video segmentation requires consistently segmenting and tracking objects over time. Due to the quadratic dependency on input size, directly applying self-attention to video segmentation with high-resolution input features poses significant challenges, often exceeding the available GPU memory. Consequently, modern video segmenters either extend an image segmenter without incorporating any temporal attention or resort to window space-time attention in a naive manner. In this work, we present Axial-VS, a general and simple framework that enhances video segmenters by tracking objects along axial trajectories. The framework tackles video segmentation through two sub-tasks: short-term within-clip segmentation and long-term cross-clip tracking. In the first step, Axial-VS augments an off-the-shelf clip-level video segmenter with the proposed axial-trajectory attention, sequentially tracking objects along the height- and width-trajectories within a clip, thereby enhancing temporal consistency by capturing motion trajectories. The axial decomposition significantly reduces the computational complexity for dense features, and outperforms the window space-time attention in segmentation quality. In the second step, we further apply axial-trajectory attention to the object queries in clip-level segmenters, which are learned to encode object information, thereby aiding object tracking across different clips and achieving consistent segmentation throughout the video. Without bells and whistles, Axial-VS showcases state-of-the-art results on video segmentation benchmarks, emphasizing its effectiveness in addressing the limitations of modern clip-level video segmenters. Code and models are available at https://github.com/TACJu/Axial-VS.
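To make the complexity argument concrete, the following back-of-envelope sketch counts query-key score pairs for full space-time attention versus an axial decomposition. It is not taken from the paper: the function names and the feature size (T=2, H=W=64) are illustrative assumptions, and the count covers only the trajectory step (each query attending to every position along one axis in every frame), ignoring the lower-order temporal attention that follows. Under these assumptions the saving factor works out to HW / (H + W).

```python
def full_space_time_pairs(T, H, W):
    """Score pairs when every one of the T*H*W tokens attends to all others."""
    n = T * H * W
    return n * n


def axial_pairs(T, H, W):
    """Score pairs when attention is restricted to one spatial axis at a time.

    Height axis: each of the W columns holds T*H tokens that attend to each other.
    Width axis:  each of the H rows holds T*W tokens that attend to each other.
    (Assumed counting model; the follow-up temporal attention is not included.)
    """
    height_axis = W * (T * H) ** 2
    width_axis = H * (T * W) ** 2
    return height_axis + width_axis


if __name__ == "__main__":
    T, H, W = 2, 64, 64  # hypothetical clip length and feature resolution
    full = full_space_time_pairs(T, H, W)
    axial = axial_pairs(T, H, W)
    print(f"full space-time: {full:,} pairs")
    print(f"axial:           {axial:,} pairs")
    print(f"reduction:       {full // axial}x")
```

The gap widens with resolution, since the full count grows with (HW)^2 while the axial count grows with HW(H + W).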

Paper Structure (19 sections, 6 equations, 18 figures, 13 tables)

Introduction
Related Work
Method
Video Segmentation with Clip-level Segmenter
Within-Clip Tracking Module
Cross-Clip Tracking Module
Experimental Results
Improvements over Baselines
Comparisons with Other Methods
Ablation Studies
Conclusion
Implementation Details
Additional Experimental Results
GFLOPs, FPS and VRAM Comparisons
Comparisons with Other Methods
...and 4 more sections

Figures (18)

Figure 1: Visualization of Learned Axial-Trajectory Attention. In this short clip depicting the action 'playing basketball', the basketball location at frame 1 is selected as the reference point (marked in red). We multiply the learned height and width axial-trajectory attentions and overlay them on frames 2, 3, and 4 to visualize the trajectory of the reference point over time. As observed, the axial-trajectory attention can capture the basketball's motion path.
Figure 2: Overview of Axial-VS, which builds two components on top of a clip-level segmenter (blue): the within-clip tracking and cross-clip tracking modules (orange). Both modules exploit the axial-trajectory attention to enhance temporal consistency. We obtain the video features by concatenating the clip features output by the pixel decoder for all $K$ clips, and the video prediction by multiplying ($\bigotimes$) the video features with the refined clip object queries.
Figure 3: The within-clip tracking module takes as input the clip features extracted by the network backbone, stacks Multi-Scale Deformable (MSDeform) Attention and axial-trajectory attention (applied sequentially along the H- and W-axes) $N_w$ times, and outputs spatially and temporally consistent clip features.
Figure 4: Illustration of Axial-Trajectory Attention (only the height-axis variant is shown for simplicity). It consists of two steps: (1) computing the axial trajectories $\widetilde{y}$ along the height axis (Eq. \ref{equ:traject-attn1}) of the dense pixel feature maps $x \in \mathbb{R}^{TH \times D}$, where $T$, $H$, and $D$ denote the clip length, feature height, and number of channels, respectively; and (2) computing temporal attention along the axial trajectories (Eq. \ref{equ:temporal-attn1}) to obtain the temporally consistent features $y$.
Figure 5: The cross-clip tracking module refines $K$ sets of clip object queries by performing axial-trajectory attention and temporal atrous spatial pyramid pooling (Temporal-ASPP) $N_c$ times.
...and 13 more figures
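The two steps described in the Figure 4 caption can be sketched in plain Python for a single image column. This is a simplified illustration, not the authors' implementation: the equations themselves are not reproduced on this page, so the sketch assumes scaled dot-product attention with softmax normalization per frame, omits the learned query/key/value projections (identity is used instead), and uses a single head. The function name `height_trajectory_attention` and its helpers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def weighted_sum(weights, vectors):
    """Convex combination of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def height_trajectory_attention(x):
    """x[t][h] is the D-dim feature at frame t, row h of one column (T x H x D).

    Assumption: identity projections stand in for the learned linear layers.
    """
    T, H, D = len(x), len(x[0]), len(x[0][0])
    scale = 1.0 / math.sqrt(D)
    out = [[None] * H for _ in range(T)]
    for t in range(T):
        for h in range(H):
            query = x[t][h]
            # Step 1: in every frame t2, attend over the H rows of that frame
            # only, yielding one trajectory point per frame for this query.
            trajectory = []
            for t2 in range(T):
                weights = softmax([scale * dot(query, x[t2][h2]) for h2 in range(H)])
                trajectory.append(weighted_sum(weights, x[t2]))
            # Step 2: temporal attention along the T trajectory points, using
            # the point at the query's own frame as the temporal query.
            temporal_query = trajectory[t]
            weights_t = softmax([scale * dot(temporal_query, p) for p in trajectory])
            out[t][h] = weighted_sum(weights_t, trajectory)
    return out


if __name__ == "__main__":
    # Toy input: T=2 frames, H=3 rows, D=2 channels (made-up values).
    x = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]]
    y = height_trajectory_attention(x)
    print(len(y), len(y[0]), len(y[0][0]))
```

The width-axis variant would follow the same pattern with rows and columns swapped; a practical implementation would batch these loops as tensor operations.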
