TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking

Raghav Goyal, Wan-Cyuan Fan, Mennatullah Siam, Leonid Sigal

TL;DR

TAM-VT introduces a Transformation-Aware Multi-scale Video Transformer for semi-supervised video object segmentation (Semi-VOS) that processes long egocentric videos via clip-based memory and a DETR-style encoder-decoder. The approach couples a multi-scale matching encoder with a clip-based memory and a multi-scale decoder to enable accurate tracking of small objects undergoing complex deformations, aided by a novel transformation-aware loss and a multiplicative time-coded memory based on Relative-Time Encoding (RTE). Key contributions include (1) a holistic multi-scale memory matching/decoding framework, (2) a clip-based memory module with online inference, (3) a transformation-aware reweighting strategy that focuses learning on transformation frames, and (4) state-of-the-art performance on VISOR and VOST, with results comparable to the state of the art on DAVIS'17. The work reports gains on long videos and in small-object regimes for egocentric Semi-VOS, and provides ablations and analysis to guide future memory-based video transformers.

Abstract

Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings, which involve long videos with global motion (e.g., in egocentric settings), depicting small objects undergoing both rigid and non-rigid (including state) deformations. While a number of recent approaches have been explored for this task, these data characteristics still present challenges. In this work we propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges. Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations -- a form of "soft" hard-example mining. Further, we propose a multiplicative time-coded memory, beyond vanilla additive positional encoding, which helps propagate context across long videos. Finally, we incorporate these in our proposed holistic multi-scale video transformer for tracking via multi-scale memory matching and decoding to ensure sensitivity and accuracy for long videos and small objects. Our model enables online inference on long videos in a windowed fashion, by breaking the video into clips and propagating context among them. We illustrate that short clip length and longer memory with learned time-coding are important design choices for improved performance. Collectively, these technical contributions enable our model to achieve new state-of-the-art (SoTA) performance on two complex egocentric datasets -- VISOR and VOST, while achieving results comparable to SoTA on the conventional VOS benchmark, DAVIS'17. A series of detailed ablations validates our design choices and provides insights into the importance of parameter choices and their impact on performance.
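
The windowed, online inference described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the function names, the `segment_clip` callable standing in for the model, and the choice to always keep the first annotated frame in memory next to a bounded FIFO queue of recent frames are all assumptions made for illustration.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Sequence, Tuple

MemoryEntry = Tuple[Any, Any]  # (frame, mask) pair; types left open on purpose


def split_into_clips(frames: Sequence[Any], clip_len: int) -> List[List[Any]]:
    """Split a frame sequence into non-overlapping clips of length clip_len.

    The final clip may be shorter when the length is not a multiple of clip_len.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [list(frames[i:i + clip_len]) for i in range(0, len(frames), clip_len)]


def run_online(
    frames: Sequence[Any],
    first_mask: Any,
    clip_len: int,
    memory_size: int,
    segment_clip: Callable[[List[Any], List[MemoryEntry]], List[Any]],
) -> List[Any]:
    """Process a video clip by clip, carrying context in a bounded memory.

    segment_clip is a hypothetical stand-in for the model: it receives a query
    clip and the current memory and returns one mask per frame of the clip.
    """
    reference: MemoryEntry = (frames[0], first_mask)      # given annotation
    recent: Deque[MemoryEntry] = deque(maxlen=memory_size)  # FIFO: oldest entry drops out
    masks: List[Any] = [first_mask]

    for clip in split_into_clips(frames[1:], clip_len):
        # Assumption: the reference frame is always visible to the model.
        memory = [reference] + list(recent)
        clip_masks = segment_clip(clip, memory)
        masks.extend(clip_masks)
        # Only the last frame of the clip and its predicted mask enter memory.
        recent.append((clip[-1], clip_masks[-1]))
    return masks
```

Because each clip only needs the memory built from earlier clips, the per-step cost stays bounded by `clip_len` and `memory_size` regardless of the total video length, which is what makes this style of inference usable on long videos.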

Paper Structure (36 sections, 6 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Illustration of Our Approach. Given the first frame and the mask for the object of interest, our model tracks the object through the video by producing a sequence of segmentation masks. Despite potential transformations of the object, our approach achieves better performance on long videos (> 20 sec) and videos with small objects (< 0.5% frame area). $\dag$ denotes results reproduced by us using official code.
  • Figure 2: Overview of TAM-VT. We divide an input video into non-overlapping clips of length $L$. For a query clip, we retrieve information on previous clips from our (a) Clip-based Memory in the form of frames and predicted (or initial reference) masks. We use a 2D-CNN backbone to obtain features for query frames $X^q$, and features for memory frames and masks, $X^M$ and $Y^M$ respectively. We then use our proposed (b) Multi-Scale Matching Encoder to perform dense matching at multiple scales between the frame features of the query clip $X^q$ and those of the memory frames $X^M$, and use the resulting similarity between frames to obtain the query clip's mask features $Y^{q, enc}$ as a weighted combination of the memory mask features $Y^M$. In doing so, we modulate the similarity using our proposed multiplicative Relative-Time Encoding (RTE) to learn the recency of information in memory, thereby facilitating propagation over long time spans. We then use the (c) Multi-Scale Decoder to aggregate the resulting query clip's mask features $Y^{q, enc}$ with the clip's frame features $X^q$ using a Pixel Decoder, which gives a contextualized feature pyramid $Y^{q, fpn}$. Finally, we use a Space-Time Decoder to decode mask predictions $\hat{Y}^q$ by refining learned time embeddings on the contextualized feature pyramid $Y^{q, fpn}$. We update the memory, implemented as a FIFO queue, with the predictions from the last frame (the $L^{th}$ index) of the query clip. During training, we use our transformation-aware loss $\mathcal{L}^{tr}$ to form the segmentation loss for the entire video.
  • Figure 3: Qualitative comparison on VOST. Best viewed in color; red indicates incorrect predictions. Our method is able to track and delineate the object's boundary with fine details under complex deformation, e.g., cutting onions, compared to the best prior work, AOT [yang2021associating].
  • Figure 4: (a) Visualization of RTE. The value in each grid cell represents the learned importance score of a frame for a given number of frames in the memory ($n$). Lighter colors indicate higher scores. In each row, index $0$ denotes the nearest frame, the last index denotes the first frame, and the remaining indices denote intermediate frames in memory. (b) Visualization of the attention maps in the multi-scale matching modules. Scale 1 has a resolution of $\frac{1}{32}$ of the input frame, and scale 2 is at $\frac{1}{16}$ resolution. Best viewed in color; lighter colors indicate higher attention scores. The red box in the first frame denotes the object of interest in the query frame.
  • Figure A5: Performance breakdown w.r.t. (a) video length and (b) small object size. We compare AOT [yang2021associating] and our method on subsets of different video lengths and object sizes. (a) We observe that 1) performance decreases as video duration grows, demonstrating the difficulty of long videos, and 2) our method generally outperforms AOT [yang2021associating] on long-range subsets, by up to $10\%$ on videos longer than $34$ secs. (b) We observe that 1) performance decreases with smaller object size, confirming the difficulty of the task, and 2) our method outperforms AOT [yang2021associating] on all small-object (SM) subsets, demonstrating its effectiveness on small objects. # video denotes the number of videos.
  • ...and 5 more figures
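
As a rough illustration of the matching step described in the Figure 2 caption, the sketch below computes a similarity between one query feature and a set of memory features, modulates it multiplicatively with one weight per memory slot, and returns the weighted combination of memory mask features. It is a simplified assumption rather than the paper's formulation: it uses one feature vector per memory frame instead of dense per-pixel matching, a single scale, softmax-normalized dot-product similarity, and a renormalization after the multiplicative step. The name `match_with_rte` and the `rte_weights` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def match_with_rte(
    query_feat: Sequence[float],
    memory_feats: Sequence[Sequence[float]],
    memory_mask_feats: Sequence[Sequence[float]],
    rte_weights: Sequence[float],
) -> List[float]:
    """Return query mask features as a time-modulated mix of memory mask features.

    rte_weights holds one positive scalar per memory slot (learned in a real
    model, fixed here). Multiplying the similarity by these weights lets the
    model prefer, for example, more recent memory entries.
    """
    if not (len(memory_feats) == len(memory_mask_feats) == len(rte_weights)):
        raise ValueError("memory inputs must have the same length")
    if any(w <= 0 for w in rte_weights):
        raise ValueError("rte_weights must be positive")

    scale = 1.0 / math.sqrt(len(query_feat))
    similarity = softmax([dot(query_feat, k) * scale for k in memory_feats])

    # Multiplicative modulation (in contrast to adding a positional encoding
    # to the features), followed by renormalization (an assumption).
    modulated = [s * w for s, w in zip(similarity, rte_weights)]
    total = sum(modulated)
    weights = [m / total for m in modulated]

    dim = len(memory_mask_feats[0])
    return [
        sum(w * v[j] for w, v in zip(weights, memory_mask_feats))
        for j in range(dim)
    ]
```

With all `rte_weights` equal, the function reduces to plain attention over the memory; unequal weights shift the combination toward the favored memory slots even when the visual similarity is identical.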