Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi-Man Pun, Xiaodong Cun

TL;DR

This work introduces a test-time training (TTT) strategy for zero-shot video object segmentation and finds that momentum-based weight initialization and a looping-based training scheme yield more stable improvements.

Abstract

Zero-shot Video Object Segmentation (ZSVOS) aims to segment the primary moving object without any human annotations. Mainstream solutions focus on learning a single model on large-scale video datasets, and such models struggle to generalize to unseen videos. In this work, we introduce a test-time training (TTT) strategy to address this problem. Our key insight is to require the model to predict consistent depth during the TTT process. Specifically, we first train a single network to perform both segmentation and depth prediction, which can be learned effectively with our specifically designed depth modulation layer. Then, during TTT, the model is updated by predicting consistent depth maps for the same frame under different data augmentations. In addition, we explore different TTT weight-updating strategies. Our empirical results suggest that momentum-based weight initialization and a looping-based training scheme lead to more stable improvements. Experiments show that the proposed method achieves clear improvements on ZSVOS, and that the proposed video TTT strategy significantly outperforms state-of-the-art TTT methods. Our code is available at: https://nifangbaage.github.io/DATTT.
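The depth-consistency signal described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything named here is an assumption made for illustration: `photometric_jitter`, `toy_depth_model`, and `depth_consistency_loss` are hypothetical helpers, the L1 form of the loss is assumed rather than taken from the paper, and the toy predictor merely stands in for the real encoder and depth decoder. The sketch only computes the self-supervised signal; in the actual method this error would be backpropagated to update the model.

```python
from typing import List

Frame = List[List[float]]  # grayscale image as rows of intensities in [0, 1]


def photometric_jitter(frame: Frame, gain: float, bias: float) -> Frame:
    """Brightness/contrast change; geometry (and so the true depth) is unchanged."""
    return [[min(1.0, max(0.0, gain * p + bias)) for p in row] for row in frame]


def toy_depth_model(frame: Frame) -> Frame:
    """Hypothetical stand-in for encoder + depth decoder (NOT the paper's network).

    It wrongly ties depth to intensity, so it is sensitive to photometric changes.
    """
    return [[1.0 - p for p in row] for row in frame]


def depth_consistency_loss(depth_a: Frame, depth_b: Frame) -> float:
    """Mean absolute difference between two depth maps (assumed L1 form)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(depth_a, depth_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


frame = [[0.2, 0.4], [0.6, 0.8]]
view_a = photometric_jitter(frame, gain=1.1, bias=0.0)
view_b = photometric_jitter(frame, gain=0.9, bias=0.05)

# Two augmented views of the same frame should yield the same depth map;
# any disagreement is the training signal used at test time.
loss = depth_consistency_loss(toy_depth_model(view_a), toy_depth_model(view_b))
same_view_loss = depth_consistency_loss(toy_depth_model(view_a), toy_depth_model(view_a))
print(loss, same_view_loss)
```

Because the two views share the same scene geometry, a non-zero loss indicates that the predictor reacts to appearance changes it should ignore, which is what makes the signal usable without annotations.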

Paper Structure (20 sections, 6 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: Key idea of our Depth-aware Test-Time Training. During test-time training, the model is required to predict consistent depth maps for the same video frame under different data augmentations (2nd row). The model is progressively updated and provides more precise mask predictions (3rd row).
  • Figure 2: Overview of the proposed Depth-aware Test-Time Training. We add a depth decoder to the commonly used two-stream ZSVOS architecture to learn 3D knowledge. The model is first trained on large-scale datasets for object segmentation and depth estimation. Then, for each test video, we apply photometric distortion-based data augmentation to the frames. The error between the predicted depth maps is backpropagated to update the image encoder. Finally, the updated model is applied to infer the object.
  • Figure 3: The proposed depth-aware modulation layer. At each scale $i$, we generate the modulation parameter from the depth feature $\mathcal{D}_{d}^i$ and the object feature $\mathcal{D}_{m}^i$ to modulate $\mathcal{D}_{m}^i$.
  • Figure 4: A glance at different frameworks for ZSVOS described in Section \ref{sec:depth_aware_ttt}. (a) Previous ZSVOS methods directly apply the trained model to infer the test video. (b) Image-based test-time training methods (TTT-N) fine-tune the model on each individual frame. (c) Video test-time training by momentum-based weight initialization (TTT-MWI) trains the model based on past models. (d) Video test-time training by looping through the video (TTT-LTV) benefits from global information.
  • Figure 5: Performance versus the number of training epochs on the FBMS [ochs2013segmentation], Long-Videos [liang2020video], and MCL [kim2015spatiotemporal] datasets. The proposed strategy (TTT-LTV, introduced in Section \ref{sec:depth_aware_ttt}) requires less time to adapt the model to the target video on all three datasets and achieves better results.
  • ...and 2 more figures
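
The three test-time schedules contrasted in Figure 4 (TTT-N, TTT-MWI, TTT-LTV) can be sketched on a toy problem. This is a rough illustration under stated assumptions, not the authors' implementation: the model is a single scalar weight, each "frame" is a scalar whose self-supervised loss is `(w - frame)^2`, and the exact forms chosen for the momentum blend in `ttt_momentum_init` and for the looping schedule in `ttt_loop_video` are guesses based only on the caption. All function names and hyperparameters are hypothetical.

```python
from typing import Callable, List

GradFn = Callable[[float, float], float]


def sgd_steps(w: float, frame: float, grad_fn: GradFn, lr: float, steps: int) -> float:
    """Run a few plain gradient-descent steps on one frame's self-supervised loss."""
    for _ in range(steps):
        w -= lr * grad_fn(w, frame)
    return w


def ttt_naive(w0: float, frames: List[float], grad_fn: GradFn,
              lr: float = 0.1, steps: int = 1) -> List[float]:
    """TTT-N: every frame is adapted independently, starting from the trained weights."""
    return [sgd_steps(w0, f, grad_fn, lr, steps) for f in frames]


def ttt_momentum_init(w0: float, frames: List[float], grad_fn: GradFn,
                      lr: float = 0.1, steps: int = 1,
                      momentum: float = 0.9) -> List[float]:
    """TTT-MWI (assumed form): each frame starts from a running blend of past weights."""
    init, per_frame = w0, []
    for f in frames:
        adapted = sgd_steps(init, f, grad_fn, lr, steps)
        per_frame.append(adapted)
        # Assumed update: exponential moving average of the initialization.
        init = momentum * init + (1.0 - momentum) * adapted
    return per_frame


def ttt_loop_video(w0: float, frames: List[float], grad_fn: GradFn,
                   lr: float = 0.1, epochs: int = 3) -> List[float]:
    """TTT-LTV (assumed form): several passes over the whole video, one shared model."""
    w = w0
    for _ in range(epochs):
        for f in frames:
            w = sgd_steps(w, f, grad_fn, lr, 1)
    return [w] * len(frames)  # the final model is used to infer every frame


def toy_grad(w: float, frame: float) -> float:
    """Gradient of the toy loss (w - frame)^2 with respect to w."""
    return 2.0 * (w - frame)


frames = [1.0, 1.0, 1.0, 1.0]
naive = ttt_naive(0.0, frames, toy_grad)
mwi = ttt_momentum_init(0.0, frames, toy_grad)
ltv = ttt_loop_video(0.0, frames, toy_grad)
print(naive, mwi, ltv)
```

In this toy setting the naive schedule forgets everything between frames, the momentum variant carries information forward in time, and the looping variant lets every frame benefit from the whole video, which mirrors the qualitative distinction the caption draws.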