Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Rui Qian, Shuangrui Ding, Dahua Lin

TL;DR

This work tackles efficient transfer of image foundation models to the video domain by adopting an object-centric perspective. It freezes the image encoder, decomposes each frame into a small set of object tokens via slot attention, and models temporal dynamics with object-time interaction layers that track state changes of individual objects. Two object-level losses (contrastive object distillation and a temporal state-change loss) supervise learning without object annotations, enabling strong action recognition performance with only a fraction of the tunable parameters, along with robust zero-shot video object segmentation. The results indicate gains in parameter efficiency, interpretability, and data efficiency from operating on compressed object-centric representations rather than dense frame features.

Abstract

Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top of the spatial module. However, these attempts are subject to limitations in efficiency and interpretability. In this paper, we propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective. Inspired by human perception, which identifies objects as key components for video understanding, we integrate a proxy task of object discovery into image-to-video transfer learning. Specifically, we adopt slot attention with learnable queries to distill each frame into a compact set of object tokens. These object-centric tokens are then processed through object-time interaction layers to model object state changes across time. Integrated with two novel object-level losses, we demonstrate the feasibility of performing efficient temporal reasoning solely on the compressed object-centric representations for video downstream tasks. Our method achieves state-of-the-art performance with fewer tunable parameters, only 5% of fully finetuned models and 50% of efficient tuning methods, on action recognition benchmarks. In addition, our model performs favorably in zero-shot video object segmentation without further retraining or object annotations, proving the effectiveness of object-centric video understanding.
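
To make the slot-attention step in the abstract concrete, the following is a minimal NumPy sketch of how learnable queries could distill one frame's patch features into a few object tokens. It is an illustration under stated assumptions, not the authors' implementation: the patch count P=196, the feature width D=64, the random projection matrices, and the plain weighted-mean slot update (standard slot attention typically refines slots with a GRU and an MLP) are all hypothetical choices. Only N=8 object tokens is taken from the figure captions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, queries, w_q, w_k, w_v, num_iters=3, eps=1e-8):
    """Simplified slot attention for a single frame.

    features: (P, D) frozen-encoder patch features (assumed shape).
    queries:  (N, D) learnable slot queries, shared across frames.
    Returns the (N, D) object tokens and the (N, P) attention map.
    """
    slots = queries
    k = features @ w_k                    # (P, D) keys
    v = features @ w_v                    # (P, D) values
    scale = k.shape[-1] ** -0.5
    for _ in range(num_iters):
        q = slots @ w_q                   # (N, D)
        logits = (q @ k.T) * scale        # (N, P)
        # Softmax over the slot axis: slots compete to explain each patch.
        attn = softmax(logits, axis=0)
        # Normalize over patches so each slot takes a weighted mean of values.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        # Simplification (assumption): replace the GRU/MLP update by the mean.
        slots = weights @ v               # (N, D)
    return slots, attn

rng = np.random.default_rng(0)
P, D, N = 196, 64, 8                      # P and D are illustrative assumptions
features = rng.standard_normal((P, D))
queries = rng.standard_normal((N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

slots, attn = slot_attention(features, queries, w_q, w_k, w_v)
masks = attn.argmax(axis=0)               # (P,) hard patch-to-slot assignment
```

Taking the argmax over slots for each patch gives a hard assignment that can be reshaped into a segmentation-style map, which is one plausible way to read the object decomposition visualized in the figures.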

Paper Structure

This paper contains 14 sections, 10 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Illustration of our proposed object-centric video understanding pipeline for image-to-video adaptation. Specifically, we first parse each frame into several object components (represented in different colors) to form an interaction space. Then we respectively establish inter-object interactions within each frame (represented by dotted arrows), and temporal state changes of individual objects (depicted by solid arrows).
  • Figure 2: An overview of our object-centric image-to-video adaptation framework. We use a frozen image pre-trained model to extract frame-wise features and pass them through a lightweight temporal fusion block. Then we employ slot attention with learnable queries to decompose each frame into a compact set of object tokens. Thereafter, we develop object-time interaction layers to establish inter-object interactions and build temporal state changes of individual objects, which are then pooled into the video-level feature for action recognition.
  • Figure 3: Visualization of object decomposition in the form of segmentation. Each column presents three frames in a video. The same color denotes the objects identified by the same object token.
  • Figure 4: Comparison on training data efficiency. Our method demonstrates advantages especially with a small amount of training data.
  • Figure 5: Visualization of the slot attention maps of all $N=8$ object tokens. We present the results without binarization.
  • ...and 3 more figures
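
The object-time interaction described in the captions of Figures 1 and 2 can be sketched roughly as two attention passes over a (T, N, D) grid of object tokens: one across objects within each frame, and one across time for each object. The NumPy code below is a hedged illustration only. The projection-free attention, the residual connections, the mean pooling, and the sizes T=8 and D=64 are assumptions made for brevity; the paper's actual layer design is not specified in this excerpt.

```python
import numpy as np

def self_attention(x):
    """Projection-free self-attention over the second-to-last axis of x (..., L, D)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])  # (..., L, L)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x                                           # (..., L, D)

def object_time_interaction(tokens):
    """tokens: (T, N, D) object tokens, where slot n is assumed to follow
    the same object across all T frames (shared learnable queries)."""
    # Inter-object interaction: attend over the N objects within each frame.
    tokens = tokens + self_attention(tokens)
    # Temporal state change: for each object, attend over its T time steps.
    per_object = np.swapaxes(tokens, 0, 1)                       # (N, T, D)
    per_object = per_object + self_attention(per_object)
    return np.swapaxes(per_object, 0, 1)                         # (T, N, D)

rng = np.random.default_rng(0)
T, N, D = 8, 8, 64                      # T and D are illustrative assumptions
tokens = rng.standard_normal((T, N, D))

out = object_time_interaction(tokens)
video_feature = out.mean(axis=(0, 1))   # (D,) pooled video-level feature
```

With these shapes, each attention pass works on sequences of length 8 rather than on every patch of every frame, which is the kind of saving that temporal reasoning on compressed object tokens is meant to provide.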