Object Concepts Emerge from Motion

Haoqian Liang, Xiaohui Wang, Zhichao Li, Ya Yang, Naiyan Wang

TL;DR

This work introduces a motion-guided, unsupervised framework for learning object-centric visual representations. It derives pseudo instance masks from motion boundaries in unlabeled videos and requires no camera calibration. Optical flow and simple clustering produce instance-level pseudo-labels, which are used to train encoders with a contrastive objective. The resulting features perform well on monocular depth estimation, 3D object detection, and 3D occupancy prediction, often outperforming supervised and other self-supervised baselines. The approach generalizes well to unseen indoor scenes and is complementary to existing foundation-model representations, suggesting that motion cues provide a scalable, instance-level abstraction for visual understanding. The findings argue for integrating biologically inspired, motion-based signals into scalable pretraining pipelines for robust, object-centric vision systems.

Abstract

Object concepts play a foundational role in human visual cognition, enabling perception, memory, and interaction in the physical world. Inspired by findings in developmental neuroscience, where infants are shown to acquire object understanding through observation of motion, we propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner. Our key insight is that motion boundaries serve as a strong signal for object-level grouping, which can be used to derive pseudo instance supervision from raw videos. Concretely, we generate motion-based instance masks using off-the-shelf optical flow and clustering algorithms, and use them to train visual encoders via contrastive learning. Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data. We evaluate our approach on three downstream tasks spanning both low-level (monocular depth estimation) and high-level (3D object detection and occupancy prediction) vision. Our models outperform previous supervised and self-supervised baselines and demonstrate strong generalization to unseen scenes. These results suggest that motion-induced object representations offer a compelling alternative to existing vision foundation models, capturing a crucial but overlooked level of abstraction: the visual instance. The corresponding code will be released upon paper acceptance.
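
To make the two stages named in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a dense flow field is already available as an `(H, W, 2)` array, uses plain k-means as a stand-in for the unspecified clustering algorithm, and uses a supervised-contrastive-style loss as a stand-in for the unspecified contrastive objective. The function names `cluster_flow` and `mask_contrastive_loss` are hypothetical.

```python
import numpy as np


def cluster_flow(flow, k=2, iters=10):
    """Group pixels by flow vector with toy k-means (an assumed stand-in).

    flow: (H, W, 2) array of per-pixel displacements.
    Returns an (H, W) integer map of pseudo-instance ids in [0, k).
    """
    h, w, _ = flow.shape
    x = flow.reshape(-1, 2).astype(np.float64)
    # Deterministic farthest-point initialisation: start from the slowest
    # pixel (likely background), then add the pixel farthest from all centers.
    centers = [x[np.argmin(np.linalg.norm(x, axis=1))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels.reshape(h, w)


def mask_contrastive_loss(features, labels, temperature=0.1):
    """Contrastive loss where pixels sharing a pseudo-mask id are positives.

    features: (N, D) per-pixel embeddings; labels: (N,) pseudo-instance ids.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    self_mask = np.eye(len(z), dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)  # exclude self-pairs
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos.any(axis=1)  # skip anchors with no positive partner
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos.sum(axis=1)[has_pos]
    return per_anchor.mean()


# Synthetic example: a static background with one 3x3 patch moving right.
flow = np.zeros((8, 8, 2))
flow[2:5, 2:5] = [2.0, 0.0]
pseudo_mask = cluster_flow(flow, k=2)

# Embeddings that agree with the pseudo-mask versus uninformative ones.
ids = pseudo_mask.ravel()
aligned = np.eye(2)[ids]
constant = np.ones((ids.size, 2))
loss_aligned = mask_contrastive_loss(aligned, ids)
loss_constant = mask_contrastive_loss(constant, ids)
```

In this toy setting the moving patch is separated from the background, and embeddings that agree with the pseudo-mask receive a lower loss than constant embeddings. A real pipeline would replace the synthetic flow with the output of an optical flow estimator and the NumPy loss with a differentiable one applied to encoder features.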

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Pipeline of the proposed method.
  • Figure 2: Examples of the pseudo-label generation results and the output features.
  • Figure 3: Similarity visualization for a set of reference points.
  • Figure 4: Examples of feature maps in out-of-domain scenes.
  • Figure 5: Examples of the pseudo-label generation results and the output features.