Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang

Abstract

Humans develop visual intelligence by perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting requires separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. Together these create a virtuous cycle in which initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and is robust to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.

Paper Structure (101 sections, 21 equations, 14 figures, 15 tables, 1 algorithm)

Figures (14)

  • Figure 1: Visual data complexity comparison. (a) ImageNet deng2009imagenet and (b) Kinetics-400 kay2017kinetics feature predominantly object-centric scenes with clear backgrounds or structured interactions. (c) Unconstrained egocentric videos venkataramanan2023imagenet present substantially greater complexity, featuring dense object interactions, severe occlusions, and continuous ego-motion, which together pose unique challenges for learning persistent object representations.
  • Figure 2: EgoViT adopts a Teacher-Student architecture, processing input frames $\{X^{t}\}_{t=1}^{T}$ and $\{P_n^t\}^{T,N}_{t=1,n=1}$. The student $g_\theta$ learns from three mechanisms: (1) depth regularization $\mathcal{L}_{\text{depth}}$; (2) proto-object learning $\mathcal{L}_{\text{proto}}$; (3) teacher-filtered temporal consistency $\mathcal{L}_{\text{temp}}$. The teacher network $g_{\theta'}$ is updated using EMA.
  • Figure 3: Proto-object Delineation via Teacher Attention.
  • Figure 4: (a) Depth Regularization: An auxiliary task, $\mathcal{L}_\text{depth}$, provides a geometric constraint. (b) Proto-Object Learning: A distillation loss, $\mathcal{L}_\text{proto}$, aligns student and teacher features at the proto-object level. Here, $H(y', y)$ denotes the cross-entropy between the softmax outputs of a teacher target $y'$ and a student prediction $y$. (c) Teacher-Filtered Temporal Consistency: A contrastive loss, $\mathcal{L}_\text{temp}$, is applied to reliable pairs filtered by the teacher to enforce temporal identity.
  • Figure 5: EgoViT achieves superior temporal attention stability. Compared to baselines (DINO, DoRA) that exhibit significant attention drift over time, our method maintains a coherent focus on the target object across the sequence, even through severe occlusion (see red circles for failure cases).
  • ...and 9 more figures
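
The captions for Figures 2 and 4 mention two generic building blocks: the cross-entropy $H(y', y)$ between the softmax outputs of a teacher target and a student prediction, and an EMA update of the teacher. The sketch below illustrates both in plain Python. It is not the paper's implementation: the function names, the temperature parameters, and the momentum value are assumptions made for illustration, and real models would operate on tensors rather than lists.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits.

    The temperature argument is an assumption; the captions do not specify one.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def cross_entropy(teacher_logits, student_logits,
                  teacher_temp=1.0, student_temp=1.0):
    """H(y', y) = -sum_k softmax(y')_k * log softmax(y)_k."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Exponential moving average: theta' <- m * theta' + (1 - m) * theta.

    The momentum value 0.99 is a placeholder, not a value taken from the paper.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

With these definitions, `cross_entropy` is smallest when the student's distribution matches the teacher's, and `ema_update` moves each teacher parameter a small step towards the corresponding student parameter.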