VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

Seul-Ki Yeom, Marcel Simon, Eunbin Lee, Tae-Ho Kim

TL;DR

This work proposes VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Experiments show that VINO effectively disentangles foreground from background.

Abstract

Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts (background textures and co-occurrence statistics). While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views, not as semantic pseudo-labels, VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.
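
The asymmetric view construction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `make_views`, the boolean instance-mask input, and the choice to suppress pixels by filling them with a constant are all assumptions made for illustration. The abstract only states that the teacher sees a foreground-union view with the background suppressed, and that each student view keeps the surrounding context while removing competing instances.

```python
import numpy as np


def make_views(image, instance_masks, fill_value=0.0):
    """Build one teacher view and one student view per object (illustrative only).

    image:          float array of shape (H, W, C)
    instance_masks: boolean array of shape (K, H, W), one mask per object,
                    assumed to come from a class-agnostic structural prior
    fill_value:     constant used to suppress pixels (an assumption)
    """
    masks = np.asarray(instance_masks, dtype=bool)

    # Teacher: keep the union of all foreground masks, suppress the background.
    union = masks.any(axis=0)
    teacher_view = np.where(union[..., None], image, fill_value)

    # Student: for object k, keep the background and object k,
    # but suppress every other (competing) instance.
    student_views = []
    for k in range(masks.shape[0]):
        others = np.delete(masks, k, axis=0).any(axis=0) & ~masks[k]
        student_views.append(np.where(others[..., None], fill_value, image))

    return teacher_view, student_views


# Toy usage: a 4x4 image with two non-overlapping 2x2 objects.
image = np.ones((4, 4, 3))
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True
masks[1, 2:, 2:] = True
teacher_view, student_views = make_views(image, masks)
```

In this toy case the teacher view retains only the two object regions, while the student view for the first object retains everything except the second object. Because the background is present in the student input but absent from the teacher target, it carries no information about that target, which is the intuition behind the bottleneck.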

Paper Structure (34 sections, 13 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Attention maps from ViT-S/16 encoders. We visualize attention maps from ViT-S/16 encoders on the same inputs, comparing DINO trained on ImageNet, DINO trained on WT-Venice, DoRA, and VINO. (a) shows results on single static natural images from PASCAL VOC 2012 (Everingham et al., 2012). (b) shows results on Physical AI video sequences from the Mobile ALOHA dataset (Fu et al., 2024), where attention is visualized across multiple frames within each sequence.
  • Figure 2: Our framework learns object-centric representations from dense video by enforcing a structural information bottleneck. (A) The Teacher observes a foreground-union global view in which the background is suppressed, providing a de-contextualized target. (B) The Student receives object-conditioned views that retain the background but remove co-occurring objects using a structural prior. (C) This asymmetric distillation makes background and co-occurrence shortcuts non-predictive, pushing representations toward object-intrinsic cues while retaining robustness to natural context. The total objective $\mathcal{L}_{\text{total}}$ ensures spatial de-contextualization ($\mathcal{L}_{\text{mask}}$), temporal object permanence ($\mathcal{L}_{\text{temp}}$), and part-to-whole consistency ($\mathcal{L}_{\text{local}}$).
  • Figure 3: Unsupervised object discovery on PASCAL VOC 2012. We visualize the predicted object bounding boxes obtained from attention-based foreground masks, following the default LOST procedure (Siméoni et al., 2021). We compare ViT-S/16 encoders on the same inputs: DINO trained on ImageNet, DINO trained on WT-Venice, DoRA, and VINO. Compared to the baselines, VINO produces tighter boxes that better align with the principal object extent and is less prone to drifting toward large background regions, highlighting improved figure-ground separation under dense ego-motion pretraining.
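
For readers unfamiliar with the CorLoc metric reported for unsupervised object discovery, the sketch below shows how it is commonly computed: an image counts as correctly localized when its single predicted box overlaps at least one ground-truth box with intersection-over-union above 0.5, and CorLoc is the percentage of such images. The helper names `box_iou` and `corloc`, the `(x_min, y_min, x_max, y_max)` box format, and the strict `>` comparison are assumptions for illustration; the paper's evaluation code may differ in these details.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predicted_boxes, ground_truth_boxes, iou_threshold=0.5):
    """Percentage of images whose predicted box matches any ground-truth box.

    predicted_boxes:    one box per image
    ground_truth_boxes: one list of boxes per image
    """
    hits = sum(
        any(box_iou(pred, gt) > iou_threshold for gt in gts)
        for pred, gts in zip(predicted_boxes, ground_truth_boxes)
    )
    return 100.0 * hits / len(predicted_boxes)


# Toy usage: the first prediction matches its ground truth, the second misses.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truths = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
score = corloc(predictions, ground_truths)  # 50.0
```

Since only one box per image is scored, tighter boxes that cover the principal object rather than background regions translate directly into a higher CorLoc.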