PooDLe: Pooled and dense self-supervised learning from naturalistic videos

Alex N. Wang; Christopher Hoang; Yuwen Xiong; Yann LeCun; Mengye Ren

TL;DR

PooDLe tackles self-supervised learning from dense, naturalistic video by unifying a dense, flow-equivariant objective with a pooled subcrop objective, augmented by a Spatial Decoder Module (SDM) that preserves small objects. The approach uses flow-informed cropping to generate aligned subcrops, enforcing equivariance to optical flow on dense feature maps and invariance on pooled, region-level representations. Empirical results on BDD100K and Walking Tours show state-of-the-art semantic segmentation and object detection performance, with notable gains on small and rare classes and solid transfer to Cityscapes and ADE20K. The study also analyzes cropping strategies and temporal sampling, providing practical guidelines for video SSL design and highlighting the importance of multi-scale integration in dense naturalistic data.

Abstract

Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose PooDLe, a self-supervised learning method that combines an invariance-based objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our results show that a unified objective applied at multiple feature scales is essential for learning effective image representations from naturalistic videos. We validate our method with experiments on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.
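
To make "equivariance to optical flow warping" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a backward flow field (from frame t+1 to frame t) given in feature-map pixels, uses nearest-neighbour warping for brevity, and uses a cosine-distance loss; the function names `warp_with_flow` and `dense_equivariance_loss` are hypothetical, and the paper's actual loss and interpolation may differ.

```python
import numpy as np


def warp_with_flow(feat, flow):
    """Nearest-neighbour backward warp (illustrative assumption).

    feat: (H, W, C) feature map of frame t.
    flow: (H, W, 2) flow from frame t+1 to frame t as (dx, dy) in pixels.
    Returns features aligned with frame t+1 and a (H, W) validity mask.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    # Pixels whose source location falls outside the frame are masked out.
    valid = (src_x >= 0) & (src_x < W) & (src_y >= 0) & (src_y < H)
    src_x = np.clip(src_x, 0, W - 1)
    src_y = np.clip(src_y, 0, H - 1)
    return feat[src_y, src_x], valid


def dense_equivariance_loss(feat_t, feat_t1, flow, eps=1e-8):
    """Mean cosine distance between warped frame-t features and frame-t+1 features."""
    warped, valid = warp_with_flow(feat_t, flow)
    a = warped / (np.linalg.norm(warped, axis=-1, keepdims=True) + eps)
    b = feat_t1 / (np.linalg.norm(feat_t1, axis=-1, keepdims=True) + eps)
    cos = (a * b).sum(axis=-1)
    if not valid.any():
        return 0.0
    return float((1.0 - cos)[valid].mean())
```

In this sketch the loss is near zero when the frame t+1 features are exactly the flow-warped frame t features, and grows when the features do not move consistently with the flow.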

Paper Structure (47 sections, 3 equations, 20 figures, 11 tables)

Introduction
Related Work
Self-supervised learning with iconic images.
Training using dense multi-subject images.
Learning image representations from video data.
PooDLe: Pooled and Dense Learning from naturalistic videos
Preliminaries.
Dense SSL with flow equivariance.
Pooled objective with flow-informed subcrops.
Spatial Decoder Module (SDM).
Experiments
Experiment Setup
Pretraining datasets.
Technical details.
Baselines.
...and 32 more sections

Figures (20)

Figure 1: Iconic image
Figure 2: Dense scene with global crops and subcrops
Figure 3: Class distribution
Figure 4: PooDLe, a self-supervised learning method that combines pooled and dense objectives. Green path: dense objective performing flow-equivariance learning on the output of the decoder $g(\cdot)$. Orange path: pooled objective encoding $K$ subcrops sampled with flow-informed cropping. Projector modules are not shown. Offline weights $\xi$ are the exponential moving average of online weights $\theta$.
Figure 5: Choice of subcrop area on small, large and all classes.
...and 15 more figures
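
The PooDLe overview caption above states that the offline weights $\xi$ are the exponential moving average of the online weights $\theta$. A minimal sketch of such an update follows. The momentum `tau` and its default value are assumptions made for illustration, since no value is given here, and `ema_update` is a hypothetical helper, not code from the paper.

```python
import numpy as np


def ema_update(offline, online, tau=0.99):
    """Return new offline weights: xi <- tau * xi + (1 - tau) * theta.

    offline, online: dicts mapping parameter names to NumPy arrays.
    tau: EMA momentum in [0, 1] (assumed value, for illustration only).
    """
    return {
        name: tau * offline[name] + (1.0 - tau) * online[name]
        for name in offline
    }
```

Called once per training step, this keeps the offline network a slowly moving copy of the online network; no gradients flow into the offline weights.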
