Learning and Leveraging World Models in Visual Representation Learning

Quentin Garrido; Mahmoud Assran; Nicolas Ballas; Adrien Bardes; Laurent Najman; Yann LeCun

TL;DR

The paper introduces Image World Models (IWM) within the Joint Embedding Predictive Architecture (JEPA) to learn latent-space transformations that go beyond masked image modeling. It identifies conditioning, transformation difficulty, and predictor capacity as key factors for successful IWMs and demonstrates that finetuning the predictor on top of a frozen encoder can match or surpass encoder finetuning with efficiency gains, while enabling multi-task learning. IWMs enable a controllable spectrum of representations from invariant to equivariant, allowing flexible downstream performance on classification and segmentation tasks. The work presents practical guidelines for building reusable world models in visual representation learning and highlights their potential to bridge contrastive and MIM paradigms with efficient adaptation across tasks.

Abstract

Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While it was previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a finetuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations like those of contrastive methods, or equivariant representations like those of masked image modeling.

Paper Structure (35 sections, 2 equations, 16 figures, 14 tables)

Introduction
Related works
Augmentation invariant Self-Supervised Learning
World modeling in visual representation learning
Method
Architecture and nomenclature
Learning an Image World Model for representation learning
Evaluating the quality of the world model
Learning a strong Image World Model
Visualizing predictions.
Leveraging world models for downstream tasks
Predictor finetuning
Multitask predictor tuning
Image World Models enable flexible representations
Conclusion and future perspectives
...and 20 more sections

Figures (16)

Figure 1: Visualisation of predictions in latent space with a learned Image World Model. We apply an action on a source image in latent space and retrieve the nearest neighbour of the predicted representation in a bank of 256 images. We see that IWM is capable of modeling transformations and undoing corruptions, showing an understanding of the underlying image transformations. Image from: https://ai.meta.com/blog/yann-lecun-advances-in-ai-research/
Figure 2: Multiple families of methods with related architectures can be distinguished; a key distinction among them is whether or not their world model is conditioned. Generative World Models are trained to invert a transformation in input space, leveraging an autoencoder framework. Methods for world modeling and representation learning can be instantiated in this way. Joint Embedding methods drop the world model but operate in latent space by encoding what is common between transformed inputs. They are the main class of SSL methods. JEPA World Models can be seen as a more general framework where a world model is trained in latent space. This family has been very successful both in reinforcement learning and in representation learning, and is where Image World Models (IWM) fall.
Figure 3: Finetuning efficiency. When taking into account the number of finetuned parameters, predictor finetuning is significantly more efficient than finetuning the encoder.
Figure 4: While the level of equivariance influences performance in the Linear and Predictor finetuning settings, it is hardly correlated with Attentive probing performance. This suggests that there is a trade-off in terms of the level of abstraction of the representation, and that different evaluation protocols evaluate different properties.
Figure 5: Image World Models allow representation modularity. Different families of methods offer representations with different properties, but IWM allows exploring the whole spectrum.
...and 11 more figures
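
The retrieval procedure described in the Figure 1 caption (apply an action in latent space, then look up the nearest neighbour of the predicted representation in a bank of 256 images) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `predict` is a hypothetical stand-in for the learned predictor (here a fixed random linear map conditioned on an action vector), the embeddings are random rather than produced by an encoder, and cosine similarity is an assumed choice of similarity measure.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, ACTION_DIM, BANK_SIZE = 32, 4, 256  # bank size taken from the Figure 1 caption

# Hypothetical predictor weights; a real IWM predictor would be a trained network.
W_z = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
W_a = rng.normal(size=(ACTION_DIM, DIM))

def predict(z, action):
    """Toy world model: map a source latent and an action to a predicted latent."""
    return z @ W_z + action @ W_a

def nearest_neighbour(query, bank):
    """Return the index of the bank entry with the highest cosine similarity to query."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return int(np.argmax(b @ q))

# Random stand-ins for encoder outputs (assumption: one vector per image).
bank = rng.normal(size=(BANK_SIZE, DIM))
source = rng.normal(size=DIM)
action = rng.normal(size=ACTION_DIM)  # e.g. parameters of a photometric transformation

predicted = predict(source, action)
idx = nearest_neighbour(predicted, bank)
print("retrieved bank index:", idx)
```

With a trained encoder and predictor in place of the random stand-ins, the retrieved bank image would serve as a visual proxy for the predicted latent representation.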
