Introduction to Latent Variable Energy-Based Models: A Path Towards Autonomous Machine Intelligence

Anna Dawid, Yann LeCun

TL;DR

The notes argue that current ML struggles with data efficiency and robust world modeling, which hinders human-like autonomous intelligence. They advocate latent-variable energy-based models and the joint embedding predictive architecture (JEPA), together with its hierarchical variant (H-JEPA), as a scalable path to predictive world models and hierarchical planning, trained with regularized objectives rather than solely supervised or reinforcement learning signals. The notes survey energy-based models, training strategies (contrastive and regularized), and classic examples, illustrating how JEPA and H-JEPA can handle multimodal data and uncertainty to enable autonomous decision-making. If realized, this approach could yield more sample-efficient, reasoning-capable systems with broad impact on autonomous driving, robotics, translation, and scientific modeling.

Abstract

Current automated systems have crucial limitations that need to be addressed before artificial intelligence can reach human-like levels and bring new technological revolutions. Among others, our societies still lack Level 5 self-driving cars, domestic robots, and virtual assistants that learn reliable world models, reason, and plan complex action sequences. In these notes, we summarize the main ideas behind the architecture of autonomous intelligence of the future proposed by Yann LeCun. In particular, we introduce energy-based and latent variable models and combine their advantages in the building block of LeCun's proposal, that is, in the hierarchical joint embedding predictive architecture (H-JEPA).

Paper Structure (35 sections, 14 equations, 11 figures, 1 table)

Figures (11)

  • Figure 1: The modular structure of an autonomous AI proposed by LeCun in Ref. [LeCunnPathTowardsAI]. Drawings generated by DALL-E 2 [DALLE].
  • Figure 2: Self-supervised learning (SSL). (a) In SSL, the system is trained to predict hidden parts of the input (in orange) from visible parts of the input (in blue). (b) SSL will play a central role in future AI systems. SSL is the cake (millions of bits of information per sample), supervised learning (SL) is the icing (10-10,000 bits per sample), and reinforcement learning (RL) is the cherry on top (a few bits of information for some samples). The cake image was generated by DALL-E 2 [DALLE].
  • Figure 3: Towards EBM. (a) To achieve multimodal predictions in high dimensions, we can replace probabilistic models with EBMs. Then, instead of minimizing a divergence measure between the prediction and the target, we look for $y$'s that satisfy a set of constraints posed by $x$, expressed as the energy function, $F(x,y)$. A trained EBM should assign low energies to $y$'s that are a good continuation of $x$ (in the case of video or text) or that are compatible with or similar to $x$ (in the case of images of an object taken from different angles). (b) An example energy function capturing the dependence between $x$ and $y$ (here $y = x^2$) in the training set, whose samples are shown as blue points. The applied architecture is presented in Fig. 5(a). Note that the energy function is not unique given only the training data set!
  • Figure 4: Latent variable EBM. (a) Inference in a latent variable EBM additionally includes minimization (or marginalization) with respect to the latent variable. (b) An example of a latent variable EBM for the problem of finding the distance of a green point $y$ from an ellipse learned from the training points, depicted as blue dots. Here the latent variable encodes the angle of the point on the ellipse that lies closest to $y$.
  • Figure 5: EBMs can collapse. (a) A standard deterministic architecture for prediction or regression, where the energy function $F_{\bm{w}}(x, y)$ is the distance between the NN prediction for $x$ and $y$ itself, is immune to collapse. (b) An example of an EBM that can collapse.
  • ...and 6 more figures
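
The inference step described in the caption of Figure 4 (minimizing the energy over a latent variable) can be illustrated with a small sketch. It is not the authors' code. It assumes a fixed ellipse with hypothetical semi-axes `A` and `B` (in the notes the ellipse is learned from data), takes the energy to be the squared distance between $y$ and the ellipse point at angle $z$, and replaces the minimization over $z$ with a simple grid search. The names `energy` and `free_energy` are illustrative.

```python
import numpy as np

# Hypothetical semi-axes of the ellipse (assumption: fixed here, learned in the notes).
A, B = 2.0, 1.0


def energy(y, z, a=A, b=B):
    """E(y, z): squared distance between y and the ellipse point at angle z."""
    point = np.stack([a * np.cos(z), b * np.sin(z)], axis=-1)
    return np.sum((np.asarray(y, dtype=float) - point) ** 2, axis=-1)


def free_energy(y, n_grid=3600):
    """F(y) = min_z E(y, z), approximated by a grid search over the angle z.

    Returns the minimal energy and the angle at which it is attained.
    """
    z = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    e = energy(y, z)
    i = int(np.argmin(e))
    return float(e[i]), float(z[i])


if __name__ == "__main__":
    f, z_star = free_energy([3.0, 0.0])
    print(f"F(y) = {f:.4f} at z = {z_star:.4f}")
```

Points on the ellipse get an energy close to zero, and the energy grows as $y$ moves away from it. The grid search is only a stand-in for the minimization: a gradient-based optimizer over $z$ would serve the same purpose in higher-dimensional latent spaces.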