Cognitively Inspired Energy-Based World Models

Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Aman Chadha, Jundong Li, Tariq Iqbal

TL;DR

The paper tackles the gap between traditional autoregressive world models and human-like cognition by introducing Energy-Based World Models (EBWM) that leverage an energy function to evaluate the compatibility of context with predicted futures. It presents the Energy-Based Transformer (EBT) as a domain-agnostic autoregressive architecture for EBMs and demonstrates that EBWM can achieve competitive data and compute efficiency in computer vision, with promising scaling in NLP. The approach supports four cognitive facets (internal-state shaping, prediction evaluation, dynamic computation, and uncertainty handling) through predictions in input space and energy-based evaluation, enabling a form of System 2 thinking and intelligent state-space search. The work provides detailed design and ablation analyses, shows scalability advantages, and discusses limitations and future directions, including broader domain application and larger-scale testing.

Abstract

One of the predominant methods for training world models is autoregressive prediction in the output space of the next element of a sequence. In Natural Language Processing (NLP), this takes the form of Large Language Models (LLMs) predicting the next token; in Computer Vision (CV), this takes the form of autoregressive models predicting the next frame/token/pixel. However, this approach differs from human cognition in several respects. First, human predictions about the future actively influence internal cognitive processes. Second, humans naturally evaluate the plausibility of predictions regarding future states. Based on this capability, and third, by assessing when predictions are sufficient, humans allocate a dynamic amount of time to make a prediction. This adaptive process is analogous to System 2 thinking in psychology. All these capabilities are fundamental to the success of humans at high-level reasoning and planning. Therefore, to address the limitations of traditional autoregressive models lacking these human-like capabilities, we introduce Energy-Based World Models (EBWM). EBWM involves training an Energy-Based Model (EBM) to predict the compatibility of a given context and a predicted future state. In doing so, EBWM enables models to achieve all three facets of human cognition described. Moreover, we developed a variant of the traditional autoregressive transformer tailored for Energy-Based models, termed the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales better with data and GPU Hours than traditional autoregressive transformers in CV, and that EBWM offers promising early scaling in NLP. Consequently, this approach offers an exciting path toward training future models capable of System 2 thinking and intelligently searching across state spaces.
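
To make the idea of "predicting the compatibility of a given context and a predicted future state" concrete, here is a minimal sketch in Python. It is not the paper's model: `toy_energy`, the linear-extrapolation rule, and the candidate names are hypothetical choices made purely for illustration. The only convention assumed is the standard one for energy-based models, where lower energy means a more compatible (more plausible) pair.

```python
import numpy as np

def toy_energy(context: np.ndarray, candidate: np.ndarray) -> float:
    """Hypothetical energy function (an assumption, not the paper's EBT).

    Scores a candidate future state by its squared distance from a linear
    extrapolation of the last two context states. Lower energy means the
    candidate is more compatible with the context.
    """
    extrapolated = 2.0 * context[-1] - context[-2]
    return float(np.sum((candidate - extrapolated) ** 2))

# Two observed states moving along the diagonal.
context = np.array([[0.0, 0.0], [1.0, 1.0]])

# Candidate future states to evaluate against the same context.
candidates = {
    "continues the trend": np.array([2.0, 2.0]),
    "stays put": np.array([1.0, 1.0]),
    "reverses": np.array([0.0, 0.0]),
}

# Evaluate the plausibility of every candidate, then keep the lowest energy.
energies = {name: toy_energy(context, c) for name, c in candidates.items()}
best = min(energies, key=energies.get)
print(energies, best)
```

In a trained EBWM the energy would come from a learned network rather than a hand-written rule, but the interface is the same: a context and a candidate go in, and a scalar plausibility score comes out.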

Paper Structure

This paper contains 41 sections, 15 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison of traditional autoregressive approaches and EBWM. (a) The Autoregressive (AR) Transformer is the most common approach and the one we compare scaling results against in this paper; (b) RNNs have become more popular within the past year [gu2023mamba, peng2023rwkv]. (c) is the most similar to EBWM, as it can allocate a dynamic amount of computation during inference, but it predicts the noise rather than the energy [höppe2022diffusion]. Several architectures are interchangeable, such as RetNet for Transformers [sun2023retentive], or Mamba/LSTM [gu2023mamba, hochreiter1997long] for RNNs. EBWM, being an Energy-Based Model (EBM), has all four cognitive capabilities described. Trans. stands for Transformer.
  • Figure 2: Architecture of EBWM-trained models for natural language processing and computer vision. MCMC is used to iteratively refine the prediction of a future state until the predicted energy converges. Each yellow box corresponds to a different "predicted future state" based on the current MCMC iteration (the first condition in this case is random).
  • Figure 3: The reconstruction loss across different scales of data, GPU hours, and FLOPs (curves that are lower and farther to the left are better). The third datapoint for traditional autoregressive models (TAMs) is the minimum loss value achieved during training, meaning that from then onward the model overfits the data. Note that the loss for EBWM decreases rapidly at first because its predictions are initially conditioned on random noise, whereas for TAMs simply copying the most recent embedding would already yield a low reconstruction loss, since videos have high temporal consistency. Across runs, we consistently observe that EBWM is less susceptible to overfitting. The lowest validation loss was achieved by EBWM.
  • Figure 4: The reconstruction loss across different scales of data, GPU hours, and FLOPs (lower and farther to the left curves are better) for the aggregated Kinetics-400 and SSV2 dataset. The results demonstrate the scalability of EBWM when compared to TAMs, achieving better scaling in both data and compute hours.
  • Figure 5: The perplexity across different scales of data, GPU hours, and FLOPs (curves that are lower and farther to the left are better) for the RedPajama-V2 dataset. The experimental results show the promising scaling of NLP models using EBWM. We believe that, with more resources to search over different hyperparameters, EBWM could perform significantly better. EBWM used a smaller batch size because it requires more memory and was therefore much slower, which resulted in fewer datapoints given our compute limit. *The first two points for the Transformer++ were perplexity values on the training set.
  • ...and 1 more figure
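
The refinement loop described in the Figure 2 caption (start from a random prediction, refine it iteratively, stop once the predicted energy converges) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: plain gradient descent stands in for the MCMC sampler, and the quadratic `energy`, its analytic gradient, and the `refine` helper with its `step_size`, `tol`, and `max_steps` parameters are all hypothetical.

```python
import numpy as np

def energy(context: np.ndarray, candidate: np.ndarray) -> float:
    """Hypothetical quadratic energy; a real EBWM would use a learned network."""
    target = 2.0 * context[-1] - context[-2]
    return float(np.sum((candidate - target) ** 2))

def energy_grad(context: np.ndarray, candidate: np.ndarray) -> np.ndarray:
    """Analytic gradient of the toy energy with respect to the candidate."""
    target = 2.0 * context[-1] - context[-2]
    return 2.0 * (candidate - target)

def refine(context, step_size=0.1, tol=1e-6, max_steps=1000, seed=0):
    """Refine a random initial prediction until the energy stops changing.

    Assumption: deterministic gradient descent replaces the MCMC sampler.
    Returns the refined prediction, its energy, and the number of steps used.
    """
    rng = np.random.default_rng(seed)
    candidate = rng.normal(size=context.shape[-1])  # random first prediction
    prev = energy(context, candidate)
    for step in range(1, max_steps + 1):
        candidate = candidate - step_size * energy_grad(context, candidate)
        cur = energy(context, candidate)
        if abs(prev - cur) < tol:  # energy has converged, stop early
            break
        prev = cur
    return candidate, cur, step

context = np.array([[0.0, 0.0], [1.0, 1.0]])
prediction, final_energy, steps_used = refine(context)
print(prediction, final_energy, steps_used)
```

Because the loop exits as soon as the energy change drops below `tol`, the number of steps depends on how far the initial guess is from a low-energy state, which is one simple way to picture a dynamic amount of computation per prediction.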