Energy-Based Transformers are Scalable Learners and Thinkers

Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal

TL;DR

This work introduces Energy-Based Transformers (EBTs), a class of Energy-Based Models integrated with Transformer architectures to enable System 2 Thinking through explicit verification and gradient-based energy minimization. By learning an energy landscape over input–prediction pairs and training with optimization-based objectives, EBTs achieve faster learning scaling and exhibit stronger thinking capabilities across text, video, and image modalities without external supervision. The approach yields improved out-of-distribution generalization and reduced compute during inference for certain tasks, while also demonstrating enhanced uncertainty modeling through energy landscapes. Collectively, EBTs offer a principled, unsupervised pathway to scalable learning and thinking that could extend to foundation-model-scale problems and multimodal domains.

Abstract

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving up to a 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
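
The abstract's "gradient descent-based energy minimization" can be written as a single update rule. This is a sketch in notation assumed here, not taken from this page: $x$ is the input (context), $\hat{y}_i$ the candidate prediction after $i$ steps, $E_\theta$ the learned energy function, and $\alpha$ a step size.

```latex
% Assumed notation: x = input, \hat{y}_i = candidate prediction at step i,
% E_\theta = learned energy (lower means more compatible), \alpha = step size.
\hat{y}_{i+1} = \hat{y}_i - \alpha \, \nabla_{\hat{y}} E_\theta\!\left(x, \hat{y}_i\right),
\qquad i = 0, 1, \dots, N-1
```

Under this reading, $\hat{y}_0$ is an initial guess, each step lowers the energy assigned to the pair $(x, \hat{y}_i)$, and the final iterate $\hat{y}_N$ is the prediction.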

Paper Structure

This paper contains 76 sections, 13 equations, 19 figures, 8 tables, 2 algorithms.

Figures (19)

  • Figure 1: Autoregressive Architecture Comparison. (a) Autoregressive (AR) Transformer is the most common, with (b) RNNs becoming more popular recently [gu2023mamba, peng2023rwkv]. (c) Diffusion Transformers [peebles2023scalable] (which are often bidirectional but can also be autoregressive) are the most similar to EBT, being able to dynamically allocate computation during inference, but predict the noise rather than the energy [li2025autoregressive, deng2024causal]. Consequently, diffusion models cannot give unnormalized likelihood estimates at each step of the thinking process, and are not trained as explicit verifiers, unlike EBTs.
  • Figure 2: EBT for Autoregressive Modeling. Each blue box corresponds to a different prediction based on the current step of the thinking process, where the initial prediction starts as random. At each step, a new prediction is fed into the model, which gives an energy scalar for the prediction's current compatibility (unnormalized likelihood) with the context (the prediction-uncertainty and prediction-verification facets). Then, the gradient of this energy with respect to the prediction is calculated and used to update the prediction. This gradient descent update is done iteratively to refine the prediction until convergence of the predicted energy, which allows for dynamic use of computation (the dynamic-computation facet).
  • Figure 3: Thinking Process Visualization. A learned energy landscape and its optimization through gradient descent, interpreted as a thinking process. In this example, the model predicts a distribution over text tokens, progressively shifting from an initial random distribution toward the target distribution. At each step, the EBM assigns an energy scalar indicating how compatible the current prediction is with the context, visualized as the landscape's height (the prediction-verification facet). This scalar's convergence allows the model to determine whether the prediction is adequate or if further thinking is necessary. Uncertainty (the prediction-uncertainty facet) can be represented by landscapes that are harder to optimize or by landscapes with many local minima, allowing the model to know when it requires more steps to think (the dynamic-computation facet). Adapted from [li2018visualizing].
  • Figure 4: Language Learning Scalability: Data, Batch Size, and Depth. A comparison between the scaling of the Transformer++ recipe [touvron2023llama] and EBTs across data, batch size, and depth during pretraining. On all of these axes, EBTs out-scale the Transformer++ recipe significantly, indicating improved data efficiency and suggesting potential benefits for generalization. Additionally, the improved depth scaling offers promise for reasoning, where depth is crucial [ye2024physics]. If these scaling trends persist, EBTs would likely outperform Transformer++ models at foundation-model data scale.
  • Figure 5: Language Learning Scalability: Parameters, FLOPs, and Width. Pretraining scaling comparisons between the Transformer++ recipe [touvron2023llama] and EBTs across model size (parameters), compute (FLOPs), and width (embedding dimension). EBTs slightly out-scale the Transformer++ in FLOP and parameter scaling, becoming, to our knowledge, the first approach to achieve a higher scaling rate without modifying the tokenizer [pagnoni2024byte]. These results suggest that EBTs are a promising pretraining paradigm in both parameter and FLOP efficiency as scale increases.
  • ...and 14 more figures
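
The loop described in the captions for Figures 2 and 3 (start from a random prediction, score it with an energy, follow the energy gradient, stop once the energy converges) can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the quadratic `toy_energy` stands in for a trained EBT, its gradient is written analytically rather than obtained by backpropagation, and every name and default (`think`, `step_size`, `tol`, `max_steps`) is hypothetical.

```python
import numpy as np


def toy_energy(x, y, W):
    """Hypothetical energy: low when the candidate y is close to W @ x."""
    diff = y - W @ x
    return 0.5 * float(diff @ diff)


def toy_energy_grad(x, y, W):
    """Analytic gradient of toy_energy with respect to the candidate y."""
    return y - W @ x


def think(x, W, y0, step_size=0.5, tol=1e-8, max_steps=100):
    """Refine y0 by gradient descent on the energy until the energy converges.

    Returns the final prediction and the energy recorded at every step.
    """
    y = y0.copy()
    energies = [toy_energy(x, y, W)]
    for _ in range(max_steps):
        # One "thinking step": move the prediction downhill on the energy.
        y = y - step_size * toy_energy_grad(x, y, W)
        energies.append(toy_energy(x, y, W))
        # Stop early once the energy no longer changes meaningfully.
        if abs(energies[-2] - energies[-1]) < tol:
            break
    return y, energies


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # stand-in for learned parameters (assumption)
x = rng.normal(size=4)        # context / input
y0 = rng.normal(size=3)       # random initial prediction
y_hat, energies = think(x, W, y0)
print(f"steps used: {len(energies) - 1}, final energy: {energies[-1]:.2e}")
```

Because the stopping rule depends on the energy trace rather than a fixed step count, an easy input can finish in a few steps while a harder one uses more, which is the dynamic-computation behaviour the captions describe. In a real EBT the energy would come from a Transformer forward pass and the gradient from automatic differentiation.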

Theorems & Definitions (1)

  • Definition C.1: System 2 Thinking