Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle

Ruifeng Ren; Sheng Ouyang; Huayi Tang; Yong Liu

TL;DR

The paper reframes Transformer attention as an energy-based optimization problem by mapping softmax attention to minimizing Helmholtz free energy $F^*$ with per-token energy $E_i$. It generalizes to linear and multi-head attentions within this framework and derives GD-based variants, including momentum-based (${\rm MomenMHA}$) and Newton-type (${\rm MHA2nd1st}$) attentions, to accelerate inference. Empirical results on MiniPile with GPT-like models show faster convergence for momentum-based variants and competitive performance for light Newton variants, supporting the practicality of energy-guided attention design. The framework provides a principled blueprint for designing new attention mechanisms and connects to broader themes like Loop Transformers and test-time optimization, suggesting fertile future directions for efficient, theory-informed architectures.

Abstract

Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration. The energy-based perspective has long provided a valuable principle for understanding neural computation. In this paper, we revisit the principle of energy as a lens to understand attention-based Transformer models. We present a unified energy-based framework which is composed of three key components: the global energy $F^*$, the energy function $E_i$ and the employed gradient descent (GD) form. Within this framework, standard softmax attention can be viewed as a special case of minimizing the Helmholtz free energy as $F^*$ using standard GD when $E_i$ takes the form of elastic potential energy, with residual connections ensuring that this optimization proceeds in an incremental manner. In addition, linear attentions can also be naturally incorporated into this framework by adjusting the corresponding energy forms. We also extend the above analysis to the multi-head setting, where the energy is defined across multiple low-dimensional subspaces. Building on this framework, we propose energy-based modifications of attention structures. Inspired by classical GD algorithms, we extend the original attention formulation based on standard GD to the momentum-based GD, Nesterov Accelerated Gradient (NAG), and Newton's method variants, each inducing a corresponding new attention structure. Our experiments provide preliminary support for the potential of the energy-based framework for designing attention mechanisms.
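As a rough illustration of the abstract's central claim, the sketch below takes the elastic energy $E_i = \tfrac{1}{2}\lVert z - h_i\rVert^2$, forms the Helmholtz free energy $F = -T \log \sum_i e^{-E_i/T}$, and applies one gradient step to a query vector $z$. Everything here is an assumption made for illustration: the function names, the temperature `temp`, the step size `lr`, the momentum coefficient `beta`, and the absence of learned projection matrices are not taken from the paper, whose exact parameterization may differ.

```python
import math

def energies(z, keys):
    # Elastic (spring-like) potential E_i = 0.5 * ||z - h_i||^2 for each token h_i.
    return [0.5 * sum((a - b) ** 2 for a, b in zip(z, h)) for h in keys]

def free_energy(z, keys, temp=1.0):
    # Helmholtz free energy F = -T * log sum_i exp(-E_i / T), via a stable log-sum-exp.
    scaled = [-e / temp for e in energies(z, keys)]
    m = max(scaled)
    return -temp * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def attention_weights(z, keys, temp=1.0):
    # Boltzmann weights softmax(-E_i / T); these play the role of attention scores.
    scaled = [-e / temp for e in energies(z, keys)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]

def grad_free_energy(z, keys, temp=1.0):
    # dF/dz = sum_i p_i * (z - h_i) = z - sum_i p_i * h_i.
    p = attention_weights(z, keys, temp)
    return [zj - sum(pi * h[j] for pi, h in zip(p, keys)) for j, zj in enumerate(z)]

def gd_step(z, keys, lr=0.5, temp=1.0):
    # One plain GD step; equals (1 - lr) * z + lr * (softmax-weighted token average).
    g = grad_free_energy(z, keys, temp)
    return [zj - lr * gj for zj, gj in zip(z, g)]

def momentum_step(z, velocity, keys, lr=0.5, beta=0.9, temp=1.0):
    # Heavy-ball variant (hypothetical form): accumulate gradients in `velocity`.
    g = grad_free_energy(z, keys, temp)
    velocity = [beta * v + gj for v, gj in zip(velocity, g)]
    return [zj - lr * vj for zj, vj in zip(z, velocity)], velocity
```

Under these assumptions, the plain GD update is $z \leftarrow (1-\eta)\,z + \eta \sum_i p_i h_i$ with $p = \mathrm{softmax}(-E/T)$, so the old state is kept and a softmax-weighted average of the tokens is added to it. This is the sense in which a residual connection makes the optimization incremental.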

