EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
Travis Davies, Yiqi Huang, Alexi Gladstone, Yunxin Liu, Xiang Chen, Heng Ji, Huxian Liu, Luhui Hu
TL;DR
EBT-Policy replaces diffusion-based implicit policies with an energy-based trajectory optimization framework using Energy-Based Transformers (EBTs). It learns a scalar energy $E_\theta$ over multimodal inputs and performs adaptive inference via Langevin-style dynamics, eliminating fixed noise schedules. The approach delivers faster inference (as few as $2$ steps on some tasks, versus $100$ for Diffusion Policy) and improved robustness to distribution shifts, with emergent retry behavior observed without explicit retry training. This work advances embodied reasoning in robotics by unifying perception, reasoning, and control under a single energy function, and it highlights practical benefits for real-world manipulation.
Abstract
Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.
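To make the idea of energy-guided inference with dynamic compute allocation concrete, the following is a minimal sketch and not the EBT-Policy implementation. It assumes a toy quadratic energy in place of the learned transformer energy; the names `toy_energy`, `toy_energy_grad`, and `infer_action`, and the parameters `step_size`, `noise_scale`, and `grad_tol`, are all hypothetical. The loop descends the energy gradient with optional Langevin-style noise and stops once the gradient norm falls below a tolerance, so the number of steps adapts to the input rather than following a fixed schedule.

```python
import numpy as np

def toy_energy(obs, action):
    """Hypothetical stand-in for a learned energy E_theta(obs, action).

    A quadratic bowl whose minimum is at tanh(obs); a real model would be
    a neural network producing a scalar.
    """
    target = np.tanh(obs)
    return 0.5 * np.sum((action - target) ** 2)

def toy_energy_grad(obs, action):
    """Analytic gradient of toy_energy with respect to the action."""
    return action - np.tanh(obs)

def infer_action(obs, step_size=0.5, noise_scale=0.0,
                 max_steps=100, grad_tol=1e-3, seed=0):
    """Energy-minimizing inference with an adaptive stopping rule.

    All hyperparameters here are illustrative assumptions.
    Returns the action, the number of update steps used, and the final energy.
    """
    rng = np.random.default_rng(seed)
    action = rng.standard_normal(obs.shape)  # random initial action guess
    steps_used = 0
    for _ in range(max_steps):
        grad = toy_energy_grad(obs, action)
        # Stop early once the landscape is locally flat (near a minimum).
        if np.linalg.norm(grad) < grad_tol:
            break
        # Gradient step, plus optional Langevin-style noise (off by default).
        noise = noise_scale * rng.standard_normal(obs.shape)
        action = action - step_size * grad + noise
        steps_used += 1
    return action, steps_used, toy_energy(obs, action)

if __name__ == "__main__":
    obs = np.array([0.2, -1.0, 3.0])
    action, steps, energy = infer_action(obs)
    print(f"steps used: {steps}, final energy: {energy:.2e}")
```

In this sketch the scalar energy doubles as a convergence signal: inputs whose initial guess already lies near a minimum terminate after few updates, while harder cases use more of the step budget. That is the sense in which a scalar energy can support dynamic compute allocation.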
