EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities

Travis Davies, Yiqi Huang, Alexi Gladstone, Yunxin Liu, Xiang Chen, Heng Ji, Huxian Liu, Luhui Hu

TL;DR

EBT-Policy replaces diffusion-based implicit policies with an energy-based trajectory optimization framework using Energy-Based Transformers (EBTs). It learns a scalar energy $E_\theta$ over multimodal inputs and performs adaptive inference via Langevin-style dynamics, eliminating fixed noise schedules. The approach delivers faster inference (often $2$ steps vs up to $100$ for diffusion) and improved robustness to distribution shifts, with emergent retry behavior observed without explicit retry training. This work advances embodied reasoning in robotics by unifying perception, reasoning, and control under a single energy function and highlighting practical benefits for real-world manipulation.
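
As a rough illustration of what "Langevin-style dynamics" on a scalar energy usually means, one generic form of the update is written below. The symbols are assumptions introduced here and are not taken from the paper: $z_k$ is the candidate action trajectory at step $k$, $o$ the multimodal observation, $\alpha$ a step size, and $\sigma$ a noise scale. The paper's exact update rule may differ.

$$
z_{k+1} = z_k - \alpha \, \nabla_{z} E_\theta(o, z_k) + \sigma \, \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0, I)
$$

Setting $\sigma = 0$ recovers plain gradient descent on the energy. Because the iteration can stop once $E_\theta$ stops decreasing, the number of steps need not be fixed in advance.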

Abstract

Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: EBT-Policy Diagram. EBT-Policy searches for a low-energy action trajectory in Cartesian or joint space ($z$) via energy minimization. Further experiments will be added in a later update.
  • Figure 2: Explaining Uncertainty Modeling. 12 frames are grouped into three phases: (1) Tool Insertion, (2) Hook Hanging Attempt, and (3) Recovery & Successful Retry. The color bar beneath each frame encodes the per-frame energy predicted by the model, where lower energy indicates higher certainty. Notably, red (Step 7) marks the failure that triggers an EBT-Policy retry, while green (Step 11) marks the successful correction. Together, these steps highlight EBT-Policy’s interpretability and physical reasoning: it uses energy-based uncertainty to decide whether to continue or retry and how to adjust actions. Explaining Energy Minimization. EBT-Policy receives inputs (RGB frames, robotic proprioception, and language instructions) and assigns an energy to candidate action trajectories. Starting from a noisy initialization, the trajectory is iteratively updated by gradient descent on this energy, progressing from the starting state to a final executable plan. Optimization terminates when the energy converges to a minimum, as illustrated by the energy-landscape sketch.
  • Figure 3: Demonstrations from tabletop, real-world tasks.
  • Figure 4: Representative tasks in robomimic.
  • Figure 5: Success Rates During Training. EBT-Policy exhibits rapid performance improvement, reaching $100\%$ success by epoch 30, using just 2 iterations for predicting actions. Diffusion Policy (DP), on the other hand, only reaches a $100\%$ success rate after $90$ epochs, and uses $50$ times more steps than EBT-Policy at inference, demonstrating how EBT-Policy is more efficient than DP during both training and inference.
  • ...and 1 more figure
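
The energy-minimization loop described in the Figure 2 caption (noisy initialization, gradient descent on the energy, termination once the energy converges) can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_energy`, `toy_energy_grad`, and `minimize_energy` are hypothetical names, and a simple quadratic energy with an analytic gradient stands in for the learned Energy-Based Transformer.

```python
import numpy as np


def toy_energy(z, target):
    """Hypothetical stand-in for a learned energy: half the squared
    distance between a candidate trajectory z and a target trajectory."""
    return 0.5 * float(np.sum((z - target) ** 2))


def toy_energy_grad(z, target):
    """Analytic gradient of toy_energy with respect to z."""
    return z - target


def minimize_energy(target, step_size=0.5, tol=1e-6, max_steps=100, seed=0):
    """Gradient descent on the toy energy, starting from noise.

    Stops early once the energy changes by less than `tol` between
    consecutive steps, so the step count adapts to the problem.
    Returns the final trajectory, its energy, and the steps used.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=target.shape)  # noisy initialization
    energy = toy_energy(z, target)
    steps = 0
    for _ in range(max_steps):
        z = z - step_size * toy_energy_grad(z, target)
        new_energy = toy_energy(z, target)
        steps += 1
        converged = abs(energy - new_energy) < tol
        energy = new_energy
        if converged:
            break
    return z, energy, steps


if __name__ == "__main__":
    # A toy "trajectory" of 8 waypoints in 2 dimensions (an assumption).
    target = np.linspace(0.0, 1.0, 16).reshape(8, 2)
    z, energy, steps = minimize_energy(target)
    print(f"final energy {energy:.2e} after {steps} steps")
```

In a learned model the gradient would come from automatic differentiation through the network rather than a closed form, and the final energy could serve as the kind of per-step uncertainty signal the caption describes. The convergence test is the design choice worth noting: it lets easy inputs finish in few steps while harder ones use more.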