FLARE: Robot Learning with Implicit World Modeling

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, Linxi Fan

TL;DR

FLARE introduces a lightweight latent world modeling paradigm for robot learning by aligning future latent representations with current action-conditioned states in a diffusion-transformer policy. It adds learnable future tokens to the model, trained with a cosine-alignment loss to future observation embeddings while preserving action denoising through flow-matching. The approach achieves state-of-the-art results on RoboCasa and GR1 multitask benchmarks, enables data-efficient post-training with cross-embodiment pretrained embeddings, and can leverage unlabeled human egocentric videos to improve generalization to novel objects. This work demonstrates that implicit future-state reasoning within a VLA-extended diffusion framework can substantially enhance robotic manipulation performance with minimal architectural changes.
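The two training signals named above, action denoising by flow matching and cosine alignment of future-token features, can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation. The function names, the linear interpolation convention behind the velocity target (`action - noise`), the negative-mean-cosine form of the alignment term, and the weight `lam` are all assumptions made for this example.

```python
import math

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def alignment_loss(future_token_feats, target_embeds):
    """Negative mean cosine similarity between the policy's future-token
    features and the embeddings of a future observation (assumed form)."""
    sims = [cosine_similarity(f, g)
            for f, g in zip(future_token_feats, target_embeds)]
    return -sum(sims) / len(sims)

def flow_matching_loss(pred_velocity, action, noise):
    """Mean squared error between the predicted velocity and the target
    velocity. The target `action - noise` assumes a straight-line path
    from noise to action; other sign conventions exist."""
    target = [a - n for a, n in zip(action, noise)]
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, target)) / len(target)

def flare_style_loss(pred_velocity, action, noise,
                     future_token_feats, target_embeds, lam):
    """Combined objective: action loss plus `lam` times alignment loss.
    `lam` is a hypothetical weighting hyperparameter."""
    return (flow_matching_loss(pred_velocity, action, noise)
            + lam * alignment_loss(future_token_feats, target_embeds))

# Toy usage: a perfect velocity prediction and perfectly aligned features.
loss = flare_style_loss(
    pred_velocity=[1.0, -1.0], action=[2.0, 0.0], noise=[1.0, 1.0],
    future_token_feats=[[1.0, 0.0]], target_embeds=[[2.0, 0.0]],
    lam=0.5,  # arbitrary value chosen for the example
)
```

Because cosine similarity ignores vector magnitude, the alignment term only asks the future-token features to point in the same direction as the target embeddings, which leaves the action-denoising term to shape the rest of the representation.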

Abstract

We introduce Future LAtent REpresentation Alignment (FLARE), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, FLARE enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, FLARE requires only minimal architectural modifications -- adding a few tokens to standard vision-language-action (VLA) models -- yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, FLARE achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, FLARE unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to a novel object with unseen geometry with as few as a single robot demonstration. Our results establish FLARE as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.

Paper Structure

This paper contains 20 sections, 6 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of FLARE to a conventional flow-matching (or diffusion) policy. FLARE can train using both action flow-matching and future latent alignment objectives, leading to improved performance as well as enabling learning from video-only data such as human ego-view demonstrations.
  • Figure 2: FLARE architecture. State and action token embeddings are concatenated into a sequence with learnable future token embeddings. The flow matching DiT blocks perform self-attention on this sequence, and cross-attention to the current vision and text observation embeddings. At a middle layer, the activations corresponding to the future token embeddings are used to compute a future latent alignment loss, which is the cosine similarity with vision-language embeddings from a future observation.
  • Figure 2: Ablation of target embedding models.
  • Figure 3: Data mixture of pretrained action-aware vision language embedding model
  • Figure 4: Multitask Simulation Benchmarks: We use 24 RoboCasa and 24 GR-1 tabletop manipulation tasks as a multitask simulation benchmark suite in this paper.
  • ...and 8 more figures
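
The captions for Figures 1 and 2 describe a token sequence made of state, action, and learnable future tokens, and training on video-only data that has no action labels. The sketch below illustrates that bookkeeping under stated assumptions: the token order (state, then action, then future), the helper names, and the choice to keep only the weighted alignment term for video-only samples are hypothetical, not taken from the paper.

```python
def build_token_sequence(state_tokens, action_tokens, future_tokens):
    """Concatenate the three token groups and return slices that locate
    the action and future tokens. The ordering is an assumption."""
    seq = list(state_tokens) + list(action_tokens) + list(future_tokens)
    n_state = len(state_tokens)
    n_action = len(action_tokens)
    action_slice = slice(n_state, n_state + n_action)
    future_slice = slice(n_state + n_action, len(seq))
    return seq, action_slice, future_slice

def per_sample_loss(action_loss, align_loss, has_actions, lam):
    """Combine losses for one sample. Video-only samples (no action
    labels) contribute only the alignment term (assumed behavior)."""
    if has_actions:
        return action_loss + lam * align_loss
    return lam * align_loss

# Toy usage with string placeholders standing in for embeddings.
seq, action_slice, future_slice = build_token_sequence(
    ["s0"], ["a0", "a1"], ["f0", "f1", "f2"]
)
future_activations = seq[future_slice]  # would feed the alignment loss
robot_loss = per_sample_loss(1.0, -0.5, has_actions=True, lam=0.5)
video_loss = per_sample_loss(None, -0.5, has_actions=False, lam=0.5)
```

In a real model the slices would index activations taken at a middle DiT layer rather than the input tokens themselves; the indexing logic is the same.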