
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu

Abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.
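The two-stage training paradigm described above can be summarized in code. The following is a minimal sketch, not the paper's implementation: hypothetical stand-in modules (`System2Foresight`, `System1Policy`) replace the VLM and the DiT-based action expert, a plain regression loss stands in for the diffusion objective, and all names and dimensions are illustrative. It only shows how System-1's conditioning switches from ground-truth future features (decoupled warmup) to System-2's predicted foresight (end-to-end), with the MSE foresight loss applied in both stages.

```python
# Minimal sketch of DIAL's two-stage training schedule. Hypothetical stand-in
# modules and a frozen linear "ViT" placeholder; names, dimensions, and losses
# are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

D_VIT, D_ACT, CHUNK = 256, 7, 8          # illustrative feature/action dimensions

class System2Foresight(nn.Module):
    """Stand-in for the VLM: predicts latent visual foresight from current features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_VIT, 512), nn.GELU(), nn.Linear(512, D_VIT))
    def forward(self, cur_feat):
        return self.net(cur_feat)         # latent foresight x_t

class System1Policy(nn.Module):
    """Stand-in for the action expert: decodes actions from current + foresight features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * D_VIT, 512), nn.GELU(),
                                 nn.Linear(512, CHUNK * D_ACT))
    def forward(self, cur_feat, foresight):
        out = self.net(torch.cat([cur_feat, foresight], dim=-1))
        return out.view(-1, CHUNK, D_ACT)

vit = nn.Linear(768, D_VIT)               # placeholder for the frozen, shared ViT
for p in vit.parameters():
    p.requires_grad_(False)
system2, system1 = System2Foresight(), System1Policy()
opt = torch.optim.AdamW(list(system2.parameters()) + list(system1.parameters()), lr=1e-4)

def training_step(img_t, img_future, actions, stage):
    cur_feat = vit(img_t)                       # features of o_t
    gt_future = vit(img_future).detach()        # ground-truth features of o_{t+H}
    foresight = system2(cur_feat)               # latent world modeling
    foresight_loss = nn.functional.mse_loss(foresight, gt_future)

    if stage == "warmup":
        # Decoupled: System-1 is guided by ground-truth future features,
        # so no action gradient reaches System-2 yet.
        pred_actions = system1(cur_feat, gt_future)
    else:
        # End-to-end: System-1 conditions on the predicted foresight, so
        # action-aware gradients flow back into System-2.
        pred_actions = system1(cur_feat, foresight)

    action_loss = nn.functional.mse_loss(pred_actions, actions)
    loss = action_loss + foresight_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for images and action chunks.
img_t, img_future = torch.randn(4, 768), torch.randn(4, 768)
actions = torch.randn(4, CHUNK, D_ACT)
for stage in ("warmup", "joint"):
    training_step(img_t, img_future, actions, stage)
```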


Paper Structure

This paper contains 34 sections, 4 equations, 12 figures.

Figures (12)

  • Figure 1: Overview of the DIAL Framework. DIAL bridges high-level decision making and low-level motor control through a differentiable latent intent bottleneck. (Left) System-2 (VLM) performs latent world modeling (LWM) to synthesize latent visual foresight within its native ViT feature space. This foresight serves as a structural bottleneck to convey the VLM's intent, which System-1 (Policy) then decodes into actions via latent inverse dynamics. A decoupled-to-unified training paradigm ensures stability, leveraging initial alignment in a consistent latent space to facilitate subsequent end-to-end refinement via action-aware gradients. (Right) Powered by this structural grounding, DIAL scales across heterogeneous human-robot data, achieving SOTA performance with $10\times$ higher data efficiency and robust zero-shot generalization to unseen real-world configurations.
  • Figure 2: Comparison of VLA Architectures. (Left) Hierarchical Models decouple reasoning and execution via text or pixels, resulting in non-differentiable gaps and significant deployment latency. (Middle) End-to-End VLAs map multimodal features directly to actions. Even when auxiliary tasks are used, they are typically treated as optional context, which cannot strictly guarantee that actions are grounded in the VLM’s intent. (Right) DIAL (Ours) introduces a differentiable latent bottleneck. By requiring System-1 to bridge the gap between current visual features and System-2’s predicted latent foresight, DIAL ensures that execution is inherently anchored to the VLM’s predictive intent.
  • Figure 3: The Dual-System Architecture of DIAL. Built upon a pre-trained VLM, System-2 (top) synthesizes a latent foresight ($x_t$) from language ($l_t$), current visual observation ($o_t$), and learnable queries via its LLM backbone and an MLP head. System-1 (bottom) employs self-attention to fuse current and foresight visual features, and the fused features serve as the cross-attention condition for a DiT-based action decoder. This decoder directly takes the projected proprioceptive state ($q_t$) and noisy action tokens to generate action chunks. To ensure feature consistency, both systems share the VLM's frozen pre-trained ViT. As indicated by the switches, training transitions from a decoupled warmup (conditioned on ground-truth features of $o_{t+H}$) to end-to-end optimization (conditioned on $x_t$). Throughout both stages, an MSE loss is applied to align the latent foresight with ground-truth features. (A minimal code sketch of this forward pass follows the figure list.)
  • Figure 4: Examples from the 24 RoboCasa GR1 Tabletop Tasks, including object rearrangement (e.g., Croissant to Box) and interaction with articulated fixtures (e.g., Bottle to Cabinet).
  • Figure 5: Real-world Tasks and Data Sources. Comparison between human demonstrations from the EgoDex dataset and corresponding robot executions (Pick & Place and Pouring) used for cross-embodiment learning on the IRON-R01-1.11 robot.
  • ...and 7 more figures
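To make the data flow in the Figure 3 caption concrete, here is a minimal, hypothetical sketch of the dual-system forward pass. Generic transformer layers stand in for the pre-trained LLM backbone and the DiT blocks, and all module names, token counts, and dimensions (`D`, `N_QUERY`, `CHUNK`, etc.) are assumptions; only the wiring follows the caption: System-2 consumes language tokens, current ViT features, and learnable queries to produce the latent foresight $x_t$, and System-1 fuses current and foresight features via self-attention into the cross-attention condition for an action decoder that takes the projected proprioceptive state $q_t$ and noisy action tokens.

```python
# Hypothetical sketch of the dual-system forward pass in Figure 3. Generic
# transformer layers stand in for the LLM backbone and DiT blocks; all names
# and sizes are illustrative assumptions.
import torch
import torch.nn as nn

D, N_VIS, N_QUERY, CHUNK, D_ACT = 256, 16, 16, 8, 7

class System2(nn.Module):
    """Predicts latent foresight x_t from language l_t, ViT features of o_t, and queries."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_QUERY, D))            # learnable queries
        self.backbone = nn.TransformerEncoder(                          # stand-in for the LLM
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))  # MLP head
    def forward(self, lang_tokens, vis_tokens):
        b = vis_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        seq = torch.cat([lang_tokens, vis_tokens, q], dim=1)
        out = self.backbone(seq)[:, -N_QUERY:]                          # take query outputs
        return self.head(out)                                           # latent foresight x_t

class System1(nn.Module):
    """Fuses current + foresight features and decodes an action chunk (latent inverse dynamics)."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.TransformerEncoder(                              # self-attention fusion
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        self.proprio_proj = nn.Linear(32, D)                            # project q_t
        self.decoder = nn.TransformerDecoder(                           # stand-in for DiT blocks
            nn.TransformerDecoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        self.act_in = nn.Linear(D_ACT, D)
        self.act_out = nn.Linear(D, D_ACT)
    def forward(self, vis_tokens, foresight, proprio, noisy_actions):
        cond = self.fuse(torch.cat([vis_tokens, foresight], dim=1))     # cross-attn condition
        tgt = torch.cat([self.proprio_proj(proprio).unsqueeze(1),
                         self.act_in(noisy_actions)], dim=1)
        out = self.decoder(tgt, cond)[:, 1:]                            # drop the proprio slot
        return self.act_out(out)                                        # predicted action chunk

# Toy usage with random tensors standing in for ViT features, text embeddings, and state.
vis = torch.randn(2, N_VIS, D)          # frozen shared ViT features of o_t
lang = torch.randn(2, 6, D)             # embedded instruction l_t
proprio = torch.randn(2, 32)            # proprioceptive state q_t
noisy_actions = torch.randn(2, CHUNK, D_ACT)

x_t = System2()(lang, vis)              # System-2: latent world modeling
actions = System1()(vis, x_t, proprio, noisy_actions)
print(actions.shape)                    # torch.Size([2, 8, 7])
```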