VITA: Vision-to-Action Flow Matching Policy

Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani

TL;DR

VITA rethinks visuomotor policy learning by removing visual conditioning from flow-based generation: the flow starts at a latent visual representation and ends in a learned latent action space. An action autoencoder bridges the dimensionality gap, while Flow Latent Decoding backpropagates the action reconstruction loss through the ODE solver to prevent latent collapse, enabling end-to-end training. The approach yields 1.5×–2× faster inference and 18.6%–28.7% lower memory usage, while achieving state-of-the-art or competitive success rates across 9 simulation and 5 real-world tasks (ALOHA and Robomimic). The work shows that a conditioning-free, MLP-based pipeline can handle high-precision visuomotor tasks, and that grid-based latents scale further to transformers without expensive conditioning modules.

Abstract

Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce this complexity, we develop VITA (VIsion-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that flows directly from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent space collapse, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equation) solving steps. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5×–2× faster inference than conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies. Code, datasets, and demos are available at our project page: https://ucd-dare.github.io/VITA/.
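
The core idea of flowing from a visual latent to an action latent of the same dimensionality can be illustrated with a small toy sketch. It is not the authors' implementation: it assumes a linear interpolation path between the two latents, a fixed-step Euler solver, and an "oracle" velocity field standing in for a trained network. All function names are hypothetical. Note that the velocity function receives only the current latent and the time, with no separate conditioning input.

```python
# Toy sketch (hypothetical names): flow matching between a "visual" latent z0
# and an "action" latent z1 of equal dimensionality, assuming a linear path.

def interpolate(z0, z1, t):
    """Point on the assumed linear path z_t = (1 - t) * z0 + t * z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def target_velocity(z0, z1):
    """Velocity of the linear path, d z_t / dt = z1 - z0 (constant in t)."""
    return [b - a for a, b in zip(z0, z1)]

def flow_matching_loss(velocity_fn, z0, z1, t):
    """Mean squared error between predicted and target velocity at time t."""
    z_t = interpolate(z0, z1, t)
    pred = velocity_fn(z_t, t)
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_solve(velocity_fn, z0, num_steps=6):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed-step Euler."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(z, k * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

# Stand-ins for the vision encoder output (z0) and the action encoder output (z1).
z0 = [0.5, -1.0, 2.0, 0.0]
z1 = [1.0, 0.0, -1.0, 3.0]

# An oracle velocity field that already matches the target; a trained network
# would only approximate it.
oracle = lambda z, t: target_velocity(z0, z1)

z1_hat = euler_solve(oracle, z0)  # generated latent action
```

With the oracle field, the flow matching loss is zero and the Euler solution lands on `z1` up to floating-point error; with a learned field, the gap between `z1_hat` and `z1` is what the decoder has to tolerate.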

Paper Structure

This paper contains 59 sections, 2 theorems, 11 equations, 21 figures, 12 tables.

Key Result

Theorem 1

Under the decoder regularity assumption, for any $\hat{\bm z}_1$ in the neighborhood, we have $m\,\|\hat{\bm z}_1-\bm z_1\| - \varepsilon_{\mathrm{AE}} \ \le\ \|\mathcal{D}_a(\hat{\bm z}_1)-A\| \ \le\ L\,\|\hat{\bm z}_1-\bm z_1\| + \varepsilon_{\mathrm{AE}}$. If $\varepsilon_{\mathrm{AE}}=0$, the minimizers

Figures (21)

  • Figure 1: A comparison between VITA and conventional flow matching and diffusion policies. Unlike conventional methods that sample noise from standard distributions and inject input modalities via conditioning, VITA poses no constraints on the source distribution, and flows directly from latent visual representations to latent actions, eliminating the need for conditioning modules.
  • Figure 2: An overview of the VITA architecture: The vision encoder maps observations into a source latent representation $\bm{z}_0$ for the flow; the action encoder provides a target latent representation $\bm{z}_1$ for flow matching training. The action decoder learns to decode $\hat{\bm{z}}_1$ (latent actions generated by solving ODEs) to actions via flow latent decoding losses, and decode $\bm{z}_1$ to actions (latent actions from action encoder) via autoencoder losses. The flow matching network learns the velocity field over a continuous flow matching path from $\bm{z}_0$ to $\bm{z}_1$.
  • Figure 3: Autonomous rollouts of VITA across 5 AV-ALOHA tasks (CubeTransfer, SlotInsertion, HookPackage, PourTestTube, ThreadNeedle), and 2 Robomimic tasks (Square, Can). Notably, the AV-ALOHA tasks demand high-precision control, such as accurately pouring a small ball into a narrow tube opening, or threading a needle through a tiny hole.
  • Figure 4: Autonomous rollouts of VITA on five challenging real-world tasks: two bimanual AV-ALOHA tasks using active vision (HiddenPick and TransferFromBox) and three single-arm ALOHA tasks (PickBall, ToothBrush, and StoreDrawer).
  • Figure 5: Comparison of reconstructed actions between (a) VITA and (b) VITA without FLD. Reconstruction fails without FLD because of latent space collapse.
  • ...and 16 more figures
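
The three loss terms named in the Figure 2 caption (flow matching, autoencoder, and flow latent decoding) can be laid out in a forward-only toy sketch. Everything here is an assumption made for illustration: the encoder, decoder, and velocity field are hand-written stand-ins, the path is taken to be linear, and the solver is fixed-step Euler. In real training an autodiff framework would backpropagate the flow latent decoding term through the solver steps; this sketch only shows which quantities each loss compares.

```python
# Forward-only toy sketch; every component below is a hypothetical stand-in.

def mse(x, y):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def action_encoder(action):
    """Hypothetical encoder: scale a 2-D action and zero-pad it to 4-D."""
    return [2.0 * a for a in action] + [0.0, 0.0]

def action_decoder(latent):
    """Hypothetical decoder: undo the scaling and drop the padding."""
    return [0.5 * latent[0], 0.5 * latent[1]]

def velocity_net(z, t):
    """Hypothetical, deliberately imperfect velocity field (constant in z, t)."""
    return [0.5, -2.0, 0.0, 0.0]

def euler_solve(z0, num_steps=4):
    """Fixed-step Euler integration of dz/dt = velocity_net(z, t) on [0, 1]."""
    z, dt = list(z0), 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_net(z, k * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

def vita_style_losses(z0, action, t=0.5):
    """Return the three loss terms for one (visual latent, action) pair."""
    z1 = action_encoder(action)                             # target latent
    z_t = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]   # assumed linear path
    fm = mse(velocity_net(z_t, t), [b - a for a, b in zip(z0, z1)])
    ae = mse(action_decoder(z1), action)                    # decode encoder latent
    fld = mse(action_decoder(euler_solve(z0)), action)      # decode ODE-generated latent
    return {"flow_matching": fm, "autoencoder": ae, "fld": fld}

z0 = [1.0, 0.0, 0.0, 0.0]   # stand-in for the vision encoder output
action = [1.0, -1.0]
losses = vita_style_losses(z0, action)
```

In this toy setup the autoencoder term is zero because the decoder exactly inverts the encoder, while the flow matching and flow latent decoding terms are both nonzero because the velocity field is imperfect. The latter term measures that imperfection in action space rather than in latent space.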

Theorems & Definitions (2)

  • Theorem 1: Local equivalence of FLD and FLC
  • Lemma 1: Local bi-Lipschitzness from Jacobian bounds
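
The two-sided bound stated in Theorem 1 can be sketched with the triangle inequality. This is a reading of the statement under two assumptions not spelled out on this page: that the action decoder $\mathcal{D}_a$ is locally bi-Lipschitz with constants $m \le L$ (as the title of Lemma 1 suggests), and that $\varepsilon_{\mathrm{AE}}$ bounds the autoencoder reconstruction error, $\|\mathcal{D}_a(\bm z_1)-A\| \le \varepsilon_{\mathrm{AE}}$.

$$
\begin{aligned}
\|\mathcal{D}_a(\hat{\bm z}_1)-A\| &\le \|\mathcal{D}_a(\hat{\bm z}_1)-\mathcal{D}_a(\bm z_1)\| + \|\mathcal{D}_a(\bm z_1)-A\| \le L\,\|\hat{\bm z}_1-\bm z_1\| + \varepsilon_{\mathrm{AE}}, \\
\|\mathcal{D}_a(\hat{\bm z}_1)-A\| &\ge \|\mathcal{D}_a(\hat{\bm z}_1)-\mathcal{D}_a(\bm z_1)\| - \|\mathcal{D}_a(\bm z_1)-A\| \ge m\,\|\hat{\bm z}_1-\bm z_1\| - \varepsilon_{\mathrm{AE}}.
\end{aligned}
$$

Under these assumptions, the action-space reconstruction error and the latent-space error $\|\hat{\bm z}_1-\bm z_1\|$ control each other up to the constants $m$, $L$ and the slack $\varepsilon_{\mathrm{AE}}$.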