Action-to-Action Flow Matching

Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, Jianfei Yang

TL;DR

Diffusion-based robot policies suffer from high inference latency because they sample actions from noise. Action-to-Action Flow Matching (A2A) replaces random noise initialization with a history-informed latent initialization and learns a latent-space flow that maps past actions to future actions, enabling efficient one-step generation. A2A achieves up to 20× faster training and 0.56 ms single-step inference latency, while improving robustness to visual perturbations and generalization to unseen configurations. An extension to video generation (F2F) demonstrates its versatility in temporal modeling. The approach offers real-time capability gains for robotic control and suggests broad applicability to other temporally coherent tasks.

Abstract

Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that creates a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.
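The abstract does not spell out the probability path or the training objective, so the following is only a minimal sketch under stated assumptions: it uses the straight-line path and velocity-regression loss common in flow matching, with the source point $\mathbf{z}_0$ taken from embedded past actions rather than Gaussian noise. All function names here are hypothetical and are not taken from the authors' code.

```python
def interpolate(z0, z1, t):
    """Point on the straight path from z0 to z1 at time t in [0, 1] (assumed path)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    """Velocity of the straight path, constant in t: z1 - z0."""
    return [b - a for a, b in zip(z0, z1)]


def flow_matching_loss(velocity_fn, z0, z1, t, cond):
    """Mean squared error between the predicted and target velocity at z_t."""
    zt = interpolate(z0, z1, t)
    pred = velocity_fn(zt, t, cond)
    tgt = target_velocity(z0, z1)
    return sum((p - g) ** 2 for p, g in zip(pred, tgt)) / len(tgt)


def euler_sample(velocity_fn, z0, cond, num_steps=1):
    """Integrate dz/dt = v(z, t, cond) from t=0 to t=1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(z, k * dt, cond)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy usage: z0 stands in for the latent of past actions, z1 for future actions.
z0 = [0.0, 1.0, -0.5]
z1 = [0.2, 0.8, 0.1]
# An oracle velocity field that already knows the target; a trained network
# would approximate this from (z_t, t, cond).
oracle = lambda z, t, cond: target_velocity(z0, z1)
z1_hat = euler_sample(oracle, z0, cond=None, num_steps=1)
```

With a velocity field that is constant along the path, a single Euler step already lands on the target, which is the intuition behind one-step generation when the source and target are close and the learned flow is nearly straight.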

Paper Structure (28 sections, 6 equations, 18 figures, 3 tables)

Figures (18)

  • Figure 1: Comparison of robotic policy paradigms. (a) Regression Policy: Deterministic mapping from multi-modal inputs to actions. (b) Diffusion Policy: Generative modeling via iterative denoising from Gaussian noise. (c) A2A Policy: Informed action generation through a structured flow between historical and future actions. Action-to-action transport is more efficient than noise-to-action transport, making one-step flow mapping feasible even with a lightweight MLP architecture.
  • Figure 2: Overview of the A2A architecture. The framework consists of three main components. 1) A condition path that encodes visual observations using a ResNet-18 backbone and a linear projector to generate a global condition $c$. 2) A source path that employs a CNN with a kernel size of 5 to compress the $n$-frame action history into a latent starting point $\mathbf{z}_0$. 3) A flow-based generation process. The flow net, built with AdaLN-MLP blocks, predicts the vector field that transports $\mathbf{z}_0$ to the target latent $\mathbf{z}_1$ within a unified 512-dimensional latent space. Finally, a residual MLP decoder transforms $\mathbf{z}_1$ into the future action sequence.
  • Figure 3: Simulation tasks. Simulations are conducted on the Roboverse platform (geng2025roboverse). The tasks are Stack Cube and Pick Cube from ManiSkill (mu2021maniskill), Close Box from RLBench (james2020rlbench), and Open Drawer and Pick-Place Bowl from LIBERO (liu2023libero). For the last two tasks, the camera is positioned farther back than in the initial setup, which reduces the clarity of the captured visual input and makes the tasks harder.
  • Figure 4: Training efficiency test. Left: Success rates across varying numbers of training epochs (using 100 demonstrations in the Close Box task). Right: Success rates across varying numbers of demonstrations (fixed at 100 epochs in the Stack Cube task).
  • Figure 5: Experimental results for the Pick Cube task. (a) Policies are trained on a limited dataset of 30 trajectories for 100 epochs. During evaluation, each method is tested over 10 trials. (b) Generalization capability is further challenged by replacing the target with an unseen glowing block. (c) The cube is picked from different locations with only 10 training demonstrations.
  • ...and 13 more figures
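
The Figure 2 caption says the flow net is built from AdaLN-MLP blocks but gives no further detail here. The sketch below shows one common form of such a block in plain Python: the condition vector produces a scale and a shift that modulate the layer-normalized latent before a residual MLP. The ReLU activation, the toy dimensions, the omission of a time input, and every name (`adaln_mlp_block`, `params`, and so on) are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b for a single vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def init_linear(rng, out_dim, in_dim, scale=0.1):
    """Random weights and zero bias for a toy linear layer."""
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def adaln_mlp_block(z, cond, params):
    """Residual block: z + MLP(LayerNorm(z) * (1 + scale(cond)) + shift(cond))."""
    dim = len(z)
    mod = linear(cond, *params["mod"])            # 2 * dim modulation values
    scale, shift = mod[:dim], mod[dim:]
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(z), scale, shift)]
    h = [max(0.0, v) for v in linear(h, *params["fc1"])]  # ReLU (assumed)
    h = linear(h, *params["fc2"])
    return [zi + hi for zi, hi in zip(z, h)]


# Toy sizes for readability; the caption reports a 512-dimensional latent space.
rng = random.Random(0)
dim, cond_dim, hidden = 4, 3, 8
params = {
    "mod": init_linear(rng, 2 * dim, cond_dim),
    "fc1": init_linear(rng, hidden, dim),
    "fc2": init_linear(rng, dim, hidden),
}
z = [0.5, -1.0, 0.25, 2.0]      # stands in for a latent such as z_0
cond = [1.0, 0.0, -1.0]         # stands in for the global condition c
out = adaln_mlp_block(z, cond, params)
```

Because the block is residual, zeroing the last linear layer makes it an identity map, which is a convenient sanity check when wiring up a stack of such blocks.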