Action-to-Action Flow Matching
Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, Jianfei Yang
TL;DR
Diffusion-based robot policies suffer from high inference latency because they sample actions from random noise over multiple denoising steps. Action-to-Action Flow Matching (A2A) replaces random noise initialization with a latent initialization informed by the action history, and learns a latent-space flow that maps past actions to future actions, enabling efficient one-step generation. A2A achieves up to 20× faster training and single-step inference at 0.56 ms latency, while improving robustness to visual perturbations and generalization to unseen configurations. An extension to video generation (F2F) demonstrates the versatility of the approach for temporal modeling. The results point to real-time gains for robotic control and suggest broad applicability to other temporally coherent tasks.
Abstract
Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, and the resulting inference latency is a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by previous actions. Unlike existing methods that treat proprioceptive action feedback as a static condition, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.
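
To make the idea of starting from past actions rather than noise concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear history encoder, a straight-line interpolation path between the source and target latents, and a velocity field with no visual conditioning. None of these choices are specified in the abstract. The names `encode_history`, `w_enc`, and `velocity_fn` are hypothetical.

```python
import numpy as np


def encode_history(past_actions, w_enc):
    """Embed a batch of action histories (B, H, D) into latents (B, L).

    Hypothetical linear encoder, assumed for illustration only.
    """
    batch = past_actions.shape[0]
    return past_actions.reshape(batch, -1) @ w_enc


def interpolate(z_src, z_tgt, t):
    """Point at time t in [0, 1] on the straight path from z_src to z_tgt."""
    return (1.0 - t) * z_src + t * z_tgt


def flow_matching_loss(velocity_fn, z_src, z_tgt, t):
    """MSE between the predicted velocity and the straight-path velocity.

    Under the assumed linear path, the regression target is z_tgt - z_src.
    """
    z_t = interpolate(z_src, z_tgt, t)
    target_velocity = z_tgt - z_src
    return float(np.mean((velocity_fn(z_t, t) - target_velocity) ** 2))


def one_step_generate(velocity_fn, z_src):
    """Single Euler step of size 1 from t = 0: z_src + v(z_src, 0)."""
    return z_src + velocity_fn(z_src, 0.0)
```

In this sketch the source of the flow is the encoded action history instead of a Gaussian sample. If consecutive action chunks are close in latent space, the path the velocity field must cover is short, which is one plausible reading of why a single integration step could suffice. A decoder from the generated latent back to actions would also be needed and is omitted here.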
