L1 Sample Flow for Efficient Visuomotor Learning

Weixi Song, Zhetao Chen, Tao Xu, Xianchao Zeng, Xinyu Zhou, Lixin Yang, Donglin Wang, Cewu Lu, Yong-Lu Li

TL;DR

L1 Flow tackles the tension between modeling multi-modal visuomotor demonstrations and achieving fast, scalable inference. By reformulating flow matching into a sample-prediction objective and introducing a two-step inference that first integrates to a midpoint and then directly predicts the terminal action, it captures multi-modality with only two neural function evaluations. Empirical results across MimicGen, RoboMimic/PushT, and real-world tasks show strong performance with substantial speedups (10–70× faster inference) and competitive training efficiency versus fast denoising baselines. The approach offers a practical, end-to-end alternative for real-time robotic manipulation that preserves distributional expressiveness while improving deployment efficiency.

Abstract

Denoising-based models, such as diffusion and flow matching, have been a critical component of robotic manipulation because of their strong distribution-fitting and scaling capacity. Concurrently, several works have demonstrated that simple learning objectives, such as L1 regression, can achieve performance comparable to denoising-based methods on certain tasks while offering faster convergence and inference. In this paper, we focus on how to combine the advantages of these two paradigms: retaining the ability of denoising models to capture multi-modal distributions and avoid mode collapse, while achieving the efficiency of the L1 regression objective. To this end, we reformulate the original v-prediction flow matching as sample-prediction with an L1 training objective. We empirically show that multi-modality can be expressed via a single ODE step. We therefore propose L1 Flow, a two-step sampling schedule that generates a suboptimal action sequence via a single integration step and then reconstructs the precise action sequence through a single prediction. The proposed method largely retains the advantages of flow matching while reducing the number of neural function evaluations to two and mitigating the potential performance degradation associated with direct sample regression. We evaluate our method against a range of baselines and benchmarks, including 8 tasks in MimicGen, 5 tasks in RoboMimic & PushT Bench, and one real-world task. The results show the advantages of the proposed method in training efficiency, inference speed, and overall performance. Project website: https://song-wx.github.io/l1flow.github.io/
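The two-step schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common linear path $x_t = (1 - t)\,x_0 + t\,x_1$ with $x_0$ as noise and $x_1$ as the action, so that a sample prediction $\hat{x}_1$ converts to a velocity as $(\hat{x}_1 - x_t)/(1 - t)$. The names `predict_x1`, `two_step_sample`, and `l1_sample_loss` are hypothetical, and observation conditioning is omitted for brevity.

```python
import numpy as np

def two_step_sample(predict_x1, x0, t_mid=0.5):
    """Two network calls: one Euler step to t_mid, then one direct prediction.

    predict_x1(x_t, t) is a hypothetical model interface that returns an
    estimate of the clean action x_1 given the noisy input x_t at time t.
    """
    # Call 1: predict the sample from pure noise (t = 0) and convert it to a
    # velocity under the assumed linear path x_t = (1 - t) * x0 + t * x1.
    x1_hat = predict_x1(x0, 0.0)
    velocity = (x1_hat - x0) / (1.0 - 0.0)
    # Single Euler integration step from t = 0 to t = t_mid (coarse action).
    x_mid = x0 + t_mid * velocity
    # Call 2: directly predict the terminal action from the coarse action.
    return predict_x1(x_mid, t_mid)

def l1_sample_loss(predict_x1, x0, x1, t):
    """L1 loss between the predicted sample and the ground-truth action."""
    x_t = (1.0 - t) * x0 + t * x1  # assumed linear interpolation
    return np.mean(np.abs(predict_x1(x_t, t) - x1))
```

With a real policy, `predict_x1` would also take the visual observation as input; the point of the sketch is only that sampling costs exactly two evaluations of the network.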

Paper Structure

This paper contains 25 sections, 16 equations, 11 figures, 7 tables, 2 algorithms.

Figures (11)

  • Figure 1: Overview of the proposed method. L1 Flow employs a 2-step denoising paradigm that combines the efficiency of L1 regression with the strong distribution-modeling capacity of standard flow matching. In contrast to the iterative denoising process of standard flow matching and the direct mapping of L1 regression, L1 Flow decouples the modeling of the multi-modal distribution from the reconstruction of precise actions. Starting from random noise, L1 Flow performs one integration step toward the middle timestep and then predicts the precise action $x_1$ from the coarse action $x_{0.5}$; both steps rely on the reformulated sample-prediction flow matching.
  • Figure 2: Visualization of the sample distribution. We apply the proposed one-step integration to model two sine curves with different phases and compare it with L1 regression. (a) One-step integration effectively captures the multi-modality. (b) Direct regression outputs the average of the two modes, i.e., mode collapse.
  • Figure 3: PDF of the mixture of a Logistic Normal distribution and a Uniform distribution. The Logistic Normal distribution emphasizes sampling around intermediate timesteps, while we additionally incorporate a low-weight uniform distribution to ensure that the probabilities at the boundary timesteps remain non-zero.
  • Figure 4: The trend of the maximum success rate throughout training. We compare our method with baselines on the 8 MimicGen tasks and report the maximum success rate among 50 evaluations throughout training. The overall results demonstrate that our method matches or surpasses the baselines with only two neural function evaluations (NFE), while also exhibiting higher training efficiency: its success rate saturates in fewer training steps. L1 Flow largely retains the advantages of flow matching with a smaller inference budget while mitigating the potential performance degradation associated with direct L1 regression.
  • Figure 5: The real-world evaluation setup. The evaluation is conducted on the AGILEX Mobile Aloha platform, and we establish a two-stage task that emphasizes both modeling action multi-modality and prediction precision. (a) Stage 1: The robot is required to pick up the bowl with either the left or the right arm, requiring the policy to model the multi-modal action distribution. (b) Stage 2: The two arms are required to move the grasped bowl on top of the other bowls and place it. This stage emphasizes the transition from multi-modal to single-modal action prediction and also demands action precision, requiring the policy to accurately localize the bowl’s edge. (c) Performance comparison with the standard diffusion policy, including DDPM (100 steps) and DDIM (16 steps), on the real-world task. The results show that L1 Flow achieves performance comparable to the baselines with superior inference speed, achieving about a 10-70$\times$ speed-up.
  • ...and 6 more figures
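
The timestep distribution described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch only: the mixture weight `uniform_weight` and the `mean` and `std` of the underlying normal are placeholder values, not the paper's settings, and the function names are hypothetical. The Logistic Normal component is taken here to be the sigmoid of a Gaussian sample.

```python
import numpy as np

def sample_timesteps(n, rng, uniform_weight=0.1, mean=0.0, std=1.0):
    """Draw n timesteps in [0, 1] from a Logistic Normal / Uniform mixture."""
    # Logistic Normal: sigmoid of a Gaussian, concentrated around mid timesteps.
    logistic_normal = 1.0 / (1.0 + np.exp(-rng.normal(mean, std, size=n)))
    uniform = rng.uniform(0.0, 1.0, size=n)
    # Pick the uniform component with probability uniform_weight.
    use_uniform = rng.uniform(size=n) < uniform_weight
    return np.where(use_uniform, uniform, logistic_normal)

def mixture_pdf(t, uniform_weight=0.1, mean=0.0, std=1.0):
    """Density of the mixture on [0, 1]; equals uniform_weight at t = 0 and t = 1."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # The uniform component contributes a constant density on [0, 1].
    pdf = np.full(t.shape, float(uniform_weight))
    inside = (t > 0.0) & (t < 1.0)
    ti = t[inside]
    logit = np.log(ti / (1.0 - ti))
    logistic_normal = np.exp(-((logit - mean) ** 2) / (2.0 * std**2)) / (
        std * np.sqrt(2.0 * np.pi) * ti * (1.0 - ti)
    )
    pdf[inside] += (1.0 - uniform_weight) * logistic_normal
    return pdf
```

Evaluating `mixture_pdf` at the boundaries shows the property the caption highlights: the Logistic Normal term vanishes there, so the density stays at the uniform weight rather than dropping to zero.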