L1 Sample Flow for Efficient Visuomotor Learning

Weixi Song; Zhetao Chen; Tao Xu; Xianchao Zeng; Xinyu Zhou; Lixin Yang; Donglin Wang; Cewu Lu; Yong-Lu Li

TL;DR

L1 Flow tackles the tension between modeling multi-modal visuomotor demonstrations and achieving fast, scalable inference. By reformulating flow matching into a sample-prediction objective and introducing a two-step inference that first integrates to a midpoint and then directly predicts the terminal action, it captures multi-modality with only two neural function evaluations. Empirical results across MimicGen, RoboMimic/PushT, and real-world tasks show strong performance with substantial speedups (10–70× faster inference) and competitive training efficiency versus fast denoising baselines. The approach offers a practical, end-to-end alternative for real-time robotic manipulation that preserves distributional expressiveness while improving deployment efficiency.

Abstract

Denoising-based models, such as diffusion and flow matching, have been a critical component of robotic manipulation for their strong distribution-fitting and scaling capacity. Concurrently, several works have demonstrated that simple learning objectives, such as L1 regression, can achieve performance comparable to denoising-based methods on certain tasks, while offering faster convergence and inference. In this paper, we focus on how to combine the advantages of these two paradigms: retaining the ability of denoising models to capture multi-modal distributions and avoid mode collapse while achieving the efficiency of the L1 regression objective. To this end, we reformulate the original v-prediction flow matching as sample prediction with an L1 training objective. We empirically show that multi-modality can be expressed via a single ODE step. We therefore propose L1 Flow, a two-step sampling schedule that generates a suboptimal action sequence via a single integration step and then reconstructs the precise action sequence through a single prediction. The proposed method largely retains the advantages of flow matching while reducing the number of neural function evaluations to just two and mitigating the potential performance degradation associated with direct sample regression. We evaluate our method against various baselines on several benchmarks, including 8 tasks in MimicGen, 5 tasks in RoboMimic & PushT Bench, and one real-world task. The results show the advantages of the proposed method in training efficiency, inference speed, and overall performance. Project website: https://song-wx.github.io/l1flow.github.io/
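The training objective and two-step sampling schedule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a linear interpolation path x_t = (1 - t) * noise + t * action, a midpoint at t = 0.5, and a predictor with the hypothetical signature predict(x_t, t, obs); the abstract specifies none of these details. The names interpolate, l1_sample_loss, two_step_sample, and CountingPredictor are illustrative only.

```python
import numpy as np


def interpolate(noise, action, t):
    """Assumed linear path: pure noise at t=0, clean action at t=1."""
    return (1.0 - t) * noise + t * action


def l1_sample_loss(predict, noise, action, t, obs):
    """Sample-prediction objective: L1 distance between the predicted
    clean action and the demonstrated action (not a velocity target)."""
    x_t = interpolate(noise, action, t)
    return float(np.mean(np.abs(predict(x_t, t, obs) - action)))


def two_step_sample(predict, noise, obs, t_mid=0.5):
    """Two network evaluations: one Euler step to t_mid, then a direct prediction."""
    # Evaluation 1: predict the clean action from pure noise at t=0.
    x1_hat = predict(noise, 0.0, obs)
    # Under the assumed linear path, velocity = (x1_hat - x_t) / (1 - t); here t=0.
    velocity = (x1_hat - noise) / (1.0 - 0.0)
    x_mid = noise + t_mid * velocity
    # Evaluation 2: predict the terminal action directly from the midpoint.
    return predict(x_mid, t_mid, obs)


class CountingPredictor:
    """Toy stand-in for a trained network: always returns a fixed action
    chunk and counts how many times it is called."""

    def __init__(self, target):
        self.target = target
        self.calls = 0

    def __call__(self, x_t, t, obs):
        self.calls += 1
        return self.target


rng = np.random.default_rng(0)
target = np.array([[0.1, -0.2], [0.3, 0.4]])  # hypothetical 2-step, 2-DoF action chunk
predictor = CountingPredictor(target)
noise = rng.standard_normal(target.shape)
action = two_step_sample(predictor, noise, obs=None)
```

With this toy predictor, sampling returns the fixed target after exactly two calls. A trained network would instead condition on the observation, and different noise draws could land on different modes of the demonstrated behavior.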
