
LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

Xinkai Wang, Chenyi Wang, Yifu Xu, Mingzhe Ye, Fu-Cheng Zhang, Jialin Tian, Xinyu Zhan, Lifeng Zhu, Cewu Lu, Lixin Yang

Abstract

We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly; this implicit strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention: the Motion Expert produces a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as in real-world experiments. LaMP consistently outperforms the evaluated VLA baselines across all three benchmarks, achieving the highest reported average success rates under matched training budgets. Under LIBERO-Plus out-of-distribution perturbations, LaMP improves robustness by an average of 9.7% over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
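To make the gated fusion concrete, the following PyTorch module is a minimal sketch of gated cross-attention between the two experts. The module name, dimensions, and the zero-initialized tanh gate are our assumptions for illustration, not LaMP's released implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Action-Expert tokens attend to Motion-Expert hidden states;
    a learned gate controls how much motion context is injected."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # tanh gate initialized at zero: training starts from the
        # unconditioned policy and gradually admits the motion prior
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, action_tokens, motion_hidden):
        # action_tokens: (B, Na, dim); motion_hidden: (B, Nm, dim)
        q = self.norm_q(action_tokens)
        kv = self.norm_kv(motion_hidden)
        attn_out, _ = self.attn(q, kv, kv, need_weights=False)
        # residual injection of motion context, scaled by the gate
        return action_tokens + torch.tanh(self.gate) * attn_out
```

Zero-initializing the gate is a common choice for this kind of conditioning, since it preserves the pretrained policy's behavior at the start of post-training.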

Paper Structure

This paper contains 30 sections, 5 equations, 6 figures, 4 tables, and 2 algorithms.

Figures (6)

  • Figure 1: LaMP: a Vision-Language-Action Model with Latent Motion Prior. LaMP introduces a dense 3D motion prior between VLM perception and action generation. The Motion Expert provides geometric foresight via one-step denoised features, which guide the Action Expert through gated cross-attention for physically grounded manipulation. LaMP achieves superior performance across real-world tasks and simulation benchmarks, significantly outperforming prior VLAs.
  • Figure 2: Overview of LaMP. (a) Motion Pre-training: the Motion Expert learns 3D scene flow prediction conditioned on VLM features. (b) Action Post-training: the frozen Motion Expert provides one-step denoised features that fuse with VLM representations via Gated Cross-Attention (a sketch of this one-step interface follows this figure list). (c) Data Curation: 1.6M observation-language-motion triplets from diverse robot embodiments.
  • Figure 3: Ablation study on the SimplerEnv-WidowX benchmark. Success rates of LaMP are compared against variants without motion priors and with 2D flow priors.
  • Figure 4: Ablation study on fusion strategies. Success rates of Gated Cross-Attention are compared against Concat-MLP and Add variants across four manipulation tasks.
  • Figure 5: Real-world experiment platform and task overview. (a) In-domain tasks: pick-and-place, deformable manipulation, and long-horizon tasks. (b) Top: hardware setup with a Flexiv Rizon 4 arm, Robotiq 2F-85 gripper, and Intel D415 camera. Bottom: OOD test conditions: unseen layout, object, and background. (c) Visualizations of motion foresight: predicted motion trajectories are overlaid on the current observation, with the color gradient from blue to red indicating the temporal progression of the predicted motion.
  • ...and 1 more figure
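The one-step denoising interface referenced in the Figure 2 caption can be illustrated with a short sketch. This is a minimal Python example assuming a rectified-flow-style Motion Expert; the `motion_expert` call signature, point count, and step size are hypothetical stand-ins, and per the abstract it is the expert's hidden states, not the reconstructed flow, that condition the Action Expert.

```python
import torch

@torch.no_grad()
def one_step_motion_features(motion_expert, vlm_features,
                             num_points=1024, t0=0.0, step=0.5):
    """Run a single Euler step of flow-matching denoising and return the
    Motion Expert's hidden states as the latent motion prior."""
    b = vlm_features.shape[0]
    # start from Gaussian noise over per-point 3D flow vectors
    x_t = torch.randn(b, num_points, 3, device=vlm_features.device)
    t = torch.full((b,), t0, device=vlm_features.device)
    # assumed interface: the expert predicts a velocity field and
    # exposes its intermediate activations
    velocity, hidden = motion_expert(x_t, t, cond=vlm_features)
    x_partial = x_t + step * velocity  # partially denoised 3D scene flow
    return hidden  # fed to the Action Expert via gated cross-attention
```

The point of the sketch is the cost asymmetry: a single velocity evaluation replaces the full multi-step reconstruction, so the motion prior adds only one extra forward pass at inference time.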