Learning Generalizable Visuomotor Policy through Dynamics-Alignment

Dohyeok Lee, Jung Min Lee, Munkyung Kim, Seokhun Ju, Jin Woo Koo, Kyungjae Lee, Dohyeong Kim, TaeHyun Cho, Jungwoo Lee

TL;DR

This work tackles the generalization gap in behavior-cloned robotic policies by introducing the Dynamics-Aligned Flow Matching Policy (DAP), which explicitly models action-conditioned dynamics and couples dynamics prediction with policy generation through shared flow samples. A dynamics model and a policy are both trained in a flow-matching framework and correct each other during generation via dynamics alignment and flow extrapolation. Across four real-world manipulation tasks, DAP reaches a 75.0% average success rate, compared with 62.5% for the strongest baseline (DP-C), and it is the only evaluated method that succeeds on the challenging Cup Arrangement task (30%). It also achieves the highest average success rate (36.7%) under out-of-distribution conditions covering novel objects, random lighting, and visual distractors, with minimal computational overhead and real-time inference. Overall, DAP offers a practical, data-efficient path to more generalizable visuomotor control without requiring vast, task-agnostic pretraining.

Abstract

Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in OOD scenarios including visual distractions and lighting variations.

Paper Structure

This paper contains 24 sections, 6 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: The dynamics model $f(o_t,a_t)$ (left) learns from mixed expert and random data using a flow-matching VAE-DiT architecture with cross-attention conditioning. The policy $\pi(a_t|o_t,o_{t+1})$ generates actions through flow matching with a cross-attention U-Net architecture. Flow extrapolation (blue arrows) enables mutual correction: flow samples at timestep $\tau$ are extrapolated to predict $o_{t+1}$ and $a_t$, aligning action generation with dynamics predictions. This iterative parallel generation ensures consistent and robust behaviors while maintaining real-time performance.
  • Figure 2: Top: Quantitative evaluation of extrapolated samples during flow generation. The peak signal-to-noise ratio (PSNR) of the extrapolated next observation at each flow timestep $\tau$ is shown (Cam. 1 Ext. PSNR and Cam. 2 Ext. PSNR) alongside the PSNR of the final prediction (Cam. 1 PSNR and Cam. 2 PSNR). Likewise, the mean squared error (MSE) of the extrapolated action at each $\tau$ (Act. Ext. MSE) is shown alongside the MSE of the final prediction (Act. MSE). Evaluation is done on the entire validation dataset. Bottom: Qualitative samples of the extrapolated next observation at each flow timestep $\tau$.
  • Figure 3: Top: Success rates across four real-world manipulation tasks. DAP achieves an average success rate of 75.0%, outperforming DP-C (62.5%), DP-T (57.5%), UVA (32.5%), and $\pi_0$ (30.0%). Notably, DAP is the only method that successfully performs the challenging Cup Arrangement task (30%). Bottom: Task setup for each manipulation scenario. Lower Right: Comparison of trainable parameters, showing that DAP (369M) maintains a model size comparable to the baselines.
  • Figure 4: Top: Success rate under three OOD scenarios: Novel Object, Random Light, and Visual Distractor. DAP achieves the highest average success rate (36.7%). Bottom: Representative test scenarios. Lower Right: OOD objects used in evaluation.
  • Figure 5: Ablation study with flow matching policy and dynamics model
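
The flow extrapolation named in the Figure 1 caption, and the per-$\tau$ PSNR and MSE curves described for Figure 2, can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes a linear (rectified-flow style) path $x_\tau = (1-\tau)x_0 + \tau x_1$ with $\tau=0$ as noise and $\tau=1$ as data, which the captions do not confirm. The function names, the toy vectors, and the constant velocity error are all hypothetical.

```python
import math


def interpolate(x0, x1, tau):
    """Point on the assumed linear path between noise x0 and data x1."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(x0, x1)]


def extrapolate(x_tau, v_tau, tau):
    """One-step estimate of the final sample: x1_hat = x_tau + (1 - tau) * v_tau."""
    return [x + (1.0 - tau) * v for x, v in zip(x_tau, v_tau)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for signals with peak value max_val."""
    err = mse(pred, target)
    return math.inf if err == 0 else 10.0 * math.log10(max_val ** 2 / err)


def extrapolation_curve(x0, x1, velocity_error, taus):
    """Score the extrapolated sample at each tau.

    The velocity is the exact path velocity (x1 - x0) plus a constant
    additive error, a hypothetical stand-in for an imperfect learned model.
    """
    rows = []
    for tau in taus:
        x_tau = interpolate(x0, x1, tau)
        v_tau = [(b - a) + velocity_error for a, b in zip(x0, x1)]
        x1_hat = extrapolate(x_tau, v_tau, tau)
        rows.append((tau, mse(x1_hat, x1), psnr(x1_hat, x1)))
    return rows


if __name__ == "__main__":
    noise = [0.9, 0.1, 0.4, 0.7]   # toy stand-in for x0
    target = [0.2, 0.8, 0.5, 0.3]  # toy stand-in for x1 (action or pixels)
    for tau, err, quality in extrapolation_curve(noise, target, 0.1, [0.0, 0.25, 0.5, 0.75]):
        print(f"tau={tau:.2f}  MSE={err:.5f}  PSNR={quality:.2f} dB")
```

In this toy setup the extrapolated sample differs from the target by $(1-\tau)$ times the velocity error, so the MSE falls and the PSNR rises as $\tau$ grows. With an exact velocity, the extrapolation recovers the target at every $\tau$.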