From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation

Ju Dong, Liding Zhang, Lei Zhang, Yu Fu, Kaixin Bai, Zoltan-Csaba Marton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

TL;DR

This work distills a Conditional Flow Matching expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE) and introduces a unified perception encoder that integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation.

Abstract

Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
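
To make the set-level objective concrete, the following is a minimal NumPy sketch of a bi-directional Chamfer distance between a set of student samples and a set of teacher samples. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of squared Euclidean distance, the equal weighting of the two directions, and the flattening of each trajectory into a single vector are all assumptions, since the abstract does not specify these details.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (N, D) and rows of b (M, D)."""
    diff = a[:, None, :] - b[None, :, :]  # shape (N, M, D)
    return np.sum(diff ** 2, axis=-1)     # shape (N, M)


def bidirectional_chamfer(student, teacher):
    """Hypothetical helper: returns (total, fidelity, coverage).

    fidelity: mean distance from each student sample to its nearest teacher sample.
    coverage: mean distance from each teacher sample to its nearest student sample.
    Equal weighting of the two terms is an assumption made for this sketch.
    """
    d = pairwise_sq_dists(student, teacher)
    fidelity = d.min(axis=1).mean()
    coverage = d.min(axis=0).mean()
    return fidelity + coverage, fidelity, coverage


# Toy teacher set with two modes (each row stands in for a flattened trajectory).
teacher = np.array([[-1.0, -1.0], [1.0, 1.0]])

# A student that averages the modes, one that drops a mode, and one that matches both.
averaged = np.array([[0.0, 0.0], [0.0, 0.0]])
one_mode = np.array([[1.0, 1.0], [1.0, 1.0]])
matched = teacher.copy()

for name, student in [("averaged", averaged), ("one_mode", one_mode), ("matched", matched)]:
    total, fidelity, coverage = bidirectional_chamfer(student, teacher)
    print(f"{name}: total={total:.1f} fidelity={fidelity:.1f} coverage={coverage:.1f}")
```

In this toy case the student that drops a mode has zero fidelity error but a non-zero coverage error, so a one-directional distance from student to teacher alone would not penalize it. Only the student that reproduces both modes drives both terms to zero, which is the behavior the abstract attributes to the bi-directional objective.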

Paper Structure (28 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of the distribution-level distillation framework and data diversity. Top: Teacher–student distillation framework and student-generated multi-modal trajectory distributions at inference time. Bottom: Example manipulation tasks demonstrating diverse trajectory data collected during training.
  • Figure 2: Detailed training pipeline. A unified multimodal encoder conditions both the ODE-based CFM teacher and the single-step student. The student is distilled from diverse teacher trajectories using set-level IMLE with a bi-directional Chamfer objective.
  • Figure 3: Qualitative results on RLBench tasks. Tasks are ordered from left to right and top to bottom: close door, open box, open fridge, open oven, take frame off hanger, unplug charger, take shoes out of box, and put books on bookshelf.
  • Figure 4: Three real-world setups. Cameras 1–3 correspond to the back, wrist, and external cameras, respectively.
  • Figure 5: Mode-Collapse Failures of the 1-Step Baseline.