Motion Manifold Flow Primitives for Task-Conditioned Trajectory Generation under Complex Task-Motion Dependencies

Yonghyeon Lee; Byeongho Lee; Seungyeon Kim; Frank C. Park

TL;DR

The paper addresses generating diverse, task-conditioned trajectories from a small number of demonstrations, particularly when language commands induce complex task-motion dependencies. It introduces Motion Manifold Flow Primitives (MMFP), which decouple motion manifold learning from conditional density modeling and apply flow matching in the latent space of the learned trajectory manifold. A text encoder followed by a learned embedding maps a text command to a task embedding $\tau$; latent codes are then evolved under a conditioned velocity field $v_s(z,\tau)$ and decoded into high-dimensional trajectories. On SE(3) pouring and 7-DoF waving tasks with few demonstrations, MMFP outperforms diffusion-based and earlier manifold-based approaches, producing accurate, diverse, and smooth global trajectories suitable for real robots. Future work could extend the method to visual inputs and continuous-time representations.

Abstract

Effective movement primitives should be capable of encoding and generating a rich repertoire of trajectories -- typically collected from human demonstrations -- conditioned on task-defining parameters such as vision or language inputs. While recent methods based on the motion manifold hypothesis, which assumes that a set of trajectories lies on a lower-dimensional nonlinear subspace, address challenges such as limited dataset size and the high dimensionality of trajectory data, they often struggle to capture complex task-motion dependencies, i.e., when motion distributions shift drastically with task variations. To address this, we introduce Motion Manifold Flow Primitives (MMFP), a framework that decouples the training of the motion manifold from task-conditioned distributions. Specifically, we employ flow matching models, state-of-the-art conditional deep generative models, to learn task-conditioned distributions in the latent coordinate space of the learned motion manifold. Experiments are conducted on language-guided trajectory generation tasks, where many-to-many text-motion correspondences introduce complex task-motion dependencies, highlighting MMFP's superiority over existing methods.
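
The abstract's second stage, learning task-conditioned distributions over latent coordinates with flow matching, can be illustrated with a minimal sketch. It assumes the common linear-interpolation form of the flow matching objective, which this page does not confirm as the paper's exact choice. The names `flow_matching_loss` and `oracle_velocity`, the signature `velocity(s, z, tau)`, and the toy data are all hypothetical; the latent codes stand in for the outputs of an already trained motion encoder.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: velocity(s, z, tau) returns dz/ds at flow time s.
VelocityField = Callable[[float, Sequence[float], Sequence[float]], Vector]


def flow_matching_loss(
    velocity: VelocityField,
    latents: Sequence[Sequence[float]],
    task_embeddings: Sequence[Sequence[float]],
    rng: random.Random,
) -> float:
    """One-sample Monte Carlo estimate of a linear-path flow matching loss.

    For each latent code z1 (assumed to come from a trained motion encoder)
    and its task embedding tau, draw noise z0 ~ N(0, I) and a time s in
    [0, 1), form z_s = (1 - s) * z0 + s * z1, and regress the velocity field
    onto the straight-line target z1 - z0.
    """
    total = 0.0
    for z1, tau in zip(latents, task_embeddings):
        s = rng.random()
        z0 = [rng.gauss(0.0, 1.0) for _ in z1]
        z_s = [(1.0 - s) * a + s * b for a, b in zip(z0, z1)]
        target = [b - a for a, b in zip(z0, z1)]
        pred = velocity(s, z_s, tau)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(latents)


def oracle_velocity(s: float, z: Sequence[float], tau: Sequence[float]) -> Vector:
    # Toy case only: each task's latent code equals its embedding tau, so the
    # exact straight-line velocity is (tau - z) / (1 - s).
    return [(t - zi) / (1.0 - s) for t, zi in zip(tau, z)]


# Toy data: three latent codes, each used as its own task embedding.
latents = [[1.0, -2.0], [0.5, 0.5], [3.0, 0.0]]
embeddings = [list(z) for z in latents]
loss = flow_matching_loss(oracle_velocity, latents, embeddings, random.Random(0))
```

In practice the velocity field would be a neural network trained by minimizing this loss over minibatches; because the manifold is trained separately, only the low-dimensional latent model has to capture the task dependence.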

Paper Structure (16 sections, 7 equations, 11 figures, 3 tables)

Introduction
Related Works
Movement primitives
Diffusion and flow matching for imitation learning
Preliminaries
Autoencoder-based manifold learning
Flow matching models
Motion Manifold Flow Primitives
Limitations of existing motion manifold primitives
Decoupling manifold learning and conditional densities
Language as a task parameter
Experiments
Latent diffusion vs latent flow matching
SE(3) pouring motion generation
7-DoF waving motion generation
...and 1 more section

Figures (11)

Figure 1: An illustrative example of a language-guided navigation scenario with complex task-motion dependencies: The motion distribution shifts dramatically -- for instance, in the number of modalities -- when the task parameter (in this case, a language command) changes.
Figure 2: The motion generation procedure in MMFP: (i) Sentence-BERT encodes a free-form text into a vector $c$, (ii) the text embedding model $h$ maps $c$ to a text embedding vector $\tau$, (iii) we solve the ODE $z'=v_s(z,\tau)$ from $s=0$ to $s=1$ with an initial value $z_0$ sampled from the Gaussian ${\cal N}(z|0,I)$ and obtain $z_1 \in {\cal Z}$, and (iv) the motion decoder $f$ maps $z_1$ to a trajectory $x=(q_1,\ldots,q_T)$.
Figure 3: Demonstration trajectories with multiple text annotations. Each trajectory is assigned two text labels: a level 1 text and a level 2 text.
Figure 4: Evolution of generated trajectories (from left to right) under each latent model, diffusion or flow; the trajectories on the far right are the final output samples.
Figure 5: Left: Example demonstration trajectories, each annotated with texts at three different levels. Right: Pouring trajectories generated by RFM, TCVAE, MMP, and MMFP.
...and 6 more figures
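
The four-step generation procedure in the Figure 2 caption can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes a fixed-step forward Euler solver for the ODE (the caption does not name a solver), and every function here (`euler_integrate`, `generate_motion`, and the `toy_*` stand-ins for Sentence-BERT, $h$, $v_s$, and $f$) is hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def euler_integrate(
    velocity: Callable[[float, Sequence[float], Sequence[float]], Vector],
    z0: Sequence[float],
    tau: Sequence[float],
    n_steps: int = 100,
) -> Vector:
    """Integrate z' = v_s(z, tau) from s = 0 to s = 1 with forward Euler."""
    z = list(z0)
    ds = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity(k * ds, z, tau)
        z = [zi + ds * vi for zi, vi in zip(z, v)]
    return z


def generate_motion(text, encode_text, embed, velocity, decode,
                    latent_dim: int, n_steps: int = 100,
                    rng: Optional[random.Random] = None):
    rng = rng or random.Random()
    c = encode_text(text)                                   # (i) text -> c
    tau = embed(c)                                          # (ii) tau = h(c)
    z0 = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]   # z0 ~ N(0, I)
    z1 = euler_integrate(velocity, z0, tau, n_steps)        # (iii) solve ODE
    return decode(z1)                                       # (iv) x = f(z1)


# Toy stand-ins (assumptions for illustration, not trained models).
def toy_encode(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


def toy_embed(c: Sequence[float]) -> Vector:
    return [x / 10.0 for x in c]


def toy_velocity(s: float, z: Sequence[float], tau: Sequence[float]) -> Vector:
    # Constant drift set by the task embedding; a real model would be a network.
    return [tau[i % len(tau)] for i in range(len(z))]


def toy_decode(z: Sequence[float], horizon: int = 5) -> List[Vector]:
    # Produces `horizon` waypoints q_1..q_T by scaling the latent code.
    return [[zi * (t + 1) / horizon for zi in z] for t in range(horizon)]


trajectory = generate_motion("wave your hand", toy_encode, toy_embed,
                             toy_velocity, toy_decode, latent_dim=3,
                             n_steps=50, rng=random.Random(0))
```

Drawing a different $z_0$ for the same text yields a different trajectory, which is how a single command can map to a multimodal set of motions.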
