Mimic Intent, Not Just Trajectories

Renming Huang; Chendong Zeng; Wenjing Tang; Jingtian Cai; Cewu Lu; Panpan Cai

TL;DR

MINT tackles the generalization gap in Vision-Language-Action imitation learning by explicitly separating high-level behavioral intent from low-level execution. It introduces a Spectrally Disentangled Action Tokenizer (SDAT) that uses multi-scale frequency-space tokens to disentangle global intent (coarsest scale) from execution details (finer scales), and a next-scale autoregressive policy that reasons from intent to action. The framework supports one-shot skill transfer by injecting an explicit intent token, and it reports state-of-the-art performance on LIBERO, CALVIN, and MetaWorld, along with robust generalization under disturbances and real-world transfer from limited demonstrations. The authors position this as a principled, scalable path toward planning-enabled imitation learning for robotic manipulation.
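The intent-injection idea can be illustrated with a toy coarse-to-fine generation loop. This is only a minimal sketch under stated assumptions, not the paper's policy: `generate_tokens`, `toy_predictor`, the token ids, and the 1/2/4 tokens-per-scale layout are all hypothetical, and the learned transformer is replaced by a deterministic stand-in so the control flow is easy to follow.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature: (observation, tokens generated so far) -> token ids for the next scale.
ScalePredictor = Callable[[str, List[List[int]]], List[int]]


def generate_tokens(
    observation: str,
    predict_scale: ScalePredictor,
    num_scales: int,
    intent_token: Optional[Sequence[int]] = None,
) -> List[List[int]]:
    """Generate tokens coarse to fine; optionally override the coarsest scale."""
    tokens: List[List[int]] = []
    for scale in range(num_scales):
        if scale == 0 and intent_token is not None:
            # One-shot transfer: use the intent token taken from a demonstration
            # instead of predicting it.
            tokens.append(list(intent_token))
        else:
            # Every finer scale is conditioned on all coarser scales.
            tokens.append(predict_scale(observation, tokens))
    return tokens


def toy_predictor(observation: str, prefix: List[List[int]]) -> List[int]:
    """Stand-in for a learned policy (assumption): 2**scale ids that depend on the intent id."""
    scale = len(prefix)
    intent_id = prefix[0][0] if prefix else 7  # 7 is an arbitrary default intent
    return [(intent_id + scale + i) % 16 for i in range(2 ** scale)]


free_run = generate_tokens("obs", toy_predictor, num_scales=3)
injected = generate_tokens("obs", toy_predictor, num_scales=3, intent_token=[3])
print(free_run)
print(injected)
```

Because finer scales are predicted conditioned on the coarsest one, swapping only the first token changes everything generated after it, which is the property that intent injection relies on.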

Abstract

While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to environmental changes and skill transfer. We argue this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavior intent from execution details in end-to-end IL: "Mimic Intent, Not just Trajectories" (MINT). We achieve this via multi-scale frequency-space tokenization, which enforces a spectral decomposition of the action chunk representation. We learn action tokens with a multi-scale coarse-to-fine structure, forcing the coarsest token to capture low-frequency global structure and the finer tokens to encode high-frequency details. This yields an abstract Intent token that facilitates planning and transfer, and multi-scale Execution tokens that enable precise adaptation to environmental dynamics. Building on this hierarchy, our policy generates trajectories through next-scale autoregression, performing progressive intent-to-execution reasoning, thus boosting learning efficiency and generalization. Crucially, this disentanglement enables one-shot transfer of skills by simply injecting the Intent token from a demonstration into the autoregressive generation process. Experiments on several manipulation benchmarks and on a real robot demonstrate state-of-the-art success rates, superior inference efficiency, robust generalization against disturbances, and effective one-shot transfer.
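To make the coarse-to-fine spectral idea concrete, the sketch below builds band-limited versions of a single 1-D action chunk, one per scale. It is an illustration under assumptions rather than the SDAT tokenizer itself: the abstract only says "frequency-space", so the choice of a DCT, the cutoff values, and the function names are hypothetical, and the learned encoder and quantizer are omitted entirely.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine basis function over n steps."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)])
    return rows


def dct(signal):
    """Project the signal onto each cosine basis function."""
    basis = dct_matrix(len(signal))
    return [sum(b * x for b, x in zip(row, signal)) for row in basis]


def idct(coeffs):
    """Invert the orthonormal DCT by applying the transposed basis."""
    n = len(coeffs)
    basis = dct_matrix(n)
    return [sum(coeffs[k] * basis[k][t] for k in range(n)) for t in range(n)]


def band_limited_targets(signal, cutoffs):
    """For each scale, rebuild the signal from only its lowest `cutoff` coefficients."""
    coeffs = dct(signal)
    targets = []
    for cutoff in cutoffs:
        kept = [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]
        targets.append(idct(kept))
    return targets


# Hypothetical action chunk: a slow reach (low frequency) plus a small fast wiggle.
T = 16
chunk = [t / (T - 1) + 0.05 * math.sin(2 * math.pi * 6 * t / T) for t in range(T)]
cutoffs = [2, 4, 8, 16]  # assumed per-scale frequency cutoffs, coarse to fine
targets = band_limited_targets(chunk, cutoffs)

# Squared reconstruction error per scale; it shrinks as more frequencies are kept.
errors = [sum((a - b) ** 2 for a, b in zip(chunk, tgt)) for tgt in targets]
print(errors)
```

The coarsest target keeps only the overall shape of the motion, while each finer target adds back a band of higher frequencies. That mirrors the split the abstract describes between a low-frequency Intent token and high-frequency Execution tokens.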

Paper Structure (39 sections, 7 equations, 12 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Vision Language Action Models
Action Tokenization
Coarse-to-Fine Tokenization
Overview
Spectrally Disentangled Action Tokenizer
Action Encoder and Spectrum Decoder
Multi-Scale Residual Quantization
Scale-wise Spectral Reconstruction
Training Objective
MINT Policy Learning
Next-Scale Autoregressive Modeling
Intent-Based Action Ensemble
Model Architectures
...and 24 more sections

Figures (12)

Figure 1: Left: We propose the Spectrally Disentangled Action Tokenizer, which encodes action chunks into multi-scale tokens via scale-wise frequency-domain reconstruction constraints, where the coarsest scale captures global intent and finer scales encode execution residuals. Right: The t-SNE visualization of the $S_1$ token space shows that the learned $S_1$ tokens form distinct clusters corresponding to semantically consistent behaviors (e.g., "Pick up", "Move forward", and "Clockwise Rotation").
Figure 2: MINT Policy Overview. (a) MINT autoregressively predicts action tokens across $K$ temporal scales—moving from a global intent token to high-frequency execution tokens—which are subsequently mapped to continuous trajectories via the decoder. (b) Intent-based action ensemble ensures temporal consistency and smooth behavioral transitions, enhancing stability in long-horizon tasks.
Figure 3: One-shot transfer evaluation on OOD tasks in simulation. We evaluate generalization across three compositional shifts: New Layout, New Task, and Extended Horizon.
Figure 4: Real-world Experiment Setup.
Figure 5: Real-world task results. The violin plots show Bayesian posterior success rates. The distinct lettering indicates statistically distinguishable policies.
...and 7 more figures
