Mimic Intent, Not Just Trajectories
Renming Huang, Chendong Zeng, Wenjing Tang, Jingtian Cai, Cewu Lu, Panpan Cai
TL;DR
MINT tackles the generalization gap in Vision-Language-Action imitation learning by explicitly separating high-level behavioral intent from low-level execution. It introduces a Spectrally Disentangled Action Tokenizer (SDAT) that uses multi-scale frequency-space tokens to disentangle global intent (the coarsest scale) from execution details (finer scales), and a next-scale autoregressive policy that reasons from intent to action. The framework supports one-shot skill transfer via explicit intent token injection and demonstrates state-of-the-art performance across LIBERO, CALVIN, and MetaWorld, along with robust generalization under disturbances and real-world transfer with limited demonstrations. The result is a principled, scalable path toward planning-enabled imitation learning for robotic manipulation.
Abstract
While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to environmental changes and skill transfer. We argue this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavioral intent from execution details in end-to-end IL: \textit{``Mimic Intent, Not Just Trajectories'' (MINT)}. We achieve this via \textit{multi-scale frequency-space tokenization}, which enforces a spectral decomposition of the action chunk representation. We learn action tokens with a multi-scale coarse-to-fine structure, forcing the coarsest token to capture low-frequency global structure and the finer tokens to encode high-frequency details. This yields an abstract \textit{Intent token} that facilitates planning and transfer, and multi-scale \textit{Execution tokens} that enable precise adaptation to environmental dynamics. Building on this hierarchy, our policy generates trajectories through \textit{next-scale autoregression}, performing progressive \textit{intent-to-execution reasoning} that boosts learning efficiency and generalization. Crucially, this disentanglement enables \textit{one-shot transfer} of skills by simply injecting the Intent token from a demonstration into the autoregressive generation process. Experiments on several manipulation benchmarks and on a real robot demonstrate state-of-the-art success rates, superior inference efficiency, robust generalization against disturbances, and effective one-shot transfer.
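To make the idea of a coarse-to-fine spectral decomposition concrete, the following is a minimal sketch, not the SDAT tokenizer itself. It assumes a fixed discrete cosine transform (DCT) in place of the learned, quantized tokens, and the names `dct_matrix`, `split_bands`, `inject_intent`, and the band edges `(2, 6, 16)` are hypothetical choices made for illustration. It splits an action chunk into frequency bands, treats the lowest band as a stand-in for the Intent token, and mimics intent injection by swapping that band with the one from a demonstration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: rows index frequency, columns index time."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    basis[0] /= np.sqrt(2.0)  # rescale the DC row so the basis is orthonormal
    return basis


def split_bands(chunk: np.ndarray, edges: tuple) -> list:
    """Split an action chunk of shape (T, D) into time-domain frequency bands.

    `edges` are increasing frequency cut-offs ending at T (assumed values,
    not taken from the paper). Band 0 holds the lowest frequencies.
    """
    n_steps = chunk.shape[0]
    if edges[-1] != n_steps:
        raise ValueError("last edge must equal the chunk length")
    basis = dct_matrix(n_steps)
    coeffs = basis @ chunk  # frequency-space coefficients, shape (T, D)
    bands, lo = [], 0
    for hi in edges:
        masked = np.zeros_like(coeffs)
        masked[lo:hi] = coeffs[lo:hi]    # keep only this band's coefficients
        bands.append(basis.T @ masked)   # map the band back to the time domain
        lo = hi
    return bands


def inject_intent(target_bands: list, demo_bands: list) -> np.ndarray:
    """Combine the demo's coarsest band with the target's finer bands."""
    return demo_bands[0] + sum(target_bands[1:])


# Toy 2-D reaching motion over 16 steps, with small high-frequency jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16)[:, None]
chunk = np.hstack([t, t ** 2]) + 0.01 * rng.standard_normal((16, 2))

bands = split_bands(chunk, edges=(2, 6, 16))
# The bands partition the spectrum, so they sum back to the original chunk.
assert np.allclose(sum(bands), chunk)
```

Because the bands occupy disjoint frequency ranges of an orthonormal basis, replacing band 0 changes the global shape of the trajectory while leaving the finer-scale components untouched. In MINT the analogous swap is applied to a learned Intent token inside next-scale autoregressive generation, which this fixed-basis sketch does not model.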
