MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
Pei Zhou, Yanchao Yang
TL;DR
This work tackles the challenge of grounding manipulation concepts from unannotated demonstrations by introducing a Maximal Mutual Information (MaxMI) criterion that identifies physically significant key states without human labeling. It develops a Key State Localization Network (KSL-Net) that localizes key states by maximizing the mutual information (MI) between a key state and its preceding state, with MI estimated by a differentiable neural estimator, and integrates the discovered concepts into a concept-guided manipulation policy via the CoTPC framework. The approach is evaluated on complex tasks from ManiSkill2 and Franka Kitchen, showing that the discovered concepts align with human semantics, enrich concept granularity, and yield higher success rates and better generalization than strong baselines, including in unseen configurations and zero-shot scenarios. Overall, the MaxMI-based framework reduces labeling needs, strengthens the grounding between low-level states and high-level manipulation concepts, and improves policy performance and robustness across diverse robotic tasks.
Abstract
We aim to discover the manipulation concepts embedded in unannotated demonstrations, where each concept is recognized as a key physical state. The discovered concepts can facilitate training manipulation policies and promote generalization. Current methods that rely on multimodal foundation models to derive key states usually lack accuracy and semantic consistency due to limited multimodal robot data. In contrast, we introduce an information-theoretic criterion to characterize the regularities that signify a set of physical states. We also develop a framework that trains a concept discovery network using this criterion, thus bypassing the dependence on human semantics and alleviating costly human labeling. The proposed criterion is based on the observation that key states, which deserve to be conceptualized, often admit more physical constraints than non-key states. This phenomenon can be formalized as maximizing the mutual information between the putative key state and its preceding state, i.e., Maximal Mutual Information (MaxMI). By employing MaxMI, the trained key state localization network can accurately identify states of sufficient physical significance, and these states exhibit reasonable semantic compatibility with human perception. Furthermore, the proposed framework produces key states that lead to concept-guided manipulation policies with higher success rates and better generalization in various robotic tasks compared to the baselines, verifying the effectiveness of the proposed criterion.
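To make the criterion concrete, the following is a minimal toy sketch rather than the method described above. It assumes 1-D states, a key state that occurs at the same time step in every demonstration, and jointly Gaussian state pairs, so that the MI has the closed form -0.5 * log(1 - rho^2). The paper instead uses a learned localization network and a differentiable neural MI estimator. All names here (`gaussian_mi`, `mi_per_step`, `key_t`) and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def gaussian_mi(x, y):
    """MI in nats between two scalar variables, assuming joint Gaussianity."""
    rho = np.corrcoef(x, y)[0, 1]
    rho2 = min(rho ** 2, 1.0 - 1e-12)  # guard against log(0) when |rho| is ~1
    return -0.5 * np.log(1.0 - rho2)

def mi_per_step(states):
    """states has shape (num_demos, horizon).

    Returns an array whose entry i is the MI between the state at step i + 1
    and its preceding state at step i, estimated across demonstrations.
    """
    horizon = states.shape[1]
    return np.array(
        [gaussian_mi(states[:, t], states[:, t - 1]) for t in range(1, horizon)]
    )

# Synthetic demonstrations (an assumption for illustration only):
# states are independent noise at every step, except at key_t, where the
# state is tightly constrained by the preceding state.
rng = np.random.default_rng(0)
num_demos, horizon, key_t = 500, 20, 12
states = rng.normal(size=(num_demos, horizon))
states[:, key_t] = states[:, key_t - 1] + 0.05 * rng.normal(size=num_demos)

mi = mi_per_step(states)
best_t = int(np.argmax(mi)) + 1  # entry i of mi corresponds to step i + 1
print(best_t)
```

In this toy setting the step with maximal MI coincides with the constrained step `key_t`, which mirrors the intuition that a key state is more strongly determined by its preceding state than a non-key state is.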
