Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models

Chen Wang; Fei Xia; Wenhao Yu; Tingnan Zhang; Ruohan Zhang; C. Karen Liu; Li Fei-Fei; Jie Tan; Jacky Liang

TL;DR

Chain-of-Modality (CoM) is a prompting strategy in which a vision-language model analyzes a multimodal human demonstration (vision, muscle activity, and audio) one modality at a time, extracts a fine-grained task plan and control parameters, and then generates robot-executable code from a single video. By progressively integrating modalities, CoM extracts task plans and control parameters about three times more accurately than baselines and generalizes to unseen objects and to different robot embodiments. The result is a practical one-shot imitation workflow, validated in real-world robot experiments, that shows the potential of vision-language models for cross-embodiment robotics. Limitations include constrained audio analysis and open-loop execution, which point to future work on closed-loop control and richer audio cues.

Abstract

Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters, such as force, during task execution, which visual data alone cannot capture. In this work, we leverage sensing devices, such as armbands that measure human muscle activity and microphones that record sound, to capture the details of the human manipulation process, enabling robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision-Language Models to reason about multimodal human demonstration data: videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at https://chain-of-modality.github.io.
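
The idea of progressively integrating modalities can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `chain_of_modality`, the prompt wording, the modality order, and the `query_model` callable (a stand-in for any vision-language model client) are all assumptions made for illustration. It shows only the control flow, in which each modality is presented in turn together with the plan produced so far.

```python
from typing import Callable, Sequence, Tuple


def chain_of_modality(
    modalities: Sequence[Tuple[str, str]],
    query_model: Callable[[str], str],
) -> str:
    """Refine a task plan by presenting one modality at a time.

    `modalities` is an ordered list of (name, serialized_data) pairs.
    `query_model` is a hypothetical stand-in for a model call: it takes
    a prompt string and returns the model's text response.
    """
    plan = ""  # no analysis exists before the first modality
    for name, data in modalities:
        # Each prompt carries the plan so far, so later modalities
        # refine the earlier analysis instead of starting over.
        prompt = (
            f"Current task plan:\n{plan or '(none yet)'}\n\n"
            f"New modality: {name}\n{data}\n\n"
            "Refine the plan using this modality."
        )
        plan = query_model(prompt)
    return plan
```

The order in which modalities are passed is left to the caller, since the text above does not fix it. The string returned after the last modality is the refined plan, which a later step could turn into robot code.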
