Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models

Chen Wang, Fei Xia, Wenhao Yu, Tingnan Zhang, Ruohan Zhang, C. Karen Liu, Li Fei-Fei, Jie Tan, Jacky Liang

TL;DR

Chain-of-Modality (CoM) is a prompting strategy that analyzes the modalities of a human demonstration (vision, muscle activity, and audio) one at a time to extract fine-grained task plans and control parameters, then generates robot-executable code from a single video. By integrating the modalities progressively rather than all at once, CoM extracts task plans and control parameters more accurately than baselines and generalizes to unseen objects and to robots with different embodiments. The result is a practical one-shot imitation workflow that is validated in real-world robot experiments and shows the potential of vision-language models for cross-embodiment robotics. Limitations include constrained audio analysis and open-loop execution, which point to future work on closed-loop control and richer audio cues.

Abstract

Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data -- videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at https://chain-of-modality.github.io

Paper Structure

This paper contains 11 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: We introduce Chain-of-Modality (CoM), a prompting strategy that enables VLMs to recognize human task plans from a single multimodal video with force or audio information, and generate corresponding robot control code to reproduce the task.
  • Figure 2: Overview of Chain-of-Modality (using force as an example). (a) Baseline method - Merged: Merges multimodal information (vision, force, and hand pose) into a single input batch and queries the VLM to directly generate the final answer. (b) Chain-of-Modality (CoM): Analyzes each modality step-by-step, refining the analysis to produce the final answer. Example: First, the VLM uses force data to determine when force is applied. Then, with hand pose information, it infers that the human is grasping and twisting. Next, with image data, the VLM identifies the action as twisting a bottle cap. Finally, the VLM transforms the CoM analysis into a robot-executable Python program to reproduce the task.
  • Figure 3: Overview of Experiment Tasks. (a) Multimodal Human Video Input: Our framework processes a single-shot human video with force or audio data, using Chain-of-Modality to extract the task plan and control parameters, then generates a robot control program. (b) Robot Code Execution: The robot executes the program to replicate the task observed in the video. (c) Evaluation Setups: We evaluate the performance of the generated program in various experimental setups.
  • Figure 4: Qualitative results for Chain-of-Modality. We showcase task plans generated by CoM for four evaluation videos. CoM successfully segments the videos into subtasks, specifying the skills, force, and target objects at each stage.
  • Figure 5: Quantitative results for Chain-of-Modality. We compare CoM with baselines across three tasks using both Gemini and GPT.
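
The contrast drawn in Figure 2 between the Merged baseline and CoM can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the authors' implementation: the function names, the prompt wording, and the `query_vlm` callable (any function that maps a text prompt to a text reply) are all assumptions introduced here, and real modality data would be images and signals rather than strings.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function mapping a text prompt to a text reply.
VLM = Callable[[str], str]
Modality = Tuple[str, str]  # (modality name, serialized data); illustrative only


def merged_prompt(modalities: List[Modality], task_question: str) -> str:
    """Merged baseline: place every modality in one prompt and ask once."""
    blocks = [f"[{name}]\n{data}" for name, data in modalities]
    return "\n\n".join(blocks + [task_question])


def chain_of_modality(
    modalities: List[Modality], task_question: str, query_vlm: VLM
) -> str:
    """CoM-style loop: one VLM call per modality, each refining the analysis."""
    analysis = ""
    for name, data in modalities:
        prompt = (
            f"Analysis so far:\n{analysis or '(none)'}\n\n"
            f"New modality [{name}]:\n{data}\n\n"
            "Refine the analysis using this modality."
        )
        analysis = query_vlm(prompt)
    # Final call: turn the accumulated analysis into the requested output,
    # for example a robot-executable program.
    return query_vlm(f"Analysis:\n{analysis}\n\n{task_question}")
```

With the ordering from the Figure 2 example (force, then hand pose, then image), the loop issues three refinement calls followed by one final call, whereas the baseline would send the single string returned by `merged_prompt` in one query.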