VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, Chen Feng

TL;DR

This work presents SeeDo, a modular pipeline that enables a Vision-Language Model to interpret long-horizon human demonstration videos and generate robot task plans executable via language-model programs. By combining a hand-driven keyframe selector, visual prompting for robust object perception, and CoT-enhanced VLM reasoning, SeeDo achieves superior temporal and spatial understanding compared with strong video-based baselines and demonstrates deployment in both simulation and real hardware. The study introduces a specialized benchmark with three long-horizon categories and novel TSR/FSR/SSR metrics, plus ablations that highlight the importance of keyframe selection and visual prompts. Limitations include a restricted action space, incomplete spatial reasoning, and precision challenges in spatial positioning, with future work aimed at expanding actions and improving spatial intelligence.

Abstract

Vision Language Models (VLMs) have recently been adopted in robotics for their common-sense reasoning and generalization capabilities. Existing work has applied VLMs to generate task and motion plans from natural language instructions and to simulate training data for robot learning. In this work, we explore using a VLM to interpret human demonstration videos and generate robot task plans. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We named it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do". To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo's superior performance. We further deployed the generated task plans in both a simulation environment and on a real robot arm.

Paper Structure

This paper contains 15 sections, 6 figures, 3 tables, and 3 algorithms.

Figures (6)

  • Figure 1: VLM See, Robot Do. We designed an agent framework centered around a large Vision Language Model to interpret long-horizon human demonstration videos into task plans in natural language, which are then executed in simulated and real-world robots via language model programs and action primitive functions.
  • Figure 2: The SeeDo agent consists of three modules. From left to right, a) The Keyframe Selection module detects the operating hand in the video and plots its speed. The speed valleys are identified as keyframes. b) The Visual Prompting module detects and tracks objects and then applies the tracking results as visual prompts to each keyframe. c) The VLM Interpreter module instructs the GPT-4o to interpret keyframes, identify objects and actions in each keyframe, and generate task plans from the demonstration video. d) Plan Execution. The generated task plans are processed by Code-as-Policies into language model programs (LMPs) and call the robot APIs for execution.
  • Figure 3: We collect long-horizon human demonstration videos across three diverse categories as our benchmark and carry out both simulation and real-world experiments. Tasks from left to right: vegetable organization, garment organization, and wooden block stacking.
  • Figure 4: Results visualization on all three tasks.
  • Figure 5: Error type percentages of all the failure cases of all the methods. Note that error types are not exclusive. The barplot of the total success rates on all tasks is also presented. LLaVA-OV represents the LLaVA-OneVision model.
  • ...and 1 more figure
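
The Figure 2 caption describes keyframe selection as plotting the operating hand's speed and taking the speed valleys as keyframes. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the hand has already been detected and reduced to one (x, y) center per frame; the function names `hand_speeds` and `speed_valleys`, the `fps` value, and the `max_speed` threshold are all hypothetical choices made for illustration.

```python
from math import hypot


def hand_speeds(positions, fps=30.0):
    """Speed between consecutive hand centers, in pixels per second.

    positions: list of (x, y) hand centers, one per frame (assumed input).
    Returns a list with len(positions) - 1 entries.
    """
    return [
        hypot(x1 - x0, y1 - y0) * fps
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


def speed_valleys(speeds, max_speed):
    """Indices of local minima in the speed curve that fall below max_speed.

    A valley is a sample lower than its left neighbor and no higher than
    its right neighbor, so a flat minimum is reported once (at its start).
    max_speed is an illustrative threshold that filters shallow dips.
    """
    valleys = []
    for i in range(1, len(speeds) - 1):
        is_low = speeds[i] <= max_speed
        is_minimum = speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]
        if is_low and is_minimum:
            valleys.append(i)
    return valleys


if __name__ == "__main__":
    # Synthetic speed curve: the hand slows down twice (e.g. a pick and a place).
    speeds = [5.0, 3.0, 1.0, 3.0, 5.0, 4.0, 0.5, 4.0, 6.0]
    print(speed_valleys(speeds, max_speed=2.0))
```

In this sketch the threshold keeps only moments when the hand is nearly stationary, which is where a pick or a place would plausibly occur. A real pipeline would likely also smooth the speed curve before searching for valleys, since per-frame detections are noisy.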