Long-Horizon Visual Imitation Learning via Plan and Code Reflection

Quan Chen, Chenrui Shi, Qi Chen, Yuwei Wu, Zhi Gao, Xintong Zhang, Rui Gao, Kun Wu, Yunde Jia

TL;DR

This work tackles the difficulty of long-horizon visual imitation by coupling plan and code generation with dual reflection modules that verify and refine both the action plan and the executable code. The proposed LongVIL framework uses plan reflection to ensure temporal and spatial alignment with demonstrations and code reflection to guarantee semantic consistency between plans and generated code, forming a planning–verification–correction loop. A new benchmark, LongVILBench, provides 150 tasks across 3 manipulation domains and 1–18-step sequences to evaluate perception, reasoning, planning, and execution under diverse visual conditions. Empirical results show substantial performance gains over baselines, particularly on long-horizon tasks and under visually complex conditions, and real-world deployment on a UR5e robot demonstrates practical viability. The work highlights the importance of structured self-verification for scaling visual imitation to realistic, multi-step robot tasks.

Abstract

Learning from long-horizon demonstrations with complex action sequences presents significant challenges for visual imitation learning, particularly in understanding temporal relationships of actions and spatial relationships between objects. In this paper, we propose a new agent framework that incorporates two dedicated reflection modules to enhance both plan and code generation. The plan generation module produces an initial action sequence, which is then verified by the plan reflection module to ensure temporal coherence and spatial alignment with the demonstration video. The code generation module translates the plan into executable code, while the code reflection module verifies and refines the generated code to ensure correctness and consistency with the generated plan. These two reflection modules jointly enable the agent to detect and correct errors in both the plan generation and code generation, improving performance in tasks with intricate temporal and spatial dependencies. To support systematic evaluation, we introduce LongVILBench, a benchmark comprising 300 human demonstrations with action sequences of up to 18 steps. LongVILBench emphasizes temporal and spatial complexity across multiple task types. Experimental results demonstrate that existing methods perform poorly on this benchmark, whereas our new framework establishes a strong baseline for long-horizon visual imitation learning.
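
The generate, reflect, and correct cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`refine`, `video_to_code`), the use of a plain-text critique, the round limit, and the toy stand-in modules are all assumptions made for illustration. In the actual framework the four modules would presumably be model calls that consume the demonstration video.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]            # ordered action steps
Critique = Optional[str]    # None means reflection found no problem


def refine(generate: Callable[[Critique], object],
           reflect: Callable[[object], Critique],
           max_rounds: int = 3) -> object:
    """Generate a candidate, then regenerate while reflection reports an issue.

    If the round limit is reached, the last regenerated candidate is returned
    without a further check (an assumption of this sketch).
    """
    candidate = generate(None)
    for _ in range(max_rounds):
        critique = reflect(candidate)
        if critique is None:
            break
        candidate = generate(critique)
    return candidate


def video_to_code(video, plan_gen, plan_reflect, code_gen, code_reflect,
                  max_rounds: int = 3) -> Tuple[Plan, str]:
    """Hypothetical pipeline: the plan is checked against the video,
    then the code is checked against the accepted plan."""
    plan = refine(lambda c: plan_gen(video, c),
                  lambda p: plan_reflect(video, p), max_rounds)
    code = refine(lambda c: code_gen(plan, c),
                  lambda s: code_reflect(plan, s), max_rounds)
    return plan, code


# Toy stand-ins (hypothetical): the "video" is just the true step order.
DEMO_ORDER = ["pick red block", "place red block in bowl"]


def toy_plan_gen(video, critique):
    # The first draft has the steps in the wrong temporal order.
    return list(video) if critique else list(reversed(video))


def toy_plan_reflect(video, plan):
    return None if plan == list(video) else "step order does not match the video"


def toy_code_gen(plan, critique):
    # The first draft silently drops the final step of the plan.
    steps = plan if critique else plan[:-1]
    return "\n".join(f"robot.do({s!r})" for s in steps)


def toy_code_reflect(plan, code):
    missing = [s for s in plan if repr(s) not in code]
    return f"missing steps: {missing}" if missing else None


plan, code = video_to_code(DEMO_ORDER, toy_plan_gen, toy_plan_reflect,
                           toy_code_gen, toy_code_reflect)
print(plan)
print(code)
```

In this toy run each stage fails its first check and passes after one correction, which mirrors the idea that plan errors (temporal or spatial mismatch with the video) and code errors (mismatch with the plan) are caught by separate checks.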

Paper Structure

This paper contains 36 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Example images from LongVILBench. LongVILBench includes 150 tasks and 300 human demonstration videos, with tasks grouped into three levels based on the number of actions involved. The generated code can be verified both in the real world and in simulation.
  • Figure 2: An overview of our agent framework. The framework comprises four key modules: plan generation, plan reflection, code generation, and code reflection. Together, these modules turn a human demonstration video into a program that can be executed in a simulator or on a real-world robot.
  • Figure 3: Qualitative comparison between the baseline agent and the agent with reflection modules.
  • Figure 4: Real-world execution of symbolic plans on the UR5e robot. These examples demonstrate successful transfer of the generated programs from simulation to real-world settings.