ICLR: In-Context Imitation Learning with Visual Reasoning

Toan Nguyen, Weiduo Yuan, Songlin Wei, Hui Li, Daniel Seita, Yue Wang

TL;DR

This work presents In-Context Imitation Learning with Visual Reasoning (ICLR), a framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. The results suggest that embodied visual reasoning is a promising direction for improving the robustness and generalization of robotic in-context learning systems.

Abstract

In-context imitation learning enables robots to adapt to new tasks from a small number of demonstrations without additional training. However, existing approaches typically condition only on state-action trajectories and lack explicit representations of task intent. This limitation hinders performance in complex and ambiguous task settings where the same actions may be consistent with different objectives. To address this, we present In-Context Imitation Learning with Visual Reasoning (ICLR), a novel framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. ICLR also jointly learns to generate reasoning traces and low-level actions within a unified autoregressive transformer, enabling the model to mimic not only action prediction but also the reasoning process that leads to those actions. We extensively evaluate ICLR in both simulation and real-world manipulation tasks and demonstrate consistent improvements in success rates and generalization to unseen tasks and novel object configurations compared to other in-context imitation learning methods. These results suggest that incorporating embodied visual reasoning represents a promising direction for enhancing the robustness and generalization of robotic in-context learning systems.
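
The joint generation of reasoning traces and actions described above can be illustrated with a minimal Python sketch. It only shows the token ordering (state, then reasoning trace, then action) and the closed-loop step in which a trace is produced before the action. The names `interleave`, `rollout_step`, `predict_trace`, and `predict_action` are hypothetical and stand in for the paper's encoders and transformer; this is not the authors' implementation.

```python
from typing import Any, Callable, List, Sequence, Tuple

Token = Tuple[str, Any]  # (modality, payload); payloads are placeholders here


def interleave(states: Sequence[Any],
               traces: Sequence[Any],
               actions: Sequence[Any]) -> List[Token]:
    """Flatten one demonstration into state/reasoning/action order per step."""
    if not (len(states) == len(traces) == len(actions)):
        raise ValueError("states, traces and actions must have equal length")
    sequence: List[Token] = []
    for s, r, a in zip(states, traces, actions):
        sequence.append(("state", s))
        sequence.append(("reasoning", r))
        sequence.append(("action", a))
    return sequence


def rollout_step(context: List[Token],
                 state: Any,
                 predict_trace: Callable[[List[Token]], Any],
                 predict_action: Callable[[List[Token]], Any]) -> Tuple[Any, Any]:
    """One closed-loop step: append the new state, predict the trace, then the action.

    predict_trace and predict_action are hypothetical stand-ins for the model.
    """
    context.append(("state", state))
    trace = predict_trace(context)      # reasoning comes first
    context.append(("reasoning", trace))
    action = predict_action(context)    # action is conditioned on the trace
    context.append(("action", action))
    return trace, action
```

In this sketch the prompt demonstrations would be passed through `interleave` once, and `rollout_step` would then be called at every control step with the latest observation, so each predicted action can attend to the trace generated just before it.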

Paper Structure (17 sections, 1 equation, 7 figures, 3 tables)

Figures (7)

  • Figure 1: General framework overview. Our method augments prompt demos with keypoint-based visual reasoning traces in the image space, shown above with the overlaid polyline in the middle column. During inference, the model also performs visual reasoning before predicting the subsequent low-level robot action. The task's language description is included for clarity.
  • Figure 2: Method overview. (A) To generate the visual reasoning trace at a given time step, we uniformly sample five third-view images from that time step to the end of the trajectory and use Molmo2 to predict the gripper’s pixel location in each image. (B) Multi-view camera observations and proprioceptive states are encoded by a state encoder to produce state tokens $f_s$. Visual reasoning traces are embedded by a reasoning encoder to produce reasoning tokens $f_r$, and actions are embedded by an action encoder to produce action tokens $f_a$. (C) These modality-specific tokens are interleaved and fed into a causal transformer, which autoregressively predicts the next reasoning trace followed by the corresponding action. During training, teacher forcing is applied over reasoning and action tokens. At inference, the model first generates a reasoning trace and then produces the action in a closed-loop manner.
  • Figure 3: Real robot setting. We use a Franka Research 3 robot arm equipped with a UMI gripper. Visual observations are captured by two RealSense cameras. Teleoperation for data collection and test-time prompt demonstration recording is performed using a GELLO system. Testing objects appearing in training episodes are shown in the bottom-left box, while completely unseen testing objects are shown in the bottom-right box.
  • Figure 4: Qualitative results. Rollout examples of our complete ICLR model in simulation (first two rows) and real-world settings (last two rows). All presented visual traces are predicted by our model.
  • Figure 5: Three types of prompt demonstrations. The task of picking up the tomato and putting it in the grey bowl is selected.
  • ...and 2 more figures
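
Step (A) of Figure 2, which builds a trace from five uniformly sampled frames, can be sketched as follows. This is a minimal illustration under stated assumptions: both endpoints are included, indices are floored to integers, and indices repeat when fewer than five frames remain. None of these details are given in the caption. The `locate_gripper` callable is a hypothetical placeholder for the Molmo2 query and is not a real API.

```python
from typing import Any, Callable, List, Sequence, Tuple

Pixel = Tuple[int, int]


def sample_trace_frame_indices(t: int, last: int, num_points: int = 5) -> List[int]:
    """Evenly spaced frame indices from t to last (both included, an assumption)."""
    if not 0 <= t <= last:
        raise ValueError("require 0 <= t <= last")
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    span = last - t
    # Integer arithmetic keeps the result deterministic; indices repeat if span is small.
    return [t + (i * span) // (num_points - 1) for i in range(num_points)]


def build_trace(frames: Sequence[Any],
                t: int,
                locate_gripper: Callable[[Any], Pixel],
                num_points: int = 5) -> List[Pixel]:
    """Keypoint trace for time step t: one gripper pixel per sampled future frame.

    locate_gripper is a hypothetical stand-in for the vision-language model call.
    """
    indices = sample_trace_frame_indices(t, len(frames) - 1, num_points)
    return [locate_gripper(frames[i]) for i in indices]
```

For a nine-frame trajectory and `t = 0`, the sampled indices under these assumptions are 0, 2, 4, 6, and 8; at the final time step all five indices collapse to the last frame.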