ICLR: In-Context Imitation Learning with Visual Reasoning

Toan Nguyen; Weiduo Yuan; Songlin Wei; Hui Li; Daniel Seita; Yue Wang

TL;DR

This work presents In-Context Imitation Learning with Visual Reasoning (ICLR), a framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. The results suggest that incorporating embodied visual reasoning is a promising direction for improving the robustness and generalization of robotic in-context learning systems.

Abstract

In-context imitation learning enables robots to adapt to new tasks from a small number of demonstrations without additional training. However, existing approaches typically condition only on state-action trajectories and lack explicit representations of task intent. This limitation hinders performance in complex and ambiguous task settings where the same actions may be consistent with different objectives. To address this, we present In-Context Imitation Learning with Visual Reasoning (ICLR), a novel framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. ICLR also jointly learns to generate reasoning traces and low-level actions within a unified autoregressive transformer, enabling the model to mimic not only action prediction but also the reasoning process that leads to those actions. We extensively evaluate ICLR in both simulation and real-world manipulation tasks and demonstrate consistent improvements in success rates and generalization to unseen tasks and novel object configurations compared to other in-context imitation learning methods. These results suggest that incorporating embodied visual reasoning represents a promising direction for enhancing the robustness and generalization of robotic in-context learning systems.
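
The reasoning-then-action ordering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `interleave`, `rollout_step`, `predict_trace`, and `predict_action` are hypothetical, tokens are stood in for by plain Python values, and the per-step order (state, then reasoning trace, then action) is an assumption based on the abstract and the Figure 2 caption.

```python
from typing import Any, Callable, List, Sequence, Tuple

Token = Tuple[str, Any]  # (modality tag, payload); payloads are placeholders here


def interleave(states: Sequence[Any], traces: Sequence[Any],
               actions: Sequence[Any]) -> List[Token]:
    """Flatten one demonstration into state, reasoning, action triples (assumed order)."""
    if not (len(states) == len(traces) == len(actions)):
        raise ValueError("states, traces, and actions must have equal length")
    seq: List[Token] = []
    for s, r, a in zip(states, traces, actions):
        seq += [("state", s), ("reasoning", r), ("action", a)]
    return seq


def rollout_step(context: List[Token], state: Any,
                 predict_trace: Callable[[List[Token]], Any],
                 predict_action: Callable[[List[Token]], Any]):
    """One closed-loop step: generate the reasoning trace first, then the action.

    predict_trace and predict_action are hypothetical stand-ins for the
    causal transformer's two prediction stages.
    """
    ctx = context + [("state", state)]
    trace = predict_trace(ctx)
    ctx = ctx + [("reasoning", trace)]
    action = predict_action(ctx)  # the action is conditioned on the generated trace
    return trace, action, ctx + [("action", action)]


# Usage: a two-step prompt demonstration followed by one inference step.
prompt = interleave(["s0", "s1"], ["r0", "r1"], ["a0", "a1"])
trace, action, new_ctx = rollout_step(
    prompt, "s2",
    predict_trace=lambda ctx: "r2",
    predict_action=lambda ctx: ctx[-1][1] + "->a2",
)
```

Because the context is copied rather than mutated, the prompt demonstration can be reused across rollouts; whether the actual system caches or truncates context is not stated in the source.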

Paper Structure (17 sections, 1 equation, 7 figures, 3 tables)

INTRODUCTION
RELATED WORK
In-Context Imitation Learning
Robotic Embodied Reasoning
PROBLEM STATEMENT
METHOD
Training Data Formulation
Visual Reasoning Trace Generation
In-Context Imitation Learning with Visual Reasoning
EXPERIMENTS
Models
Simulation Experiments
Real Robot Experiments
Ablation Studies
DISCUSSION
...and 2 more sections

Figures (7)

Figure 1: General framework overview. Our method augments prompt demos with keypoint-based visual reasoning traces in the image space, shown above with the overlaid polyline in the middle column. During inference, the model also performs visual reasoning before predicting the subsequent low-level robot action. The task's language description is included for clarity.
Figure 2: Method overview. (A) To generate the visual reasoning trace at a given time step, we uniformly sample five third-view images from that time step to the end of the trajectory and use Molmo2 to predict the gripper’s pixel location in each image. (B) Multi-view camera observations and proprioceptive states are encoded by a state encoder to produce state tokens $f_s$. Visual reasoning traces are embedded by a reasoning encoder to produce reasoning tokens $f_r$, and actions are embedded by an action encoder to produce action tokens $f_a$. (C) These modality-specific tokens are interleaved and fed into a causal transformer, which autoregressively predicts the next reasoning trace followed by the corresponding action. During training, teacher forcing is applied over reasoning and action tokens. At inference time, the model first generates a reasoning trace and then produces the action in a closed-loop manner.
Figure 3: Real robot setting. We use a Franka Research 3 robot arm equipped with a UMI gripper. Visual observations are captured by two RealSense cameras. Teleoperation for data collection and test-time prompt demonstration recording is performed using a GELLO system. Testing objects appearing in training episodes are shown in the bottom-left box, while completely unseen testing objects are shown in the bottom-right box.
Figure 4: Qualitative results. Rollout examples of our complete ICLR model in simulation (top two rows) and real-world settings (bottom two rows). All presented visual traces are predicted by our model.
Figure 5: Three types of prompt demonstrations. The task of picking up the tomato and putting it in the grey bowl is selected.
...and 2 more figures
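
The trace-generation step in the Figure 2 (A) caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `sample_future_indices` and `build_trace` are hypothetical names, `locate_gripper` is a placeholder callable for the Molmo2 gripper-pixel prediction, and including both the current and final frames in the sample is an assumption the caption does not confirm.

```python
from typing import Any, Callable, List, Sequence, Tuple


def sample_future_indices(t: int, num_frames: int, k: int = 5) -> List[int]:
    """Pick k frame indices spread uniformly over [t, num_frames - 1].

    Endpoints are included (an assumption). When fewer than k frames
    remain, indices repeat; the source does not say how this is handled.
    """
    if not 0 <= t < num_frames:
        raise ValueError("t must index a frame in the trajectory")
    if k < 2:
        return [t]
    span = (num_frames - 1) - t
    return [t + (i * span) // (k - 1) for i in range(k)]


def build_trace(frames: Sequence[Any], t: int,
                locate_gripper: Callable[[Any], Tuple[int, int]],
                k: int = 5) -> List[Tuple[int, int]]:
    """Return a polyline of (u, v) pixel keypoints for the trace at step t.

    locate_gripper is a hypothetical stand-in for the vision-language
    model call that returns the gripper's pixel location in one image.
    """
    return [locate_gripper(frames[i]) for i in sample_future_indices(t, len(frames), k)]


# Usage with dummy "frames" that already carry a pixel location.
frames = [(10 * i, 5 * i) for i in range(9)]
trace = build_trace(frames, t=4, locate_gripper=lambda frame: frame)
```

Recomputing the trace at every time step, always ending at the final frame, means each trace describes the remaining motion rather than the whole trajectory.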
