SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM

Makoto Sato, Yusuke Iwasawa, Yujin Tang, So Kuroki

TL;DR

SAIL reframes robot imitation as an iterative refinement problem that scales with test-time compute. It searches with Monte Carlo Tree Search (MCTS), where each node is a complete trajectory and each edge is a trajectory refinement.

Abstract

In-context imitation learning allows robots to acquire skills from demonstrations, yet one-shot trajectory generation remains fragile under environmental variation. We propose SAIL, a framework that reframes robot imitation as an iterative refinement problem that scales with test-time compute. SAIL uses Monte Carlo Tree Search (MCTS), where each node is a complete trajectory and each edge is a trajectory refinement. The search is guided by three core components: an automated archive of successful trajectories for contextually relevant retrieval, a vision-language model (VLM)-based scoring mechanism for trajectory evaluation, and step-level feedback that provides trajectory-aligned scores for iterative refinement. Experiments across six diverse manipulation tasks in simulation, together with real-world validation, show that increasing test-time compute consistently improves success rates, reaching up to 95% on complex tasks. Our results suggest that trajectory-level test-time scaling is a robust path toward more generalizable robotic agents.
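
The search structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `search`, `refine`, `score`, and `max_children` are hypothetical, the UCB1 selection rule is an assumption (the abstract does not state which selection rule SAIL uses), and `refine` and `score` stand in for the policy VLM and the scoring VLM respectively.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Trajectory = List[float]  # hypothetical stand-in for a sequence of waypoints


@dataclass
class Node:
    """One search node holds one complete trajectory."""
    trajectory: Trajectory
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0

    def ucb(self, c: float) -> float:
        # UCB1 is an assumed selection rule, not taken from the source.
        mean = self.total_value / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def search(
    root_trajectory: Trajectory,
    refine: Callable[[Trajectory], Trajectory],  # stand-in for the policy VLM
    score: Callable[[Trajectory], float],        # stand-in for the scoring VLM
    budget: int,                                 # number of nodes to expand
    c: float = 1.0,
    max_children: int = 3,
) -> Tuple[Trajectory, float]:
    """Expand `budget` nodes and return the best trajectory found."""
    root = Node(root_trajectory)
    value = score(root_trajectory)
    root.visits, root.total_value = 1, value
    best_trajectory, best_value = root_trajectory, value

    for _ in range(budget):
        # Selection: descend through fully expanded nodes by UCB.
        node = root
        while len(node.children) >= max_children:
            node = max(node.children, key=lambda n: n.ucb(c))

        # Expansion: one refinement of the selected trajectory is one new edge.
        child = Node(refine(node.trajectory), parent=node)
        node.children.append(child)

        # Evaluation: score the complete refined trajectory.
        value = score(child.trajectory)
        if value > best_value:
            best_trajectory, best_value = child.trajectory, value

        # Backpropagation: update statistics from the child up to the root.
        current: Optional[Node] = child
        while current is not None:
            current.visits += 1
            current.total_value += value
            current = current.parent

    return best_trajectory, best_value
```

In this sketch `budget` plays the role of test-time compute: a larger budget expands more nodes, so the best value found can only stay the same or improve when `refine` and `score` are deterministic.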

Paper Structure (20 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Left: We perform test-time scaling with MCTS over VLM-proposed trajectories, using similarity-based retrieval from an archive of successful rollouts and step-level feedback from a VLM evaluator. Right: Success improves with more expanded MCTS nodes (test-time compute); random retrieval and weak feedback saturate early, whereas our retrieval and step-level feedback keep yielding gains.
  • Figure 2: Method overview. SAIL refines trajectories at test time via MCTS, where each node is a complete trajectory proposed by a policy VLM for a seed. (1) A shared trajectory archive stores successful rollouts across seeds and retrieves visually similar trajectories as in-context demonstrations. (2) Each proposal is executed and scored by a scoring VLM that (i) decomposes the task into ordered subtasks from one demo and (ii) estimates per-frame completion to yield a scalar node value. (3) The scoring VLM aligns the progress scores to waypoints to provide step-level feedback for the next refinement.
  • Figure 3: Digital-twin experimental setup for the BlockIntoBowl task. The task consists of three sequential steps: (1) grasp the blue block, (2) move to the bowl, and (3) release the block into the bowl. The left panel shows the simulation environment, and the right panel shows the corresponding real-world setup.
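
Components (1) and (3) of the Figure 2 caption, the shared trajectory archive and the alignment of progress scores to waypoints, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the paper's method: `TrajectoryArchive`, `node_value`, and `align_to_waypoints` are hypothetical names, cosine similarity over precomputed embeddings stands in for the unspecified visual similarity measure, the scalar node value is assumed to be the final frame's completion estimate, and waypoints are assumed to be evenly spaced over the executed frames.

```python
import math
from typing import Any, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TrajectoryArchive:
    """Stores successful rollouts only and retrieves the most similar ones."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], Any]] = []

    def add(self, embedding: Sequence[float], trajectory: Any, success: bool) -> None:
        # Failed rollouts are not archived.
        if success:
            self._items.append((list(embedding), trajectory))

    def retrieve(self, query: Sequence[float], k: int = 2) -> List[Any]:
        # Rank stored rollouts by similarity to the query scene embedding.
        ranked = sorted(self._items, key=lambda it: cosine(query, it[0]), reverse=True)
        return [trajectory for _, trajectory in ranked[:k]]


def node_value(frame_progress: Sequence[float]) -> float:
    """Assumption: the node value is the completion estimate at the last frame."""
    return frame_progress[-1] if frame_progress else 0.0


def align_to_waypoints(frame_progress: Sequence[float], n_waypoints: int) -> List[float]:
    """Assign each waypoint the progress score of its (assumed evenly spaced) frame."""
    if not frame_progress or n_waypoints <= 0:
        return []
    last = len(frame_progress) - 1
    if n_waypoints == 1:
        return [frame_progress[last]]
    return [frame_progress[round(i * last / (n_waypoints - 1))] for i in range(n_waypoints)]
```

The per-waypoint scores returned by `align_to_waypoints` are the kind of step-level signal that could be passed back to the policy for the next refinement, while `retrieve` supplies the in-context demonstrations.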