Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation

Vivek Myers; Bill Chunyuan Zheng; Oier Mees; Sergey Levine; Kuan Fang

TL;DR

This work tackles the problem of adapting language-conditioned robot policies to unseen, long-horizon tasks from only a few demonstrations. It introduces Policy Adaptation via Language Optimization (PALO), which uses vision-language models (VLMs) to decompose high-level instructions into subtasks and jointly optimizes the decomposition with the partition of each demonstration trajectory into subtasks, enabling rapid nonparametric adaptation without a large fine-tuning dataset. The approach is supported by a regret analysis that decomposes out-of-distribution performance into the pre-trained policy’s in-distribution error and the VLM’s decomposition accuracy, plus sampling-related terms. Empirically, PALO achieves strong performance on real-world BridgeDataV2 tasks, outperforming zero-shot and fine-tuned baselines across multiple scenes and demonstrating robust long-horizon behavior with as few as five demonstrations, underscoring the practical value of semantic task structure for robotic adaptation.

Abstract

Learned language-conditioned robot policies often struggle to adapt effectively to new real-world tasks, even when pre-trained across a diverse set of instructions. We propose a novel approach for few-shot adaptation to unseen tasks that exploits the semantic understanding of task decomposition provided by vision-language models (VLMs). Our method, Policy Adaptation via Language Optimization (PALO), combines a handful of demonstrations of a task with proposed language decompositions sampled from a VLM to enable rapid nonparametric adaptation, avoiding the need for a larger fine-tuning dataset. We evaluate PALO in extensive real-world experiments consisting of challenging unseen, long-horizon robot manipulation tasks. We find that PALO can consistently complete long-horizon, multi-tier tasks in the real world, outperforming state-of-the-art pre-trained generalist policies and methods that have access to the same demonstrations.
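The selection step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical `policy(state, instruction)` callable that returns an action, scores each candidate decomposition by squared action error on the demonstrations themselves, and finds the best in-order partition of each demonstration into contiguous, non-empty segments by dynamic programming. The function names `best_partition_cost` and `select_decomposition` are illustrative only.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

# Hypothetical policy interface: (state, instruction) -> predicted action.
Policy = Callable[[np.ndarray, str], np.ndarray]


def best_partition_cost(
    states: Sequence[np.ndarray],
    actions: Sequence[np.ndarray],
    subtasks: Sequence[str],
    policy: Policy,
) -> float:
    """Minimum total squared action error over all ways of splitting one
    demonstration into len(subtasks) contiguous, non-empty, in-order segments."""
    T, K = len(states), len(subtasks)
    # err[k, t]: error at step t when the policy is conditioned on subtask k.
    err = np.array(
        [
            [float(np.sum((policy(states[t], subtasks[k]) - actions[t]) ** 2))
             for t in range(T)]
            for k in range(K)
        ]
    )
    # dp[k, t]: best cost of steps 0..t with step t assigned to subtask k.
    dp = np.full((K, T), np.inf)
    dp[0, 0] = err[0, 0]
    for t in range(1, T):
        for k in range(K):
            stay = dp[k, t - 1]
            advance = dp[k - 1, t - 1] if k > 0 else np.inf
            dp[k, t] = err[k, t] + min(stay, advance)
    # Infinite if the demonstration has fewer steps than subtasks.
    return float(dp[K - 1, T - 1])


def select_decomposition(
    demos: Sequence[Tuple[Sequence[np.ndarray], Sequence[np.ndarray]]],
    candidates: Sequence[List[str]],
    policy: Policy,
) -> Tuple[List[str], float]:
    """Pick the candidate decomposition with the lowest summed cost over demos."""
    costs = [
        sum(best_partition_cost(s, a, c, policy) for s, a in demos)
        for c in candidates
    ]
    best = int(np.argmin(costs))
    return candidates[best], costs[best]
```

In this sketch the pre-trained policy is never updated; only the language input and the segment boundaries are searched, which is the sense in which the adaptation is nonparametric. In practice the candidate lists would come from a VLM, and a held-out split of the demonstrations could be used for scoring.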

Paper Structure (45 sections, 6 theorems, 39 equations, 14 figures, 3 tables, 1 algorithm)

This paper contains 45 sections, 6 theorems, 39 equations, 14 figures, 3 tables, 1 algorithm.

Introduction
Related Work
Few-shot learning.
Language-conditioned robotic control.
Foundation models and robotics.
Notation
Problem Statement
Task Decomposition with Language
Few-Shot Adaptation through Language Decomposition
Learning Composable Instruction-Following Primitives
Analysis of PALO
System Details
Experiments
Experimental Setup
...and 30 more sections

Key Result

theorem 1

The (out-of-distribution) regret of PALO on $\rho_{\text{target}}$ can be bounded as: [...] where $\pi_{\text{PALO}}$ is the result of Algorithm 1, $\hat{\pi}(s_{t},\ell)$ is the policy trained on $\mathcal{D}_{\text{prior}}$ (see the policy learning section), and $t\sim \operatorname{Unif}(1\ldots H)$.

Figures (14)

Figure 1: An overview of the PALO algorithm for few-shot adaptation with language. (Left) We build off a pre-trained policy that has learned to follow low-level language instructions from a large dataset of expert demonstrations. (Middle) Given a new task and a few expert demonstrations, we use a VLM to propose candidate decompositions into subtasks. We optimize over these decompositions jointly with the partitions of trajectories into subtasks, selecting the subtask decomposition that minimizes the validation error of the learned policy. (Right) At test time, we condition the pre-trained policy on the selected decomposition to solve the task.
Figure 2: PALO enables pre-trained generalist policies to adapt to new tasks with as few as five demonstrations by searching in language space instead of parameter space.
Figure 3: A visualization of an example execution of our method on the long-horizon task "put the beet toy in the drawer." The VLM decomposes $\ell$ into candidate high-level subtasks $\ell^{h}_{1:K}$ and low-level subtasks $\ell^{l}_{1:K}$ and optimizes over the expert trajectories. The optimal $\ell^{h}_{1:K}$ and $\ell^{l}_{1:m}$ are chosen and unrolled in real-world evaluations, which result in successful completion of the task (trajectory shown in gray).
Figure 4: Sample rollouts using PALO on unseen testing tasks.
Figure 5: Comparison of PALO with baseline methods on different scenes, with one standard error.
...and 9 more figures

Theorems & Definitions (11)

theorem 1
proof
lemma 1: alquier2024userfriendly
lemma 2
lemma 3
lemma 4
lemma 5
proof: Proof of \ref{thm:overlap}
proof: Proof of \ref{thm:exp_bound}
proof: Proof of \ref{thm:vlm_error}
...and 1 more
