Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging

Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, Sergey Levine

TL;DR

Generalist robot policies generalize across diverse environments but struggle when finetuned on limited demonstrations. RETAIN addresses this by linearly interpolating pretrained and finetuned policy weights, with extensions for co-finetuning and modality-specific merging, to preserve generalist capabilities while learning a new task. Across real-world and simulated experiments, RETAIN improves out-of-distribution generalization on the new task and retains generalist skills; its performance scales with the amount of pretraining data, and it supports continual learning of multiple skills. The approach offers a simple, effective, and scalable solution for robust finetuning in low-data regimes, enabling lifelong skill acquisition without catastrophic forgetting.

Abstract

Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations, not only losing their prior abilities to solve a wide variety of generalist tasks but also failing to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of a finetuned model with those of the pretrained model. Across extensive simulated and real-world experiments, such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and the finetuned model on out-of-distribution variations of the new task. Moreover, model merging performance scales with the amount of pretraining data, and merging enables continual acquisition of new skills in a lifelong learning setting without sacrificing previously learned generalist abilities.
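
The weight interpolation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_params`, the coefficient `alpha` and its convention (0 gives the pretrained weights, 1 the finetuned weights), the `alpha_by_prefix` override used as a stand-in for modality-specific merging, and the parameter names `vision.w` and `action.w` are all assumptions made for illustration. Parameters are represented as plain lists of floats so the example has no dependencies.

```python
from typing import Dict, List, Optional

# Hypothetical representation: parameter name -> flattened weights.
Params = Dict[str, List[float]]


def merge_params(
    pretrained: Params,
    finetuned: Params,
    alpha: float,
    alpha_by_prefix: Optional[Dict[str, float]] = None,
) -> Params:
    """Linearly interpolate two parameter sets with identical names and sizes.

    Assumed convention: alpha = 0 returns the pretrained weights and
    alpha = 1 returns the finetuned weights. alpha_by_prefix optionally
    overrides alpha for parameters whose name starts with a given prefix
    (an illustrative stand-in for modality-specific merging).
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("parameter names differ between the two models")
    merged: Params = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Pick the first matching per-prefix coefficient, else the global one.
        a = alpha
        for prefix, override in (alpha_by_prefix or {}).items():
            if name.startswith(prefix):
                a = override
                break
        merged[name] = [(1.0 - a) * p + a * f for p, f in zip(w_pre, w_ft)]
    return merged


# Toy example with made-up parameter names and values.
pretrained = {"vision.w": [0.0, 2.0], "action.w": [1.0, 1.0]}
finetuned = {"vision.w": [4.0, 2.0], "action.w": [3.0, 5.0]}
merged = merge_params(pretrained, finetuned, alpha=0.5)
```

In this sketch, a single global `alpha` trades off retained generalist behavior against specialization to the new task, while per-prefix overrides let different components (for example, a vision encoder versus an action decoder) be merged with different coefficients. How the coefficient is chosen is not specified here and would need to come from the paper itself.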

Paper Structure

This paper contains 57 sections, 4 equations, 23 figures, 18 tables.

Figures (23)

  • Figure 1: State-of-the-art generalist policies typically consist of a vision encoder, language model backbone, and action decoder ("action expert").
  • Figure 2: Example filmstrips of the in-distribution (ID) tasks and out-of-distribution (OOD) tasks from the DROID robot experiments in the experiments section.
  • Figure 3: The standard approach for policy finetuning often overfits. As the policy is trained for more gradient steps, it performs worse on tasks other than the new target task ("GENERALIST") and may even start to degrade on scenarios seen in the finetuning data ("ID"). Most importantly, it is not able to transfer the generality of a base policy to do well under variations of the target task (new object positions, instances, viewpoints; "OOD").
  • Figure 4: RETAIN enables continual merging of new skills into generalist policy backbones.
  • Figure 5: We evaluate policy finetuning on two real-world DROID finetuning tasks (left, middle) and three simulated LIBERO finetuning tasks (right, only one visualized here). In each task, we collect a modest number of demonstrations (50 to 100) in a comparatively narrow setting (blue), but evaluate on a much broader set of variations for the same task (yellow), including variations to scene, object instances, initial positions, lighting conditions, distractors, and viewpoints. This tests transfer of the generalization ability of the pretrained policy to the target task. Example trajectories are shown in the filmstrips figure (Figure 2).
  • ...and 18 more figures