Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging

Yajat Yadav; Zhiyuan Zhou; Andrew Wagenmaker; Karl Pertsch; Sergey Levine

TL;DR

Generalist robot policies generalize across diverse environments but struggle when finetuned on limited demonstrations. The proposed method, RETAIN, addresses this by linearly interpolating the pretrained and finetuned policy weights, with extensions for co-finetuning and modality-specific merging, to preserve generalist capabilities while learning a new task. Across real-world and simulated experiments, RETAIN improves out-of-distribution generalization on the new task and retains generalist skills; its performance scales with the amount of pretraining data, and it supports continual learning of multiple skills. The approach offers a simple, effective, and scalable solution for robust finetuning in low-data regimes, enabling lifelong skill acquisition without catastrophic forgetting.

Abstract

Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations: they not only lose their prior ability to solve a wide variety of generalist tasks but also fail to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of a finetuned model with those of the pretrained model. Across extensive simulated and real-world experiments, such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and finetuned models on out-of-distribution variations of the new task. Moreover, we show that the performance of model merging scales with the amount of pretraining data, and that merging enables continual acquisition of new skills in a lifelong learning setting without sacrificing previously learned generalist abilities.
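The weight interpolation described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes both checkpoints share the same parameter names and shapes, uses plain Python lists as a toy stand-in for model tensors, and introduces a hypothetical mixing coefficient `alpha` (0 recovers the pretrained weights, 1 the finetuned weights). The `alpha_by_prefix` argument and the parameter names `vision.w` and `action.w` are also illustrative assumptions, meant only to suggest how a modality-specific variant might assign different coefficients to different parts of the model.

```python
from typing import Dict, List, Optional

Params = Dict[str, List[float]]  # toy stand-in for a model state dict


def merge_weights(
    pretrained: Params,
    finetuned: Params,
    alpha: float,
    alpha_by_prefix: Optional[Dict[str, float]] = None,
) -> Params:
    """Return (1 - a) * pretrained + a * finetuned, parameter by parameter.

    alpha = 0 gives the pretrained weights, alpha = 1 the finetuned weights.
    alpha_by_prefix (hypothetical) overrides alpha for parameters whose name
    starts with a given prefix, e.g. a different coefficient per modality.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both checkpoints must share the same parameter names")
    merged: Params = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch for {name}")
        # Pick the coefficient: first matching prefix wins, else the global alpha.
        a = alpha
        for prefix, value in (alpha_by_prefix or {}).items():
            if name.startswith(prefix):
                a = value
                break
        merged[name] = [(1.0 - a) * b + a * t for b, t in zip(base, tuned)]
    return merged


# Toy checkpoints with made-up parameter names and values.
pretrained = {"vision.w": [0.0, 2.0], "action.w": [1.0, 1.0]}
finetuned = {"vision.w": [4.0, 2.0], "action.w": [3.0, 5.0]}

# Uniform merge: every parameter is the midpoint of the two checkpoints.
merged = merge_weights(pretrained, finetuned, alpha=0.5)

# Per-prefix merge: keep the vision weights at their pretrained values.
per_modality = merge_weights(
    pretrained, finetuned, alpha=0.5, alpha_by_prefix={"vision.": 0.0}
)
```

In practice the coefficient would presumably be chosen by evaluating the merged policy on both the new task and the original generalist tasks; the values 0.5 and 0.0 here are arbitrary.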
