
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

Sombit Dey, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel

TL;DR

Problem: Visual generalization gaps in robotic foundation models due to vision encoder adaptation. Approach: ReVLA, a gradual linear-model merging method, reverts vision backbones to pretrained weights during training while keeping the robotic training protocol. Findings: ReVLA improves out-of-domain visual robustness by up to ~77% relative (12.5 percentage points) and maintains or enhances in-domain performance on SIMPLER Visual Matching. Significance: provides a practical route to restore generalist perception in robotic systems and highlights the value of preserving diverse visual representations during training.

Abstract

Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. Open Vision Language Action models, which showcase strong performance in a wide variety of tasks, are a large step for the community. In this work, we study the visual generalization capabilities of three existing robotic foundation models and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is, therefore, expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting by DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression. To overcome this visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA, which requires the adaptation of the visual backbones during initial training, to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by 77% and 66% for grasping and lifting, respectively, in visual OOD tasks. Comprehensive evaluations, episode rollouts, and model weights are available on the ReVLA Page.
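The gradual backbone reversal described in the abstract can be illustrated with a small sketch of linear weight merging. This is not the authors' implementation: the function names (`merge_weights`, `reversal_alpha`), the linear ramp schedule, and the use of plain Python lists in place of tensors are all assumptions made for illustration. The idea shown is only that the adapted vision-backbone weights are interpolated step by step toward the original pre-trained weights, so that the backbone ends up fully reverted.

```python
def merge_weights(pretrained, adapted, alpha):
    """Linearly merge two weight dictionaries (hypothetical helper).

    alpha = 0.0 returns the adapted weights, alpha = 1.0 returns the
    pre-trained weights, values in between interpolate linearly.
    """
    if pretrained.keys() != adapted.keys():
        raise ValueError("weight dictionaries must share the same parameter names")
    return {
        name: [(1.0 - alpha) * a + alpha * p
               for a, p in zip(adapted[name], pretrained[name])]
        for name in adapted
    }


def reversal_alpha(step, total_steps):
    """Assumed linear ramp from 0 to 1 over total_steps, clamped to [0, 1]."""
    return min(1.0, max(0.0, step / total_steps))


# Toy example: one parameter vector standing in for a vision backbone.
pretrained = {"backbone.w": [1.0, 2.0]}   # original pre-trained weights
adapted = {"backbone.w": [3.0, 6.0]}      # weights after robotic training

total_steps = 4
for step in range(total_steps + 1):
    alpha = reversal_alpha(step, total_steps)
    backbone = merge_weights(pretrained, adapted, alpha)
    # In a real training loop, the rest of the model would keep training
    # here while the backbone is held at the merged weights.
    print(step, alpha, backbone["backbone.w"])
```

In this sketch the final step yields the pre-trained weights exactly, while earlier steps give the rest of the model time to adapt to the changing visual features. In a real setting the same interpolation would be applied to each tensor in the backbone's parameters.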

Paper Structure

This paper contains 17 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The OpenVLA model tuned on fractal data struggles with out-of-domain objects and in the presence of distractors (left), due to catastrophic forgetting in its DINO-v2 and SigLIP vision encoders. Our ReVLA (DS gradual) model addresses this by reverting the vision encoders to their original pre-trained weights, leading to improved overall performance across all three settings (right).
  • Figure 2: Depth regression from original DINO-v2 (top-right) and OpenVLA DINO-v2 (bottom-right) using DPT head (a) and linear probing (b). Each row: input image, ground-truth, predicted depth. The original DINO-v2 estimates depths correctly, whereas OpenVLA DINO-v2 performs poorly due to catastrophic forgetting during OpenVLA training.
  • Figure 3: Left: Both OpenVLA and ReVLA are able to identify the target object, but OpenVLA fails to grasp it. Right: OpenVLA tries to grasp the Fanta can, which is nearest to the arm. ReVLA initially hesitates but eventually grasps the tomato can.