Grounding Robot Generalization in Training Data via Retrieval-Augmented VLMs

Jensen Gao, Dorsa Sadigh, Sandy Huang, Dhruv Shah

Abstract

Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize what kind of generalization a given evaluation setting actually requires relative to the training distribution of a policy. To work towards more precise evaluation of generalization in robotics, we propose RADAR, a scalable framework that directly compares test-time evaluation tasks to policy training data to determine what form of policy generalization is required. RADAR consists of a two-stage pipeline. First, retrieval using generalist policy embeddings identifies which training examples are relevant for a given evaluation task. Next, vision-language models (VLMs) analyze the evaluation task against the retrieved data, producing an interpretable analysis of how they compare along a variety of axes and an overall classification of what type of policy generalization is required. Through controlled experiments, we demonstrate that VLMs are effective at analyzing data for generalization, and that our retrieval step effectively identifies the examples needed to make accurate classifications with respect to the training data. Furthermore, we scale RADAR to large-scale datasets, where we observe agreement with human-defined benchmark conditions from prior work. We provide demonstrations at radar-analysis.github.io.

Paper Structure (16 sections, 1 equation, 12 figures, 3 tables)

Figures (12)

  • Figure 1: We outline the two-stage pipeline of RADAR. Given an evaluation task, we first identify relevant examples from training data using nearest neighbors with embeddings from a generalist robot policy. Next, we use vision-language models to analyze the evaluation task against the retrieved examples to categorize what generalization it represents with respect to the training data (in-distribution, visual, or behavioral).
  • Figure 2: (Left) Test example $\tau^{\text{test}}$ for the task "put the lemon on the plate". (Right) If $d^*$ (e.g., minor lighting change) is found in $\mathcal{D}$, then $\tau^{\text{test}}$ is in-distribution. Otherwise, if $d^v$ (e.g., distractor objects) is found in $\mathcal{D}$, then this is visual generalization. If neither case applies, then all $d \in \mathcal{D}$ involve different optimal behavior than $\tau^{\text{test}}$ (e.g., changed object poses), and this is behavioral generalization.
  • Figure 3: We visualize the three task families in our controlled experiments (Pick-and-Place, Unzip Lunchbag, and Fold Dress). We provide an example instance for each task (left), and variations of that task across different axes (right), with the specific change noted in parentheses.
  • Figure 4: Recall rate ($y$-axis) of $d^*$ for $\mathcal{D}_{\text{in-dist}}$ as the size of the retrieval set ($x$-axis) increases. We find that VLA-based embeddings and DINOv3 significantly outperform the other retrieval methods.
  • Figure 5: Recall rate ($y$-axis) of $d^v$ for $\mathcal{D}_{\text{visual}}$ as the size of the retrieval set ($x$-axis) increases. $\pi_{0.5}$ and GROD outperform their counterparts trained on less robot data ($\pi_{0}$, GROD (25% Data)), and are much better than the baselines that do not use robot data.
  • ...and 7 more figures
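
The captions for Figures 1, 2, 4, and 5 together describe a retrieval step, a three-way decision rule, and a recall-versus-retrieval-size evaluation. The following is a minimal Python sketch of those three pieces, not the authors' implementation. It assumes cosine similarity as the nearest-neighbor metric (the captions only say "nearest neighbors"), non-zero embedding vectors, and hypothetical label strings (`"match"` for a $d^*$-like example, `"visual_diff"` for a $d^v$-like example, `"behavior_diff"` otherwise). In RADAR these per-example comparisons come from VLMs; here they are simply passed in. All function names are illustrative.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float],
                   train_embeddings: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Indices of the k training embeddings most similar to the query."""
    order = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine_similarity(query, train_embeddings[i]),
        reverse=True,
    )
    return order[:k]


def classify_generalization(comparison_labels: Sequence[str]) -> str:
    """Decision rule from the Figure 2 caption.

    comparison_labels holds one hypothetical label per retrieved example:
    "match" (a d*-like example), "visual_diff" (a d^v-like example),
    or "behavior_diff".
    """
    if "match" in comparison_labels:
        return "in-distribution"
    if "visual_diff" in comparison_labels:
        return "visual"
    # Neither case applies: every example involves different behavior.
    return "behavioral"


def recall_at_k(queries: Sequence[Sequence[float]],
                train_embeddings: Sequence[Sequence[float]],
                target_indices: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose target example appears in the top-k set."""
    hits = 0
    for query, target in zip(queries, target_indices):
        if target in retrieve_top_k(query, train_embeddings, k):
            hits += 1
    return hits / len(queries)
```

Sweeping `k` in `recall_at_k` gives a curve of the kind plotted in Figures 4 and 5, where the target index for each query is the known $d^*$ or $d^v$ example. The ordering of checks in `classify_generalization` mirrors the caption: an in-distribution match takes priority over a visual-only difference, and behavioral generalization is the fallback.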