Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy

Ricardo Garcia; Shizhe Chen; Cordelia Schmid

TL;DR

This paper introduces GemBench, a benchmark that evaluates generalization in vision-language robotic manipulation across four progressively harder levels, spanning novel placements, rigid and articulated objects, and long-horizon tasks. It also presents 3D-LOTUS, a strong 3D-vision policy, and 3D-LOTUS++, a modular framework that uses LLMs for task planning and VLMs for object grounding to generalize to unseen tasks. Experiments on RLBench and GemBench show state-of-the-art performance on both seen and novel tasks, while ablations identify grounding and long-horizon control as key bottlenecks. The benchmark, models, and code are released to support future research and deployment.

Abstract

Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation. The benchmark, code, and trained models are available at https://www.di.ens.fr/willow/research/gembench/.
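
The division of labor described in the abstract (an LLM for task planning, a VLM for object grounding, and a motion policy for execution) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Step` type, the `run_modular_policy` function, the primitive names, and the toy stand-ins for the LLM, VLM, and motion policy are all hypothetical, chosen only to show how three such modules could be chained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]
Scene = Dict[str, Point]  # hypothetical scene: object description -> 3D position


@dataclass
class Step:
    """One sub-task in a plan (illustrative structure, not from the paper)."""
    primitive: str  # illustrative primitive name, e.g. "grasp"
    target: str     # natural-language description of the object to act on


def run_modular_policy(
    instruction: str,
    scene: Scene,
    plan: Callable[[str], List[Step]],       # stands in for an LLM task planner
    ground: Callable[[str, Scene], Point],   # stands in for a VLM object grounder
    act: Callable[[Step, Point], str],       # stands in for a motion policy
) -> List[str]:
    """Decompose the instruction, locate each target, then execute each step."""
    log = []
    for step in plan(instruction):
        location = ground(step.target, scene)
        log.append(act(step, location))
    return log


# Toy stand-ins so the sketch runs end to end without any model.
def toy_plan(instruction: str) -> List[Step]:
    # A real planner would query an LLM; this returns a fixed two-step plan.
    return [Step("grasp", "red block"), Step("place", "green target")]


def toy_ground(target: str, scene: Scene) -> Point:
    # A real grounder would query a VLM; this is a dictionary lookup.
    return scene[target]


def toy_act(step: Step, location: Point) -> str:
    # A real motion policy would predict robot actions; this only reports them.
    return f"{step.primitive} at {location}"


scene = {"red block": (0.1, 0.2, 0.0), "green target": (0.4, 0.0, 0.0)}
log = run_modular_policy(
    "put the red block on the green target", scene, toy_plan, toy_ground, toy_act
)
print(log)
```

Keeping the three stages behind separate callables means any one of them could be swapped (for example, a different grounding model) without touching the others, which is the kind of modularity the abstract attributes to 3D-LOTUS++.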

Paper Structure (25 sections, 1 equation, 8 figures, 13 tables)

Introduction
Related work
GEMBench: GEneralizable Vision-Language Robotic Manipulation Benchmark
Training tasks
Testing tasks with four levels of generalization
Method
Method
Problem formulation
3D-LOTUS policy
3D-LOTUS++ policy
Experiments
Experimental setup
Comparison with state of the arts
Ablations
Real world experiments
...and 10 more sections

Figures (8)

Figure 1: GemBench benchmark for vision-language robotic manipulation. Top: GemBench comprises 16 training tasks with 31 variations, covering seven action primitives. Bottom: The testing set includes 44 tasks with 92 variations, which are organized into four progressively more challenging levels to systematically evaluate generalization capabilities.
Figure 2: Overview of the 3D-LOTUS++ framework. It leverages the generalization capabilities of foundation models for planning and perception, and the strong action execution ability of 3D-LOTUS to perform complex tasks.
Figure 3: 3D-LOTUS architecture. It takes a point cloud and text as input to predict the next action.
Figure 4: Real-robot task variations. The top row illustrates task variations used for model training. The bottom row presents new task variations to assess the model's generalization capabilities on the real robot.
Figure 5: Automatic point removal. We use geometry information to automatically filter out irrelevant points in the scene.
...and 3 more figures
