GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model

Ali Abouzeid; Malak Mansour; Zezhou Sun; Dezhen Song

TL;DR

This work tackles the poor viewpoint generalization of Vision-Language-Action (VLA) policies by injecting a strong geometric prior: a frozen geometric foundation model, VGGT, serves as the visual backbone. A lightweight trainable projection layer maps VGGT features into the policy’s latent space, so the GPT-style decoder can generate actions without the policy having to learn 3D geometry from scratch. On the LIBERO benchmark, the method improves zero-shot success rates on novel camera poses by more than 2x, and the gains transfer to a real robot. The approach works with both continuous and discrete action heads, and it offers a practical, computationally efficient blueprint for using geometric foundation models to make robotic agents more robust and generalizable.

Abstract

Vision-Language-Action (VLA) models often fail to generalize to novel camera viewpoints, a limitation stemming from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that enhances viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we leverage a frozen, pretrained geometric vision model as a feature extractor. A trainable projection layer then adapts these geometrically-rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on LIBERO benchmark subsets, we show GeoAware-VLA achieves substantial improvements in zero-shot generalization to novel camera poses, boosting success rates by over 2x in simulation. Crucially, these benefits translate to the physical world; our model shows a significant performance gain on a real robot, especially when evaluated from unseen camera angles. Our approach proves effective across both continuous and discrete action spaces, highlighting that robust geometric grounding is a key component for creating more generalizable robotic agents.
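
The frozen-backbone-plus-trainable-projection idea described above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: `FrozenBackbone` is a hypothetical random-weight stand-in for a pretrained geometric encoder (it does not use VGGT or its API), the policy decoder is omitted, and a squared-error target on the latent stands in for whatever loss the decoder would propagate back. All class names, dimensions, and the learning rate are assumptions chosen for illustration. The point it shows is only that gradient updates reach the projection weights while the backbone weights stay fixed.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class FrozenBackbone:
    """Hypothetical stand-in for a frozen pretrained encoder (not VGGT's API)."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-1, 1) for _ in range(in_dim)]
                  for _ in range(feat_dim)]

    def __call__(self, image):
        # No update method exists: these weights are never trained.
        return [math.tanh(v) for v in matvec(self.W, image)]


class Projection:
    """Trainable linear map from backbone features to the policy latent space."""

    def __init__(self, feat_dim, latent_dim, seed=1):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
                  for _ in range(latent_dim)]

    def __call__(self, feats):
        return matvec(self.W, feats)

    def sgd_step(self, feats, grad_latent, lr):
        # For z = W f, dL/dW[i][j] = dL/dz[i] * f[j].
        for i, g in enumerate(grad_latent):
            for j, f in enumerate(feats):
                self.W[i][j] -= lr * g * f


def train_step(backbone, proj, image, target, lr=0.1):
    """One SGD step on the projection only; returns the loss before the update."""
    feats = backbone(image)        # frozen features, treated as constants
    latent = proj(feats)
    err = [z - t for z, t in zip(latent, target)]
    loss = 0.5 * sum(e * e for e in err)
    proj.sgd_step(feats, err, lr)  # gradient of 0.5*||z - t||^2 w.r.t. z is err
    return loss


backbone = FrozenBackbone(in_dim=16, feat_dim=8)
proj = Projection(feat_dim=8, latent_dim=4)
image = [i / 16 for i in range(16)]   # toy flattened "image" (assumed input)
target = [1.0, -1.0, 0.5, 0.0]        # toy latent target (assumed)
frozen_copy = [row[:] for row in backbone.W]
losses = [train_step(backbone, proj, image, target) for _ in range(5)]
```

In a deep-learning framework the same pattern would typically be expressed by running the pretrained encoder without gradient tracking and passing only the projection (and decoder) parameters to the optimizer; the details of how GeoAware-VLA does this are not given in this section.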
