GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model

Ali Abouzeid, Malak Mansour, Zezhou Sun, Dezhen Song

TL;DR

This work tackles the poor viewpoint generalization of Vision-Language-Action (VLA) policies by injecting a strong geometric prior: a frozen geometric foundation model, VGGT, serves as the visual backbone. A lightweight trainable projection layer maps VGGT features into the policy’s latent space, so the GPT-style decoder can generate actions without the policy having to learn 3D geometry from scratch. Results on the LIBERO benchmark show more than 2x improvement in zero-shot generalization to novel camera poses, and the gains transfer to a real robot. The approach works with both continuous and discrete action heads, and it offers a practical, computationally efficient blueprint for using geometric foundation models to make robotic agents more robust and generalizable.

Abstract

Vision-Language-Action (VLA) models often fail to generalize to novel camera viewpoints, a limitation stemming from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that enhances viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we leverage a frozen, pretrained geometric vision model as a feature extractor. A trainable projection layer then adapts these geometrically-rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on LIBERO benchmark subsets, we show GeoAware-VLA achieves substantial improvements in zero-shot generalization to novel camera poses, boosting success rates by over 2x in simulation. Crucially, these benefits translate to the physical world; our model shows a significant performance gain on a real robot, especially when evaluated from unseen camera angles. Our approach proves effective across both continuous and discrete action spaces, highlighting that robust geometric grounding is a key component for creating more generalizable robotic agents.
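
The split between a frozen feature extractor and a trainable projection layer can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: `FrozenBackbone` is a toy fixed linear map, not the real VGGT model or its API, and the `target` latent stands in for the training signal that would actually arrive through the policy decoder and its action loss. The only point shown is that a training step updates the projection weights and leaves the backbone weights untouched.

```python
import copy
import random


class FrozenBackbone:
    """Toy stand-in for a frozen pretrained encoder (NOT the real VGGT API)."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(feat_dim)]

    def __call__(self, pixels):
        # Fixed linear map; these weights are never updated.
        return [sum(w * p for w, p in zip(row, pixels)) for row in self.weights]


class TrainableProjection:
    """Linear layer that adapts backbone features to the policy latent size."""

    def __init__(self, feat_dim, latent_dim, seed=1):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
                        for _ in range(latent_dim)]

    def __call__(self, feats):
        return [sum(w * f for w, f in zip(row, feats)) for row in self.weights]

    def update(self, feats, grad_latent, lr):
        # Gradient of the loss w.r.t. W[i][j] is grad_latent[i] * feats[j].
        for i, g in enumerate(grad_latent):
            for j, f in enumerate(feats):
                self.weights[i][j] -= lr * g * f


def squared_error(latent, target):
    return 0.5 * sum((z - t) ** 2 for z, t in zip(latent, target))


def train_step(backbone, projection, pixels, target, lr):
    """One update that touches only the projection; the backbone stays frozen."""
    feats = backbone(pixels)
    latent = projection(feats)
    grad_latent = [z - t for z, t in zip(latent, target)]
    projection.update(feats, grad_latent, lr)
    return squared_error(latent, target)  # loss measured before the update


backbone = FrozenBackbone(in_dim=8, feat_dim=16)
projection = TrainableProjection(feat_dim=16, latent_dim=4)
pixels = [0.1 * k for k in range(8)]   # hypothetical flattened image
target = [1.0, -1.0, 0.5, 0.0]         # hypothetical latent target (assumption)

frozen_before = copy.deepcopy(backbone.weights)
feats = backbone(pixels)
lr = 0.5 / sum(f * f for f in feats)   # step scaled by feature norm for stability
loss_before = train_step(backbone, projection, pixels, target, lr)
loss_after = squared_error(projection(backbone(pixels)), target)
```

In this toy setting the loss drops after one step while `backbone.weights` stays identical to `frozen_before`, which mirrors the division of labor the abstract describes: the geometric features are reused as-is and only the adapter is learned.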

Paper Structure

This paper contains 23 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (Top) Illustration of the training and testing viewpoints in the LIBERO dataset. GeoAware-VLA demonstrates zero-shot generalization across novel viewpoints. (Mid) GeoAware-VLA achieves superior generalization to novel viewpoints, outperforming the state-of-the-art by over 30% in success rate on the LIBERO dataset. (Bottom) Key intermediate steps during successful real-world deployment.
  • Figure 2: Overall architecture of GeoAware-VLA. The model takes multi-view images as input and uses VGGT to extract view-robust features, from which robot actions are generated.
  • Figure 3: A single episode from our experimental setup, showing how different viewpoints (rows) change over time (columns, left to right). The top row shows the wrist camera images, while the second row displays the top-down viewpoint. The model is trained exclusively on these two viewpoints. The bottom three rows show three novel, unseen viewpoints used for evaluating the model.
  • Figure 4: Real-Robot Setup. (a) Illustration of Hardware. (b) The viewpoint used to train all policies. (c) The viewpoint used for our zero-shot evaluation experiments.
  • Figure 5: (Top) A plot comparing the real-world performance of our GeoAware-VLA against the baseline, highlighting our model's improvement on both seen and novel viewpoints. (Bottom) The initial and final states for the real-world tasks evaluated.