GeoVLA: Empowering 3D Representations in Vision-Language-Action Models

Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, Jiale Cao

TL;DR

GeoVLA addresses the limitation of 2D-only inputs in Vision-Language-Action robotics by introducing a dual-path architecture that fuses vision-language embeddings with 3D geometry via a Point Embedding Network (PEN) and a 3D-enhanced Action Expert (3DAE). The framework leverages a diffusion Transformer with Mixture-of-Experts to jointly model visual-language and geometric cues, while a static routing strategy preserves modality balance during training. Empirical results on LIBERO and ManiSkill2 demonstrate state-of-the-art performance in simulation, and real-world experiments show robust 3D perception under height, scale, and viewpoint variations. The work significantly improves spatial reasoning and manipulation reliability in VLA systems, with strong implications for autonomous robotic control in complex environments.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich geometric information in the 3D physical world, which limits their spatial awareness and adaptability. In this paper, we present GeoVLA, a novel VLA framework that effectively integrates 3D information to advance robotic manipulation. It uses a vision-language model (VLM) to process images and language instructions, extracting fused vision-language embeddings. In parallel, it converts depth maps into point clouds and employs a customized point encoder, called Point Embedding Network, to generate 3D geometric embeddings independently. The resulting embeddings are then concatenated and processed by our proposed spatial-aware action expert, called 3D-enhanced Action Expert, which combines information from different sensor modalities to produce precise action sequences. Through extensive experiments in both simulation and real-world environments, GeoVLA demonstrates superior performance and robustness. It achieves state-of-the-art results in the LIBERO and ManiSkill2 simulation benchmarks and shows remarkable robustness in real-world tasks requiring height adaptability, scale awareness and viewpoint invariance.
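
The abstract says depth maps are converted into point clouds but does not give the conversion. A common way to do this, assumed here rather than taken from the paper, is pinhole back-projection with known camera intrinsics. The sketch below is plain Python; the function name `depth_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a row-major depth map into camera-frame 3D points.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                # Treat non-positive depth as a missing reading and skip it.
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x3 depth map (hypothetical values); one pixel has no valid depth.
depth = [
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
]
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
print(len(cloud), cloud[0])
```

In practice this loop would usually be vectorised with an array library, and the resulting points may be subsampled before being passed to a point encoder; neither step is specified in the text above.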

Paper Structure

This paper contains 21 sections, 3 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: GeoVLA adopts the two-branch parallel architecture shown in (a), which additionally extracts 3D geometric information from the point cloud to guide action generation. As a result, GeoVLA adapts robustly to height, size, and view variations in (b), and outperforms other methods in (c).
  • Figure 2: Overview of GeoVLA. RGB images with language instructions are processed by a VLM to produce vision–language features $\mathcal{F}_{VL}$, while depth maps are reprojected into point clouds and encoded by PEN as geometric features $\mathcal{F}_{P}$. Both modalities are combined in 3DAE to progressively generate robot actions.
  • Figure 3: Dual-path Point Embedding Network. In (a), the Point Embedding Network processes the point cloud through two parallel paths: a geometric feature path using large-kernel convolutions, and a positional encoding path leveraging RoPE to preserve 3D spatial information. In (b), only the selected $\mathcal{F}_{P}$ is sent to the action expert along with the visual features.
  • Figure 4: Task variation visualization. Four types of variation are conducted: (a) basket height, (b) Matryoshka doll sizes, (c) camera viewpoints, and (d) presence/absence of the sponge mat.
  • Figure 5: Simulation benchmarks. The LIBERO benchmark (a) contains various scenes and tasks, and the ManiSkill2 Pick-and-Place tasks (b) require picking up an object and moving it to a specific location marked by a green point in 3D space.
  • ...and 10 more figures
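
The Figure 3 caption mentions a positional encoding path that uses RoPE to preserve 3D spatial information, without saying how the rotation is applied to 3D coordinates. One simple possibility, an assumption here and not necessarily the authors' design, is to split the feature channels into three equal groups and apply the standard 1D rotary rotation to each group using the x, y, and z coordinate respectively. The helper names `rotate_pairs` and `rope_3d`, and the example vectors, are hypothetical.

```python
import math
from typing import List, Sequence


def rotate_pairs(vec: Sequence[float], pos: float, base: float = 10000.0) -> List[float]:
    """Standard 1D rotary embedding: rotate channel pair i by pos * base**(-i / n_pairs)."""
    assert len(vec) % 2 == 0, "channel count must be even"
    n_pairs = len(vec) // 2
    out: List[float] = []
    for i in range(n_pairs):
        angle = pos * base ** (-i / n_pairs)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend((a * c - b * s, a * s + b * c))
    return out


def rope_3d(feature: Sequence[float], xyz: Sequence[float]) -> List[float]:
    """Axis-wise rotary embedding (assumed scheme): one channel group per coordinate."""
    assert len(feature) % 6 == 0, "need three even-sized channel groups"
    group = len(feature) // 3
    out: List[float] = []
    for axis, coord in enumerate(xyz):
        out.extend(rotate_pairs(feature[axis * group:(axis + 1) * group], coord))
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 12-channel query/key features attached to two 3D points.
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.1, 0.9, -0.3, 0.7, 1.1, -1.2]
k = [1.0, 0.5, -0.5, 2.0, 0.3, 0.8, -1.1, 0.4, 0.6, -0.2, 0.9, 1.3]
p_q, p_k = (0.2, -0.1, 0.5), (0.6, 0.3, 0.1)
shift = (1.0, -2.0, 0.5)

score = dot(rope_3d(q, p_q), rope_3d(k, p_k))
score_shifted = dot(
    rope_3d(q, [a + d for a, d in zip(p_q, shift)]),
    rope_3d(k, [a + d for a, d in zip(p_k, shift)]),
)
print(score, score_shifted)
```

Because each channel pair is only rotated, feature norms are unchanged, and the dot product between two encoded features depends on the offset between their 3D positions rather than on the absolute positions. That is why the two printed scores agree up to floating-point error when both points are translated by the same `shift`.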