AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

Zhifeng Rao, Wenlong Chen, Lei Xie, Xia Hua, Dongfu Yin, Zhen Tian, F. Richard Yu

TL;DR

This work addresses the limited 3D reasoning of Vision-Language-Action models by introducing AugVLA-3D, which leverages monocular depth estimation (VGGT) to produce geometry-aware 3D features from 2D RGB data. A lightweight Action Assistant regularizer aligns these depth priors with downstream control objectives and injects the resulting geometric information into the VLA backbone without destabilizing the pretrained 2D representations. The approach enables scalable use of large-scale 2D datasets while improving generalization in 3D-rich manipulation tasks, as demonstrated by real-world dexterous-hand experiments and RoboCasa simulations. Results show improved action prediction accuracy and robustness in geometrically ambiguous scenarios, highlighting depth-driven data augmentation as an effective way to bridge 2D observations and 3D-aware decision-making in robotics.

Abstract

Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches rely primarily on vision-language models (VLMs) trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient use of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module, the Action Assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also yields superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
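
The fusion step described in the abstract (depth-derived 3D features combined with 2D visual tokens) can be pictured with a minimal sketch. The abstract does not specify how the fusion is done, so everything here is an assumption made for illustration: the function names `project` and `fuse_tokens`, the toy shapes, and the choice to linearly project the 3D tokens to the 2D token width and append them to the token sequence. The actual injection module may work differently.

```python
# Hypothetical sketch: fuse depth-derived 3D tokens with 2D visual tokens.
# The project-then-concatenate design, all names, and all shapes are
# assumptions for illustration, not the paper's stated method.


def project(tokens, weight):
    """Linearly project each token (a list of floats).

    `weight` is a matrix of shape [in_dim][out_dim].
    """
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


def fuse_tokens(tokens_2d, tokens_3d, weight_3d):
    """Project 3D tokens to the 2D token width, then append them to the sequence."""
    projected = project(tokens_3d, weight_3d)
    width = len(tokens_2d[0])
    if any(len(tok) != width for tok in projected):
        raise ValueError("projected 3D tokens must match the 2D token width")
    return tokens_2d + projected


tokens_2d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two 2D tokens, width 3
tokens_3d = [[2.0, 4.0]]                         # one depth-derived token, width 2
weight_3d = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]   # maps width 2 -> width 3

fused = fuse_tokens(tokens_2d, tokens_3d, weight_3d)
print(len(fused), fused[-1])
```

In this toy setup the 2D tokens pass through unchanged and the 3D token is only appended, which is one simple way to add geometric information without altering the pretrained 2D features themselves.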

Paper Structure

This paper contains 11 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Architecture comparison across methods. (a) GR00T [bjorck2025gr00t]: uses only 2D visual features, without explicit 3D reasoning. (b) PointVLA [li2025pointvla]: introduces LiDAR-based point clouds but relies on specialized 3D sensors. (c) AugVLA-3D (ours): leverages depth estimation to inject 3D structural features without dedicated 3D sensors, enabling scalable training and stronger 3D generalization.
  • Figure 2: Architecture of the proposed AugVLA-3D framework. The overall design largely follows the GR00T backbone, with a dedicated 3D feature injection module added to enhance the Action Expert with depth-derived geometric information. To keep the injected 3D features aligned with task objectives without excessive computational overhead, we further design an Action Assistant. This module shares the Action Expert's structure but uses a lightweight parameterization, constraining the 3D features at minimal additional cost.
  • Figure 3: Illustrations of the five experimental tasks: Task 1: Place the wooden blocks into the corresponding plates; Task 2: Cover the duck toy with a tape; Task 3: Wipe the plates with a dishcloth; Task 4: Take out the cup and put in the block; Task 5: Place the cup and pour water into it.
  • Figure 4: Experimental results in real-world scenarios.
  • Figure 5: Comparative experimental results for the AugVLA-3D and GR00T models in complex manipulation scenarios with dexterous hands. The experiment involved two typical manipulation tasks: "pick up the cup and put it in a drawer, then close it" (rows 1-2) and "pick up a loaf of bread and put it in a pot" (rows 3-4). Each task consisted of six key action steps. To ensure fairness, the object layout, lighting conditions, and task instructions were kept consistent across all experimental scenarios. The results show that the AugVLA-3D model, which incorporates 3D spatial features, generally outperforms the GR00T model in object positioning accuracy, motion trajectory smoothness, and task completion efficiency, validating the effectiveness of 3D features in improving robot manipulation intelligence.
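
The Figure 2 caption describes the Action Assistant as a lightweight module that constrains the injected 3D features. One common way to express such a constraint is an auxiliary loss added to the main action loss, sketched below. This is only a sketch under stated assumptions: the mean-squared-error losses, the weighting factor `aux_weight`, and the function names `mse` and `total_loss` are hypothetical and are not taken from the paper's equations, which are not reproduced here.

```python
# Hypothetical sketch of a main loss plus an auxiliary regularizer term.
# MSE is a stand-in loss; `aux_weight` and all names are assumptions,
# not the paper's stated objective.


def mse(pred, target):
    """Mean squared error between two equal-length action vectors."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(expert_action, assistant_action, target_action, aux_weight=0.1):
    """Main (Action Expert) loss plus a weighted auxiliary (Action Assistant) loss."""
    main = mse(expert_action, target_action)
    aux = mse(assistant_action, target_action)
    return main + aux_weight * aux


target = [1.0, 0.0]
expert_pred = [0.5, 0.0]     # toy prediction from the full model
assistant_pred = [0.0, 0.0]  # toy prediction from the lightweight auxiliary head

loss = total_loss(expert_pred, assistant_pred, target, aux_weight=0.5)
print(loss)
```

With this kind of objective, gradients from the auxiliary term push the 3D features toward being useful for action prediction, while the small weight keeps the main objective dominant.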