UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

Manish Kumar Govind; Dominick Reilly; Pu Wang; Srijan Das

TL;DR

This work introduces UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. It also proposes UniLARN, a unified latent action learning framework that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions.

Abstract

Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
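The inverse and forward dynamics objectives mentioned in the abstract can be illustrated with a toy sketch. It is not the UniLARN implementation: the feature vectors, the four-entry codebook, and every function name below are hypothetical, the "encoder" is a plain feature difference, and cross-modal interaction is not modeled. It only shows how a pair of consecutive observations might be reduced to a discrete latent action token drawn from a codebook shared by RGB and depth, and how such a token could then act as a pseudo-label.

```python
def quantize(vec, codebook):
    """Return (index, code) of the codebook entry nearest to vec (squared L2)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]


def inverse_dynamics(feat_t, feat_next):
    """Toy inverse dynamics: treat the feature change as the continuous latent action."""
    return [b - a for a, b in zip(feat_t, feat_next)]


def forward_dynamics(feat_t, code):
    """Toy forward dynamics: predict the next features by applying the quantized action."""
    return [a + c for a, c in zip(feat_t, code)]


def latent_action_token(feat_t, feat_next, codebook):
    """Return the discrete token and the forward-prediction squared error."""
    z = inverse_dynamics(feat_t, feat_next)
    idx, code = quantize(z, codebook)
    pred = forward_dynamics(feat_t, code)
    err = sum((p - n) ** 2 for p, n in zip(pred, feat_next))
    return idx, err


# Hypothetical codebook shared by both modalities (assumption, not from the paper).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Hypothetical 2-D features for two consecutive frames of each modality.
rgb_t, rgb_next = [0.2, 0.5], [1.1, 0.6]
depth_t, depth_next = [0.4, 0.1], [0.4, 1.0]

rgb_token, rgb_err = latent_action_token(rgb_t, rgb_next, codebook)
depth_token, depth_err = latent_action_token(depth_t, depth_next, codebook)
print(rgb_token, depth_token)
```

In this toy setup the two modalities map to different tokens for the same time step, which loosely mirrors the idea that depth can carry motion cues that RGB appearance does not. A learned model would replace the feature difference and the additive prediction with trained networks.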

Paper Structure (16 sections, 10 equations, 5 figures, 6 tables)

INTRODUCTION
RELATED WORK
Vision-Language-Action Models.
Latent Action Representations.
Enhancing VLAs with Depth.
Proposed Method
UniLARN: Unified Latent Action Learning
Unified Latent Pretraining.
Action Fine-Tuning
Experiments
Implementation details
CALVIN Experiments
Real-World Experiments
Computational Analysis
Ablation studies
...and 1 more sections

Figures (5)

Figure 1: Overview of UniLACT's three stages: (1) UniLARN learns modality-specific (RGB/depth) and unified discrete latent actions from pairs of RGB-D frames within a shared latent space. (2) UniLACT is pretrained with cross-modal autoregressive latent-token prediction to capture complementary priors from RGB appearance and depth geometry. (3) UniLACT is fine-tuned on action-labeled trajectories to map predicted latent tokens to executable robot actions.
Figure 2: Task-wise success comparison on CALVIN between RGB and unified latent action representations. Top: tasks where RGB-based latents perform better; Bottom: tasks where unified latents (RGB+depth) perform better.
Figure 3: Real-world experimental setup. The setup consists of an xArm7 manipulator equipped with a two-fingered parallel gripper and a workspace-facing camera mounted on the table.
Figure 4: Illustration of Task T1: "Pick up the carrot and place it in the bowl." Top row: Moto fails to place the carrot inside the bowl and pushes the bowl out of the workspace. Bottom row: UniLACT successfully completes the task.
Figure 5: Illustration of Task T2: "Move the eggplant near the banana." Top row: Moto approaches the eggplant but fails to grasp it and collides with the workspace. Bottom row: UniLACT successfully grasps the eggplant and moves it near the banana.