UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

Manish Kumar Govind, Dominick Reilly, Pu Wang, Srijan Das

TL;DR

This work introduces UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors, and proposes UniLARN, a unified latent action learning framework that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions.

Abstract

Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
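The abstract describes latent actions learned with inverse and forward dynamics objectives in an embedding space shared by RGB and depth. The following is a minimal sketch of that general idea, not the authors' implementation. Random linear maps stand in for learned networks, and every name and size (`FEAT_DIM`, `LATENT_DIM`, `CODEBOOK_SIZE`, `latent_action_step`) is hypothetical. Nearest-neighbor codebook lookup is assumed as the discretization step, and the explicit cross-modal interaction modeling mentioned in the abstract is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only for illustration.
FEAT_DIM = 16       # per-frame feature size
LATENT_DIM = 8      # latent action size
CODEBOOK_SIZE = 32  # number of discrete latent actions

# Random linear maps stand in for learned networks (assumption).
W_inv = rng.normal(size=(2 * FEAT_DIM, LATENT_DIM)) * 0.1
W_fwd = rng.normal(size=(FEAT_DIM + LATENT_DIM, FEAT_DIM)) * 0.1
# One codebook shared by both modalities, mimicking a shared embedding space.
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))


def inverse_dynamics(feat_t, feat_next):
    """Infer a continuous latent action from a pair of frame features."""
    return np.concatenate([feat_t, feat_next], axis=-1) @ W_inv


def quantize(z):
    """Snap each latent to its nearest codebook entry; return (token ids, codes)."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=-1)
    return tokens, codebook[tokens]


def forward_dynamics(feat_t, code):
    """Predict next-frame features from current features and the latent action."""
    return np.concatenate([feat_t, code], axis=-1) @ W_fwd


def latent_action_step(feat_t, feat_next):
    """Run inverse dynamics, quantization, and forward dynamics for one batch."""
    z = inverse_dynamics(feat_t, feat_next)
    tokens, code = quantize(z)
    pred_next = forward_dynamics(feat_t, code)
    # Reconstruction error that a training loop would minimize.
    loss = float(np.mean((pred_next - feat_next) ** 2))
    return tokens, loss


batch = 4
rgb_t = rng.normal(size=(batch, FEAT_DIM))
rgb_next = rng.normal(size=(batch, FEAT_DIM))
depth_t = rng.normal(size=(batch, FEAT_DIM))
depth_next = rng.normal(size=(batch, FEAT_DIM))

# The discrete tokens play the role of pseudo-labels for later pretraining.
rgb_tokens, rgb_loss = latent_action_step(rgb_t, rgb_next)
depth_tokens, depth_loss = latent_action_step(depth_t, depth_next)
```

In this sketch the token ids are the only output a downstream model would consume, which is why the forward dynamics loss is computed on the quantized code rather than on the continuous latent.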

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of UniLACT's three stages: (1) UniLARN learns modality-specific (RGB/depth) and unified discrete latent actions from pairs of RGB-D frames within a shared latent space. (2) UniLACT is pretrained with cross-modal autoregressive latent-token prediction to capture complementary priors from RGB appearance and depth geometry. (3) UniLACT is fine-tuned on action-labeled trajectories to map predicted latent tokens to executable robot actions.
  • Figure 2: Task-wise success comparison on CALVIN between RGB and unified latent action representations. Top: tasks where RGB-based latents perform better; Bottom: tasks where unified latents (RGB+depth) perform better.
  • Figure 3: Real-world experimental setup. The setup consists of an xArm7 manipulator equipped with a two-fingered parallel gripper and a workspace-facing camera mounted on the table.
  • Figure 4: Illustration of Task T1: "Pick up the carrot and place it in the bowl." Top row: Moto fails to place the carrot inside the bowl and pushes the bowl out of the workspace. Bottom row: UniLACT successfully completes the task.
  • Figure 5: Illustration of Task T2: "Move the eggplant near the banana." Top row: Moto approaches the eggplant but fails to grasp it and collides with the workspace. Bottom row: UniLACT successfully grasps the eggplant and moves it near the banana.