Language-Grounded Decoupled Action Representation for Robotic Manipulation

Wuding Weng, Tongshu Wu, Liucheng Chen, Siyu Xie, Zheng Wang, Xing Xu, Jingkuan Song, Heng Tao Shen

Abstract

The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives (translation, rotation, and gripper control), providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive weighting strategy, inspired by curriculum learning, dynamically balances contrastive and imitation objectives for stable and effective training. Extensive experiments on simulated benchmarks (LIBERO and MimicGen) and real-world demonstrations validate that LaDA achieves strong performance and generalizes effectively to unseen or related tasks.
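
To make the two training ingredients named in the abstract more concrete, the sketch below shows one possible form of a soft-label contrastive loss combined with a scheduled weight. It is an illustration only, not the paper's formulation: the function names, the temperatures, the use of text-to-text similarity as the source of soft targets, and the linear schedule (including its direction) are all assumptions introduced here.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _softmax(scores, temperature):
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_contrastive_loss(action_emb, text_emb, tau_pred=0.1, tau_target=0.1):
    """Cross-entropy between soft targets and action-to-text predictions.

    Assumption: soft targets come from similarity among the text embeddings,
    so semantically close primitives share probability mass instead of being
    treated as hard negatives.
    """
    actions = [_normalize(v) for v in action_emb]
    texts = [_normalize(v) for v in text_emb]
    loss = 0.0
    for i, action in enumerate(actions):
        targets = _softmax([_dot(texts[i], t) for t in texts], tau_target)
        preds = _softmax([_dot(action, t) for t in texts], tau_pred)
        loss -= sum(p * math.log(q) for p, q in zip(targets, preds))
    return loss / len(actions)


def contrastive_weight(step, total_steps):
    """Hypothetical linear curriculum: weight decays from 1.0 to 0.0."""
    return min(1.0, max(0.0, 1.0 - step / total_steps))


def combined_loss(imitation_loss, contrastive_loss, step, total_steps):
    """Blend the two objectives with the scheduled weight."""
    w = contrastive_weight(step, total_steps)
    return w * contrastive_loss + (1.0 - w) * imitation_loss
```

With one-hot targets this reduces to a standard InfoNCE-style loss; the soft targets are what allow similar primitives from different tasks to be pulled together rather than pushed apart.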

Paper Structure (21 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Comparison of representative paradigms in vision-language-action learning. Existing approaches either entangle perception and control in end-to-end VLAs, rely on latent action embeddings without explicit semantics, or use discrete language-conditioned primitives that lack fine-grained motion grounding. Our LaDA framework bridges this gap by leveraging language as a semantic bridge to decouple and align vision, language, and action representations through soft-label contrastive learning, enabling semantically grounded and generalizable manipulation.
  • Figure 2: Overview of the proposed LaDA framework. LaDA leverages language as a semantic bridge to connect high-level vision–language understanding with low-level control. It decomposes continuous 7-DoF end-effector actions into interpretable primitives—translation, rotation, and gripper control—and encodes them within a shared semantic embedding space. Semantic-guided soft-label contrastive learning aligns multimodal representations across tasks, while an adaptive weighting strategy dynamically balances imitation and contrastive objectives, enabling efficient cross-task transfer and robust generalization.
  • Figure 3: Example tasks from the simulation environments. (Top) Sample tasks from the LIBERO benchmark (Liu et al., 2023), illustrating language-conditioned manipulation scenarios. (Bottom) Example tasks from MimicGen (Mandlekar et al., 2023), demonstrating contact-rich manipulation skills.
  • Figure 4: Generalization evaluation of LaDA on novel and semantically related tasks.
  • Figure 5: Comparison of average success rates on MimicGen tasks (Stack, StackThree, Threading) under single-task and multi-task training. LaDA consistently outperforms CLIP-RT, with larger gains in the multi-task setting.
  • ...and 2 more figures
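
The decomposition mentioned in the Figure 2 caption, splitting a continuous 7-DoF end-effector action into translation, rotation, and gripper primitives, can be sketched as follows. This is a minimal illustration under stated assumptions: the component ordering `[dx, dy, dz, droll, dpitch, dyaw, gripper]`, the convention that a positive gripper value means "close", the dead-band threshold, and the phrase templates are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ActionPrimitives:
    translation: Tuple[float, float, float]
    rotation: Tuple[float, float, float]
    gripper: float


def decompose(action: Sequence[float]) -> ActionPrimitives:
    """Split a 7-DoF action into three primitives (assumed ordering)."""
    if len(action) != 7:
        raise ValueError(f"expected 7 values, got {len(action)}")
    dx, dy, dz, droll, dpitch, dyaw, grip = (float(v) for v in action)
    return ActionPrimitives((dx, dy, dz), (droll, dpitch, dyaw), grip)


def _dominant_axis(values, names, eps) -> Optional[str]:
    """Return the signed name of the largest component, or None if all are tiny."""
    idx = max(range(len(values)), key=lambda i: abs(values[i]))
    if abs(values[idx]) < eps:
        return None
    sign = "+" if values[idx] > 0 else "-"
    return sign + names[idx]


def describe(p: ActionPrimitives, eps: float = 1e-3) -> Dict[str, str]:
    """Map each primitive to a short phrase (hypothetical templates)."""
    move = _dominant_axis(p.translation, ("x", "y", "z"), eps)
    turn = _dominant_axis(p.rotation, ("roll", "pitch", "yaw"), eps)
    return {
        "translation": f"translate {move}" if move else "no translation",
        "rotation": f"rotate {turn}" if turn else "no rotation",
        # Assumption: positive gripper command closes, otherwise opens.
        "gripper": "close gripper" if p.gripper > 0 else "open gripper",
    }
```

For example, `describe(decompose([0.1, 0, 0, 0, 0, -0.2, 1.0]))` yields one phrase per primitive, which a text encoder could then embed alongside the task instruction.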