Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation

Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai

TL;DR

To address perceptual ambiguity and task conflict in multitask imitation learning for robotic manipulation, the authors introduce a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR grounds high-resolution visual features in language: CLIP patch features are fused by cross-attention, and a lightweight Transformer combines the result with the instruction embedding to produce a language-conditioned state representation. LMoE-DP places a sparse mixture of mixture density network (MDN) experts, each parameterizing a Gaussian mixture, inside a diffusion-based policy, and uses sequence-level gating and gradient modulation (FAMO) to stabilize training and enable task specialization. On real-robot benchmarks, the framework achieves a 79% average success rate, outperforming a strong baseline by 21%, with notable gains on tasks with multimodal action distributions and on disambiguation tasks. The results indicate that semantic grounding paired with expert specialization yields robust, efficient multi-task manipulation suitable for real-world deployment.

Abstract

Perceptual ambiguity and task conflict limit multitask robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success rate, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation.
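
As a rough illustration of the fusion step summarized in the TL;DR, the sketch below runs single-head scaled dot-product cross-attention with one global token as the query and nine local tokens as keys and values (the roles given in the Figure 1 caption), then concatenates the fused feature with a language embedding. It is not the authors' implementation: random vectors stand in for CLIP features, one token per patch and the absence of learned projections are simplifying assumptions, feature-wise concatenation is one possible reading of "concatenated", and `cross_attention_fuse` is a hypothetical name.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(global_token, local_tokens):
    """Single-head scaled dot-product attention: one query, N keys/values.

    global_token: list of d floats, used as the query.
    local_tokens: N lists of d floats, used as both keys and values.
    Assumption: raw tokens are used directly, with no learned Q/K/V projections.
    Returns (fused_feature, attention_weights).
    """
    d = len(global_token)
    scores = [
        sum(q * k for q, k in zip(global_token, tok)) / math.sqrt(d)
        for tok in local_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the local tokens, one output dimension at a time.
    fused = [
        sum(w * tok[i] for w, tok in zip(weights, local_tokens))
        for i in range(d)
    ]
    return fused, weights


rng = random.Random(0)
d = 8  # toy feature width, chosen only for illustration

# Random stand-ins for the frozen CLIP features (assumption: one token per patch).
global_token = [rng.gauss(0, 1) for _ in range(d)]
local_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(9)]
language_embedding = [rng.gauss(0, 1) for _ in range(d)]

fused, weights = cross_attention_fuse(global_token, local_tokens)

# Feature-wise concatenation with the language embedding; the lightweight
# Transformer that would consume this vector is not modeled here.
transformer_input = fused + language_embedding
```

The attention weights show how much each local patch contributes to the fused feature; in a learned model the projections would shape these weights, whereas here they depend only on the raw dot products.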

Paper Structure

This paper contains 15 sections, 7 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: The LCVR module architecture. From a high-resolution image, nine local patches and one global patch are extracted and encoded by a shared, frozen, general-purpose pre-trained CLIP ViT-B/16 (https://huggingface.co/openai/clip-vit-base-patch16). The resulting patch tokens are fused via cross-attention (global as Q, local as K/V) to synthesize a unified visual feature. This feature is then concatenated with an embedding of the language instruction and processed by a lightweight Transformer to produce the final language-conditioned representation, $z_{\text{LCVR}}$, for the downstream LMoE-DP policy.
  • Figure 2: An overview of the LMoE-DP architecture. The policy is a conditional diffusion model. A Transformer backbone, conditioned on the language-visual feature $z_{\text{LCVR}}$, robot state, and diffusion timestep $k$, processes the noisy action sequence $A^k$ to produce a feature representation $X_{\text{feat}}$. A sequence-level gating network routes this representation to a bank of specialized MDN experts using a Top-2 strategy during training and a Top-1 strategy for inference. Each active MDN expert parameterizes a Gaussian Mixture Model (GMM) to predict the noise $\epsilon_{\text{pred}}$, which is used by the DDIM scheduler to iteratively compute the cleaner action $A^{k-1}$.
  • Figure 3: Real-World Experimental Platform
  • Figure 4: The four manipulation tasks used for evaluation. Target objects are indicated by red circles.
  • Figure 5: Illustration of the 5 tasks and their 9 variants, including the specific language templates used for each task.
  • ...and 1 more figure
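
The routing described in the Figure 2 caption (one expert choice per sequence, Top-2 during training and Top-1 at inference) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's gating network: mean pooling over time steps, a single linear gate, and renormalizing the softmax over the selected experts are all choices made here for brevity, and `mean_pool`, `route`, and `gate_weights` are hypothetical names.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(x_feat):
    """Average T feature vectors into a single sequence-level summary vector."""
    steps = len(x_feat)
    width = len(x_feat[0])
    return [sum(step[i] for step in x_feat) / steps for i in range(width)]


def route(x_feat, gate_weights, k):
    """Pick k experts for the whole sequence (not per time step).

    x_feat: T lists of d floats (stand-in for the backbone output X_feat).
    gate_weights: E rows of d floats (hypothetical linear gate, no bias).
    Returns a list of (expert_index, weight) pairs, highest logit first,
    with weights renormalized over the selected experts.
    """
    pooled = mean_pool(x_feat)
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in gate_weights]
    ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    top = ranked[:k]
    probs = softmax([logits[e] for e in top])
    return list(zip(top, probs))


rng = random.Random(0)
T, d, E = 16, 8, 4  # toy sizes: sequence length, feature width, expert count

x_feat = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
gate_weights = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(E)]

train_routing = route(x_feat, gate_weights, k=2)  # Top-2 during training
infer_routing = route(x_feat, gate_weights, k=1)  # Top-1 at inference
```

Because the gate scores a pooled summary of the sequence, every time step in an action chunk is handled by the same expert or experts, which is what distinguishes sequence-level routing from token-level routing. With `k=1` the single selected expert always receives weight 1.0.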