Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai
TL;DR
To address perceptual ambiguity and task conflict in multi-task imitation learning for robotic manipulation, the authors introduce Language-Conditioned Visual Representations (LCVR) and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR grounds high-resolution visual features in language through CLIP-based cross-attention fusion and a lightweight Transformer, producing a language-conditioned state representation. LMoE-DP uses a sparse mixture of Gaussian mixture density network (MDN) experts within a diffusion-based policy, with sequence-level gating and gradient modulation (FAMO) to stabilize training and enable task specialization. On real-robot benchmarks, the framework achieves a 79% average success rate, outperforming a strong baseline by 21% and delivering notable gains on tasks with multimodal action distributions and on disambiguation tasks. The results indicate that semantic grounding paired with expert specialization yields robust, efficient multi-task manipulation suitable for real-world deployment.
Abstract
Perceptual ambiguity and task conflict limit multi-task robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success rate, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation.
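
To make the idea of a sparse expert architecture over multimodal action distributions more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that gate logits have already been computed from a language-conditioned state (once per sequence), that each expert is a diagonal Gaussian given as a `(mean, std)` pair, and that only the top-k experts contribute to the likelihood. The function names (`top_k_gate`, `gaussian_log_prob`, `mixture_nll`) and all numeric values are hypothetical; the diffusion component and gradient modulation described above are omitted.

```python
import math


def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax-normalize them.

    Returns {expert_index: weight}; experts outside the top k get no weight.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian evaluated at `action`."""
    return sum(
        -0.5 * ((a - mu) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for a, mu, s in zip(action, mean, std)
    )


def mixture_nll(action, gate_logits, experts, k=2):
    """Negative log-likelihood of `action` under the sparse Gaussian mixture."""
    weights = top_k_gate(gate_logits, k)
    # log-sum-exp over the selected experts only
    terms = [
        math.log(w) + gaussian_log_prob(action, *experts[i])
        for i, w in weights.items()
    ]
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))


# Hypothetical example: three experts over a 2-D action, two of them active.
experts = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([1.0, 1.0], [0.5, 0.5]),
    ([-1.0, 0.0], [1.0, 1.0]),
]
gate_logits = [2.0, 0.5, -1.0]  # assumed output of a language-conditioned gate
weights = top_k_gate(gate_logits, k=2)
loss = mixture_nll([0.9, 1.1], gate_logits, experts, k=2)
```

In this sketch, sparsity comes only from the top-k selection: the third expert receives no weight and therefore no gradient signal from this sample, which is one simple way different experts could end up specializing in different tasks.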
