MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models

Han Zhao, Wenxuan Song, Donglin Wang, Xinyang Tong, Pengxiang Ding, Xuelian Cheng, Zongyuan Ge

TL;DR

MoRE introduces a scalable approach to learning quadruped vision-language-action controllers: it embeds a sparse mixture of LoRA experts within a dense multimodal transformer and trains the model as a Q-function with an offline RL objective. By leveraging mixed-quality data (expert and sub-optimal trajectories), it achieves data-efficient, multi-task policy learning. In simulation and real-world experiments, MoRE outperforms baselines across six skills and demonstrates robust generalization to unseen scenarios. This work advances multi-task learning in quadruped robotics by fusing MoE-based adaptation with RL fine-tuning of VLA models using mixed data.

Abstract

Developing versatile quadruped robots that can smoothly perform various actions and tasks in real-world environments remains a significant challenge. This paper introduces a novel vision-language-action (VLA) model for quadruped robots, mixture of robotic experts (MoRE), which aims to introduce reinforcement learning (RL) for fine-tuning large-scale VLA models with a large amount of mixed-quality data. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparsely activated mixture-of-experts model. This design enables the model to effectively adapt to a wide array of downstream tasks. Moreover, we employ a reinforcement learning-based training objective to train our model as a Q-function after deeply exploring the structural properties of our tasks. Effective learning from automatically collected mixed-quality data enhances data efficiency and model performance. Extensive experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios. We further validate our method in real-world scenarios, confirming the practicality of our approach and laying a solid foundation for future research on multi-task learning in quadruped robots.

Paper Structure

This paper contains 14 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Visualization of MoRE on multiple tasks. MoRE has been verified to be robust across various tasks, commands, and scenarios, in both simulation environments and real-world deployments.
  • Figure 2: Overview of MoRE as applied to our multi-task quadruped vision-language-action task. The overview consists of four key components: (1) broad sub-optimal data combined with narrow expert data, (2) the MLLM backbone that generates action tokens from image and text embeddings, (3) the Mixture of LoRA Experts fine-tuned to adapt to different tasks, and (4) the RL objectives used for training.
  • Figure 3: The network architecture of MoRE. This figure illustrates the architecture of MoRE, which uses a decoder-only transformer (Fuyu 8B) integrated with a Mixture of LoRA Experts. Tokens from different tasks such as locomotion, navigation, and manipulation are routed through a shared feed-forward network (FFN), with each expert dynamically selected by the router to provide the most relevant token-specific adaptation. The mixture-of-experts approach allows flexible token adaptation within a single model.
  • Figure 4: The analysis of the structure of our task. This figure illustrates the whole structure and intuition behind the critical points.
  • Figure 5: Real-world experiments. These images showcase the robot successfully performing various tasks, involving navigation ("Go to"), adjusting body posture during locomotion ("Crawl"), and whole body manipulation ("Unload").
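
The routing idea described in the Figure 3 caption (a shared feed-forward weight plus router-selected LoRA experts) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the class names `LoRAExpert` and `MixtureOfLoRALayer`, the top-k gating rule, the number of experts, the rank, and the random initialization are all assumptions chosen for the example, since the page above does not specify them.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Low-rank update delta(x) = B (A x), with A: rank x d_in and B: d_out x rank.

    Assumption: both factors are random here so the demo output is non-trivial;
    LoRA as usually described initializes B to zero.
    """

    def __init__(self, d_in, d_out, rank, rng):
        self.A = random_matrix(rank, d_in, rng)
        self.B = random_matrix(d_out, rank, rng)

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class MixtureOfLoRALayer:
    """Shared (frozen) weight W plus a sparse, router-weighted sum of LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=2, seed=0):
        rng = random.Random(seed)
        self.W = random_matrix(d_out, d_in, rng)            # shared FFN weight
        self.router = random_matrix(num_experts, d_in, rng)  # one logit per expert
        self.experts = [LoRAExpert(d_in, d_out, rank, rng) for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return (expert_index, gate) pairs for the top-k experts of this token."""
        logits = matvec(self.router, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        gates = softmax([logits[i] for i in chosen])  # renormalize over chosen experts
        return list(zip(chosen, gates))

    def __call__(self, x):
        out = matvec(self.W, x)
        for index, gate in self.route(x):
            delta = self.experts[index](x)
            out = [o + gate * d for o, d in zip(out, delta)]
        return out


layer = MixtureOfLoRALayer(d_in=8, d_out=8)
token = [0.5] * 8          # stand-in for one token embedding
routing = layer.route(token)
output = layer(token)
print("selected experts and gates:", routing)
```

Only the selected experts are evaluated for a given token, which is what makes the mixture sparse even though the backbone itself stays dense; the gate values weight each expert's low-rank update before it is added to the shared output.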