GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot

Wenxuan Song; Han Zhao; Pengxiang Ding; Can Cui; Shangke Lyu; Yaning Fan; Donglin Wang

TL;DR

GeRM addresses data-efficiency and multi-task learning for quadruped robotics by applying offline reinforcement learning to demonstrations and sub-optimal data, processed through a Vision-Language-Action model built on a sparse Mixture-of-Experts Transformer with top-$2$ expert routing and $d_{\mathcal{A}}=12$ action dimensions. The approach integrates a conservative Q-learning objective to curb out-of-distribution actions and demonstrates that a high-capacity MoE model can yield generalist capabilities with modest active compute. A novel auto-collected QUARD-Auto dataset (257k–258k trajectories) complements human demonstrations, enabling efficient training and emergent skill development in 99 tasks. Empirical results show GeRM outperforms imitation-learning baselines and prior offline RL methods, highlights data-utilization advantages, and reveals emergent planning behaviors, signaling a scalable direction for real-world, multi-task quadruped learning.
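To make the conservative Q-learning idea in the TL;DR concrete, here is a minimal sketch of the standard CQL regularizer for a single state with discrete actions. It is not taken from the GeRM code: the function names, the flat list of Q-values, the squared TD term, and the weight `alpha` are all assumptions made for illustration. The penalty is the log-sum-exp of all Q-values minus the Q-value of the action that appears in the dataset, so it grows when actions outside the data receive high Q-values.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values, data_action):
    """Conservative term for one state with discrete actions.

    q_values: Q(s, a) for every candidate action a (hypothetical layout).
    data_action: index of the action actually recorded in the dataset.
    """
    return logsumexp(q_values) - q_values[data_action]


def cql_loss(q_values, data_action, td_target, alpha=1.0):
    """Squared TD error plus alpha times the conservative penalty.

    alpha is an assumed trade-off weight, not a value from the paper.
    """
    td_error = q_values[data_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, data_action)
```

The penalty is never negative, and it is small when the dataset action already has the largest Q-value, which is the sense in which the objective discourages out-of-distribution actions.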

Abstract

Multi-task robot learning is important for tackling diverse and complex scenarios. However, current approaches are hindered by performance issues and by the difficulty of collecting training datasets. In this paper, we propose GeRM (Generalist Robotic Model). We use offline reinforcement learning to optimize data utilization, learning from both demonstrations and sub-optimal data and thus surpassing the limitations of human demonstrations. We then employ a Transformer-based Vision-Language-Action (VLA) network to process multi-modal inputs and output actions. By introducing a Mixture-of-Experts structure, GeRM allows faster inference with higher overall model capacity, which resolves the issue of limited RL parameters and improves multi-task performance while controlling computational cost. Through a series of experiments, we demonstrate that GeRM outperforms other methods across all tasks and validate its efficiency in both training and inference. We also uncover its potential to acquire emergent skills. In addition, we contribute the QUARD-Auto dataset, collected automatically to support our training approach and to foster advances in multi-task quadruped robot learning. This work presents a new paradigm for reducing the cost of collecting robot data and for driving progress in the multi-task learning community. The project page and video are available at https://songwxuan.github.io/GeRM/ .
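The claim that a Mixture-of-Experts structure raises capacity without raising per-token compute can be illustrated with a small sketch of top-2 routing. This is an assumption-laden illustration, not the authors' implementation: `top2_moe`, the list-based token, and the choice to renormalise the gate with a softmax over only the two selected experts are all hypothetical choices made here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(token, gate_logits, experts):
    """Route one token to its two highest-scoring experts.

    token: list of floats.
    gate_logits: one router score per expert (assumed precomputed).
    experts: list of callables mapping a token to an output of equal length.
    """
    # Indices of the two largest router scores.
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top2 = order[:2]
    # Renormalise the gate weights over the selected experts only.
    weights = softmax([gate_logits[i] for i in top2])
    out = [0.0] * len(token)
    for w, i in zip(weights, top2):
        y = experts[i](token)  # only 2 of len(experts) experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out, top2
```

With, say, eight experts, each token still runs through only two of them, so the parameter count grows with the number of experts while the active compute per token stays roughly constant.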

Paper Structure (11 sections, 9 equations, 6 figures, 3 tables)

This paper contains 11 sections, 9 equations, 6 figures, 3 tables.

Introduction
Related Work
Preliminaries
Methods
Auto-collected Quadruped Robot Datasets
Mixture-of-Experts Network
Vision-Language-Action Model in Reinforcement Learning
Experiments
Experiments Setup
Experimental Results
Conclusion, Limitations and Future Work

Figures (6)

Figure 1: Overview of GeRM. We take both demonstration and sub-optimal data as input. Then the images and instructions are tokenized and sent into the mixture-of-experts Transformer Decoder to generate action tokens. They are finally de-tokenized into discretized robot commands. The actions are used for RL objectives when training.
Figure 2: Emergent Skills. An example of an emergent skill: dynamic adaptive path planning. We study these challenging scenarios in detail in Section \ref{Q4}.
Figure 3: Statistics of QUARD-Auto. The bottom parts denote the successful tasks; the top parts denote the failed tasks.
Figure 4: Decoder Structure. Left: Conventional Transformer Decoder; Right: GeRM Transformer Decoder with MoE Module.
Figure 5: Training dataset. The ratio of optimal to sub-optimal trajectories used in training. Trajectory counts in the graph are in units of $K = 10^3$.
...and 1 more figures
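
The Figure 1 caption says that action tokens are de-tokenized into discretized robot commands. A minimal sketch of one common way to do this, uniform binning per action dimension, is given below. The bin count of 256, the range [-1, 1], and the function names are assumptions for illustration only; the source states only that the action space has 12 dimensions.

```python
def tokenize(actions, low=-1.0, high=1.0, num_bins=256):
    """Map continuous commands to integer bin indices (assumed uniform bins)."""
    width = (high - low) / num_bins
    tokens = []
    for a in actions:
        clipped = min(max(a, low), high)  # keep the value inside the range
        tokens.append(min(int((clipped - low) / width), num_bins - 1))
    return tokens


def detokenize(tokens, low=-1.0, high=1.0, num_bins=256):
    """Map integer action tokens back to continuous commands (bin centres)."""
    width = (high - low) / num_bins
    actions = []
    for t in tokens:
        if not 0 <= t < num_bins:
            raise ValueError(f"token {t} outside [0, {num_bins})")
        actions.append(low + (t + 0.5) * width)
    return actions
```

Under these assumptions, a 12-dimensional command becomes 12 tokens, and the round trip loses at most half a bin width per dimension.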
