MoIRA: Modular Instruction Routing Architecture for Multi-Task Robotics

Dmytro Kuzmenko, Nadiya Shvai

TL;DR

MoIRA is an architecture-agnostic modular MoE framework that coordinates existing experts through an external text-based router. It demonstrates the practical viability of modular deployment with precise, low-effort routing and provides an alternative, scalable foundation for future multi-expert robotic systems.

Abstract

Mixture-of-Experts (MoE) approaches have recently gained traction in robotics applications due to their ability to dynamically allocate computational resources and specialize sub-networks for distinct tasks or environmental contexts, enabling more efficient decision-making. Such systems often comprise sparsely activated experts combined under a single monolithic architecture and require a well-configured internal routing mechanism, which does not allow for selective low-level expert and router customization and requires additional training. We propose MoIRA, an architecture-agnostic modular MoE framework designed to coordinate existing experts with an external text-based router. MoIRA incorporates two zero-shot routing options: embedding-based similarity and prompt-driven language model inference. In our experiments, we choose large Vision-Language-Action models, gr00t-N1 and $\pi_0$, as the underlying experts, and train low-rank adapters for low-overhead inference. We evaluate MoIRA on various GR1 Humanoid tasks and LIBERO Spatial and Goal benchmarks, where it consistently outperforms generalist models and competes with other MoE pipelines. Additionally, we analyse the robustness of the proposed approach to variations in the instructions. While relying solely on textual descriptions of tasks and experts, MoIRA demonstrates the practical viability of modular deployment with precise, low-effort routing and provides an alternative, scalable foundation for future multi-expert robotic systems.

Paper Structure

This paper contains 20 sections, 2 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the MoIRA framework. The architecture decouples policy learning from task assignment by coordinating (1) a pool of independently fine-tuned specialist agents via (2) textual meta-descriptions. The (3) Router core performs zero-shot assignment of the unseen task $t$ to the optimal specialist index $r$ using either embedding similarity or prompt-driven reasoning. This design allows modular expert addition without retraining the routing mechanism.
  • Figure 2: Visualization of policy rollouts. A GR1 Humanoid embodiment (arms-only, full upper body) completes pouring and pick-and-place tasks, and a Franka Panda arm successfully executes manipulation primitives on the LIBERO Spatial benchmark.
  • Figure 3: Comparison of expert serving strategies in MoIRA. We illustrate three deployment methods: (I) Fully Instantiated maintains active copies of all experts in memory; while fast, this incurs prohibitive linear VRAM scaling. (II) Disk-Based Swapping reduces memory footprint by loading adapters from storage on-demand, but introduces a high I/O latency bottleneck ($\sim$10s). (III) Multi-LoRA Serving keeps the backbone resident and maintains multiple lightweight adapters concurrently, enabling hot-switching via pointer changes ($\sim$20ms) while avoiding full model replication and large I/O swap costs.
  • Figure 4: The initial stage of MoIRA's agentic adapter fine-tuning. We fine-tune VLA backbones (gr00t-N1, $\pi_0$-base) and derive modular specialists via LoRA: embodiment adapters for GR1 manipulation tasks and semantic adapters for LIBERO Goal and Spatial tasks. Although demonstrated on VLAs, MoIRA’s unified interface can accommodate any agentic backbone, e.g. transformer-based or model-based approaches.
  • Figure 5: MoIRA MoE routing module. Given a textual task description and specialist meta descriptions, the router assigns a specialist ID via one of two strategies. A prompt-driven language model (SmolLM2) formats all inputs into a single inference prompt with few-shot examples to infer the matching expert (orange). A lightweight embedding-based method (MiniLM) computes cosine similarity between the task and cached meta descriptions to return the closest expert (purple). Both variants support modular, language-based task routing without requiring direct observation input.
  • ...and 3 more figures
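
The embedding-based routing path described in the Figure 5 caption (cosine similarity between a task description and cached expert meta-descriptions) can be illustrated with a minimal sketch. This is not the authors' implementation: the expert descriptions are hypothetical, and a toy bag-of-words vector stands in for a sentence encoder such as MiniLM so that the example runs without external dependencies. Only the routing logic (embed, compare, take the argmax) is meant to carry over.

```python
import math
import re
from collections import Counter

# Hypothetical expert meta-descriptions (assumption: not taken from the paper).
EXPERT_DESCRIPTIONS = [
    "pour liquid from one container into another with humanoid arms",
    "pick up an object and place it at a target location",
    "open or close a drawer or cabinet door",
]


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Expert embeddings are computed once and cached, as the caption describes.
CACHED_EXPERT_EMBEDDINGS = [embed(d) for d in EXPERT_DESCRIPTIONS]


def route(task):
    """Return the index of the expert whose description is closest to the task."""
    task_vec = embed(task)
    scores = [cosine(task_vec, e) for e in CACHED_EXPERT_EMBEDDINGS]
    return max(range(len(scores)), key=scores.__getitem__)
```

With these hypothetical descriptions, `route("pour the water into the cup")` selects expert 0 and `route("close the top drawer")` selects expert 2. Because the router only reads text, a new expert can be added by appending its description and caching one more embedding, which matches the modular-addition property claimed in the Figure 1 caption.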