MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

Tianyu Xu, Jiawei Chen, Jiazhao Zhang, Wenyao Zhang, Zekun Qi, Minghan Li, Zhizheng Zhang, He Wang

TL;DR

The paper tackles robust visual navigation with RGB data by addressing the limitations of single-view perception and sim-to-real gaps. It introduces MM-Nav, a multi-view VLA model that learns from multiple specialized RL experts (reaching, squeezing, avoiding) via a two-stage training process: offline expert-based finetuning and online capability-balanced teacher-student refinement. Using a 360-degree surround-view encoding and a large language model-guided action predictor, MM-Nav delivers continuous velocity commands at about 7 Hz and demonstrates strong generalization in both synthetic and real-world environments, often outperforming the individual RL teachers. The work shows that distilling multi-capability expertise through a capable VLA policy yields improved navigation performance and robust sim-to-real transfer, offering a scalable blueprint for general-purpose visual navigation agents.

Abstract

Visual navigation policies are widely regarded as a promising direction, as they mimic humans by navigating from egocentric visual observations. However, the optical information in visual observations is difficult to model explicitly, unlike LiDAR point clouds or depth maps, and therefore requires intelligent models and large-scale data. To this end, we propose to leverage the intelligence of the Vision-Language-Action (VLA) model to learn diverse navigation capabilities from synthetic expert data in a teacher-student manner. Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360° observations) based on pretrained large language models and visual foundation models. For large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three challenging tailor-made environments for different navigation capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA model using data collected online from RL experts, where the training ratio is dynamically balanced based on performance on individual capabilities. Through extensive experiments in synthetic environments, we demonstrate that our model achieves strong generalization capability. Moreover, we find that our student VLA model outperforms the RL teachers, demonstrating the synergistic effect of integrating multiple capabilities. Extensive real-world experiments further confirm the effectiveness of our method.

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Real-world demonstration and hardware setup of MM-Nav. Our method demonstrates strong navigation capability in challenging environments, including avoiding a thin rope and squeezing between crowded and transparent objects.
  • Figure 2: Pipeline of MM-Nav. Our proposed teachers-student training pipeline. Independent RL teachers are each trained in a different scene to acquire a distinct capability, and their knowledge is distilled into the VLA student. The student is then deployed in the capability-specific simulation scenes and iteratively fine-tuned.
  • Figure 3: Training Strategy of MM-Nav. Stage 1: We first train RL experts in different environments and collect successful trajectories for VLA fine-tuning. Stage 2: We then collect RL expert data online based on VLA observations and dynamically balance the training data ratio.
  • Figure 4: Plot of online training iterations. (Left): the performance gaps $g_{Cap.}$ between the VLA model and the RL experts. (Right): the proportions of our online collected data per capability. After the fourth iteration, the VLA model outperforms all experts (in WTT), resulting in an equal data ratio.
  • Figure 5: Real-world Experiments. (Left): As the robot advances, an operator suddenly pushes a wheelchair into its path. (Right): Two human-held thin rods are crossed across the corridor.
  • ...and 1 more figures
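
The capability-balanced data ratio described in the abstract and in the Figure 4 caption can be illustrated with a minimal Python sketch. The exact balancing rule is not given in this summary, so everything here is an assumption made for illustration: the function name `balance_ratios`, the sign convention (a positive gap means the student still lags the expert on that capability), and the proportional normalization. The one behavior taken from the caption is the fallback to an equal ratio once the student matches or outperforms every expert.

```python
from typing import Dict


def balance_ratios(gaps: Dict[str, float]) -> Dict[str, float]:
    """Turn per-capability performance gaps into data-collection ratios.

    `gaps` maps a capability name to a gap value. ASSUMPTION: a positive
    gap means the student is still worse than the expert on that
    capability. This rule is hypothetical and is not the paper's formula.
    """
    if not gaps:
        raise ValueError("at least one capability is required")

    # Only capabilities where the student still lags attract extra data.
    lagging = {cap: max(gap, 0.0) for cap, gap in gaps.items()}
    total = sum(lagging.values())

    if total == 0.0:
        # The student matches or beats every expert: use an equal ratio,
        # as the Figure 4 caption describes after the fourth iteration.
        return {cap: 1.0 / len(gaps) for cap in gaps}

    # ASSUMPTION: ratios are proportional to the remaining gap.
    return {cap: gap / total for cap, gap in lagging.items()}


# Example: the student lags most on squeezing, so squeezing gets most data.
ratios = balance_ratios({"reaching": 0.1, "squeezing": 0.3, "avoiding": 0.0})

# Example: the student beats all experts, so the ratio becomes uniform.
uniform = balance_ratios({"reaching": -0.2, "squeezing": -0.1, "avoiding": 0.0})
```

Under this proportional rule, a capability with no remaining gap receives no new data in that iteration. A practical variant would likely keep a minimum share per capability to avoid forgetting, but that design choice is also an assumption and is not stated in the source.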