Multimodal Large Language Model for Visual Navigation

Yao-Hung Hubert Tsai, Vansh Dhar, Jialu Li, Bowen Zhang, Jian Zhang

TL;DR

This work reframes visual navigation as a fine-tuning problem for multimodal LLMs, replacing prompt engineering with a structured architecture that combines a history collector, a visual observation encoder, and a pre-trained LLM to output action probability distributions. Trained on HM3D human demonstrations and collision signals, the method outperforms state-of-the-art behavior cloning methods and reduces collision rates. Key contributions include a five-module architecture, a fixed-history encoder during fine-tuning, and a learning objective that leverages a blended distribution from BC and ground-truth actions. The results indicate that fine-tuning LLMs is a viable and practical approach to long-horizon visual navigation under partial observation.

Abstract

Recent efforts to enable visual navigation using large language models have mainly focused on developing complex prompt systems. These systems incorporate instructions, observations, and history into massive text prompts, which are then combined with pre-trained large language models to facilitate visual navigation. In contrast, our approach aims to fine-tune large language models for visual navigation without extensive prompt engineering. Our design involves a simple text prompt, current observations, and a history collector model that gathers information from previous observations as input. For output, our design provides a probability distribution of possible actions that the agent can take during navigation. We train our model using human demonstrations and collision signals from the Habitat-Matterport 3D Dataset (HM3D). Experimental results demonstrate that our method outperforms state-of-the-art behavior cloning methods and effectively reduces collision rates.

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Our approach leverages a fine-tuned multimodal large language model to solve object goal navigation.
  • Figure 2: Architecture for fine-tuning large language models for visual navigation. The history collector model encodes history features from the current observation and past history. The observation encoding model encodes observation features. The projection layer transforms the history and observation features into history tokens and observation tokens, respectively. The text prompt provides hints to the large language model (LLM) for visual navigation. The pre-trained LLM takes text tokens, history tokens, and observation tokens as input and generates a probability distribution over a set of actions as text output.
  • Figure 3: Qualitative comparison of LLMs fine-tuned for visual navigation with direct action output versus probability output. Results are shown for the same scene, the same initial location, and the same target object goal.
  • Figure 4: Qualitative comparison of LLMs fine-tuned for visual navigation with and without the collision check. Results are shown for the same scene, the same initial location, and the same target object goal.
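
The data flow described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the action names, feature and token widths, the running-average history collector, the shared random projection, and the linear action head standing in for the LLM are all assumptions made for illustration. In particular, the paper's model emits the distribution as text output from a pre-trained LLM, while this sketch simply applies a softmax to toy logits.

```python
import math
import random

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]  # hypothetical action set
FEAT_DIM = 4   # hypothetical feature width
TOKEN_DIM = 3  # hypothetical token width

rng = random.Random(0)  # fixed seed so the toy weights are reproducible


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def encode_observation(frame):
    """Stand-in observation encoder: average strided slices of a flat frame."""
    return [sum(frame[i::FEAT_DIM]) / len(frame[i::FEAT_DIM]) for i in range(FEAT_DIM)]


def collect_history(prev_history, obs_features, decay=0.9):
    """Stand-in history collector: mixes past history with the current observation."""
    return [decay * h + (1.0 - decay) * o for h, o in zip(prev_history, obs_features)]


PROJECTION = random_matrix(TOKEN_DIM, FEAT_DIM)            # toy projection layer
ACTION_HEAD = random_matrix(len(ACTIONS), 2 * TOKEN_DIM)   # toy stand-in for the LLM


def action_distribution(frame, prev_history):
    obs_features = encode_observation(frame)
    history = collect_history(prev_history, obs_features)
    obs_tokens = matvec(PROJECTION, obs_features)
    history_tokens = matvec(PROJECTION, history)
    # A real LLM would also consume the text prompt tokens; this toy head skips them.
    logits = matvec(ACTION_HEAD, history_tokens + obs_tokens)
    return dict(zip(ACTIONS, softmax(logits))), history


frame = [rng.random() for _ in range(16)]  # fake flattened observation
probs, history = action_distribution(frame, [0.0] * FEAT_DIM)
```

Returning the updated history alongside the distribution lets a caller feed it back in at the next step, which mirrors the caption's description of history being built from the current observation and past history.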