FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

Yunzhe Xu, Yiyuan Pan, Zhe Liu, Hesheng Wang

TL;DR

This work tackles outdoor urban Vision-and-Language Navigation by introducing FLAME, a Flamingo-based Multimodal LLM agent that navigates using interleaved text and vision with no context-length expansion. It adopts a three-phase tuning pipeline (single-perception, multiple-perception, end-to-end) aided by synthetic data (street-view captions, route summaries, and rationales) generated with GPT-4, enabling efficient end-to-end training on urban VLN tasks. On Touchdown and Map2seq, FLAME achieves state-of-the-art task completion (TC) improvements of 7.3% and 3.74%, respectively, and demonstrates robust reasoning via high RC and RA scores under self-consistency. The results indicate that carefully tuned MLLMs can surpass specialized VLN models in complex, real-world environments, highlighting the potential of multimodal reasoning for embodied navigation.
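
To make the "under self-consistency" setting concrete, the snippet below is a minimal sketch of one common way to realize it: sample several rationale-and-action outputs at a decision point and keep the majority action. The summary does not say whether FLAME uses exactly this procedure, so read it as an assumption. `sample_fn` and `self_consistent_action` are hypothetical names standing in for a stochastic decoding call to the agent.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistent_action(
    sample_fn: Callable[[], Tuple[str, str]],
    num_samples: int = 5,
) -> Tuple[str, List[str]]:
    """Majority-vote over sampled (rationale, action) pairs.

    sample_fn is a hypothetical stand-in for one stochastic decoding pass
    of the agent; it is not an API from the paper.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")

    samples = [sample_fn() for _ in range(num_samples)]

    # Count how often each action was proposed across the samples.
    votes = Counter(action for _, action in samples)
    best_action, _ = votes.most_common(1)[0]

    # Keep the rationales that argued for the winning action.
    supporting = [r for r, a in samples if a == best_action]
    return best_action, supporting
```

With three samples proposing "left", "forward", and "left", the function returns "left" together with the two rationales that supported it.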

Abstract

Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks, including single perception tuning for street view description, multiple perception tuning for route summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion on the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards applications of MLLMs in the field of embodied intelligence.

Paper Structure (31 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: LLM-based agents excel in conversation but often falter in specialized navigation tasks. Our agent, powered solely by a Multimodal LLM, demonstrates proficiency in navigation skills, efficiently adapting to navigation-specific scenarios through targeted finetuning phases.
  • Figure 2: Overview of FLAME's navigation process at time step $t_n$. The architecture, based on Flamingo, integrates vision modules for observation processing and decoder blocks for instruction and history handling. The finetuned STRIDED GATED XATTN layers prioritize recent observations in cross-attention computation. At key locations (intersections), FLAME can engage in reasoning before decision-making or proceed directly to action selection. The navigation process is autoregressive.
  • Figure 3: Illustration of the three-phase tuning for navigation and the synthetic data generation process. (a) The first phase trains the model on single-perception tasks. The second phase escalates to handling multi-perceptual input. Finally, the model undergoes end-to-end finetuning. (b) We utilize LLMs to generate street view captions, route summaries, and simple instructions to aid the training of the first two phases. (c) We further synthesize rationales to validate the reasoning capability of FLAME.
  • Figure 4: The effect of varying strides on TC and nDTW.
  • Figure 5: Qualitative analysis of FLAME's navigation performance. Superscripts on words and numbers in the top left corner of each image indicate the count of viewpoints encountered by the agent. The ground-truth actions are represented by colored arrows: red circular arrows for turn around, blue for turn right, and a circle for stop. Keyword and landmark alignment is highlighted by matching colors in responses and instructions. The top row shows actions taken by the baseline method ORAR (Schumann and Riezler, 2022), while subsequent rows display FLAME's responses.
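
Figure 4 reports TC and nDTW. As a rough illustration of what these two metrics measure, the sketch below computes normalized Dynamic Time Warping as exp(-DTW / (|reference| * threshold)) and a simple task-completion check. Several details are assumptions made for the sake of a runnable example: paths are lists of 2D points, distances are Euclidean rather than shortest paths on the street-view graph, and `threshold` and `radius` are free parameters rather than the benchmarks' own settings.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def euclidean(a: Point, b: Point) -> float:
    """Straight-line distance (assumed here in place of graph distance)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def dtw(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Dynamic Time Warping cost between a predicted and a reference path."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    # cost[i][j] = best alignment cost of pred[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(pred[i - 1], ref[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance along the prediction only
                cost[i][j - 1],      # advance along the reference only
                cost[i - 1][j - 1],  # advance along both
            )
    return cost[n][m]


def ndtw(pred: Sequence[Point], ref: Sequence[Point], threshold: float) -> float:
    """Normalized DTW in (0, 1]; 1.0 means the paths align perfectly."""
    return math.exp(-dtw(pred, ref) / (len(ref) * threshold))


def task_completion(pred: Sequence[Point], ref: Sequence[Point], radius: float) -> bool:
    """True if the agent stops within `radius` of the goal (assumed criterion)."""
    return euclidean(pred[-1], ref[-1]) <= radius
```

TC only looks at where the agent stops, while nDTW rewards following the reference route along the way, which is why the two are usually reported together.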