TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin

TL;DR

On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate of 51.09% and SPL of 47.18 in unseen environments, demonstrating that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling.

Abstract

Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication.

Project page: https://apex-bjut.github.io/Taga-VLM
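
The abstract reports SR and SPL without defining them. Under the standard R2R convention (an episode succeeds when the agent stops within 3 m of the goal, and SPL weights each success by the ratio of shortest-path length to the length actually travelled), the two metrics can be computed as in the minimal sketch below. The `Episode` fields and the sample values are illustrative assumptions, not data from the paper.

```python
from dataclasses import dataclass
from typing import List

# Standard R2R success threshold in metres (convention of the benchmark).
SUCCESS_RADIUS_M = 3.0


@dataclass
class Episode:
    """One navigation episode. Field names are hypothetical, for illustration."""
    final_dist_to_goal: float  # metres between the stop position and the goal
    path_length: float         # metres actually travelled by the agent
    shortest_length: float     # metres along the shortest start-to-goal path


def is_success(ep: Episode) -> bool:
    return ep.final_dist_to_goal < SUCCESS_RADIUS_M


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that end within the success radius."""
    return sum(is_success(ep) for ep in episodes) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for ep in episodes:
        denom = max(ep.path_length, ep.shortest_length)
        if is_success(ep) and denom > 0:
            total += ep.shortest_length / denom
    return total / len(episodes)


# Made-up episodes: an optimal success, a failure, and a success with a detour.
sample_episodes = [
    Episode(final_dist_to_goal=1.0, path_length=10.0, shortest_length=10.0),
    Episode(final_dist_to_goal=5.0, path_length=8.0, shortest_length=8.0),
    Episode(final_dist_to_goal=2.0, path_length=20.0, shortest_length=10.0),
]
sample_sr = success_rate(sample_episodes)
sample_spl = spl(sample_episodes)
```

Both functions return fractions in [0, 1]; the figures quoted in the abstract are on a 0 to 100 scale. Because a detour lowers SPL but not SR, the gap between the two indicates how much extra distance successful episodes cost.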

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Motivation of the proposed method. Previous methods (c) usually employ a two-stage pipeline in which a VLM converts visual observations to text and an LLM then makes the decision, losing crucial visual information. In contrast, our TagaVLM (b) is an end-to-end paradigm that preserves VLM pretraining knowledge while incorporating online topological map information, enabling global action decisions (a) with backtracking ability.
  • Figure 2: Overview of TagaVLM. The pretrained observation encoder and projector encode the RGB observations from each node into the semantic space. Textual information, comprising navigation system prompts and navigation observation descriptions, passes through embedding layers to obtain text embedding sequences. The observation feature sequences from each node are inserted into the text embedding sequences at the corresponding <image> placeholder positions in the navigation observation descriptions, yielding the Interleaved Navigation Prompt input. The LLM backbone is optimized with STAR-Att, which enforces awareness of edge-level relationships through pairwise node distance matrices. Finally, decisions are made in a global action space to guide the agent to the target node.
  • Figure 3: During navigation, if an unvisited candidate node is observed multiple times from different positions, it is represented by stitching together the images captured during all of these observations. E.g., if $Node_3$ is observed from both $Node_1$ and $Node_2$, the representation of $Node_3$ is formed by concatenating the images of $Node_3$ captured at $Node_1$ and at $Node_2$.
  • Figure 4: A successful case demonstrating TagaVLM's spatial topological awareness and path correction ability. (a) shows the navigation instruction, which contains two key landmarks: black chairs and a refrigerator. (b) shows TagaVLM's 6 navigation steps from the starting node $Node_1$ to the target destination. In $Step_1$, due to the absence of landmark references, TagaVLM selected an incorrect direction. In $Step_2$, TagaVLM leveraged its spatial topological awareness and performed global action reasoning: it selected $Node_5$, a candidate node of $Node_1$, backtracked from $Node_2$ to $Node_1$, and then moved to $Node_5$, successfully correcting the path. $Steps_{3-5}$ followed the instruction by moving to the front of the black chairs, turning right, and moving to the front of the refrigerator. Finally, in $Step_6$, TagaVLM made a stop decision and successfully reached the target destination.
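
The path correction described in Figure 4 and the pairwise distance matrices mentioned in Figure 2 can be illustrated on a toy topological map. The sketch below is not the authors' implementation: it assumes an undirected graph, uses hop counts rather than the distances the paper may use, and includes a hypothetical extra candidate (node 3) that does not appear in the caption. It shows how a globally selected node (node 5) becomes a low-level route that backtracks from node 2 through node 1.

```python
from collections import deque
from typing import Dict, List, Optional, Set

Graph = Dict[int, Set[int]]


def add_edge(graph: Graph, a: int, b: int) -> None:
    """Record an undirected edge between two nodes of the online map."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def shortest_path(graph: Graph, start: int, goal: int) -> Optional[List[int]]:
    """Breadth-first search; returns the node sequence start..goal, or None."""
    parents: Dict[int, Optional[int]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path: List[int] = []
            cur: Optional[int] = node
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def hop_distance_matrix(graph: Graph, nodes: List[int]) -> List[List[float]]:
    """Pairwise hop counts; unreachable pairs are marked with infinity."""
    matrix = []
    for a in nodes:
        row = []
        for b in nodes:
            path = shortest_path(graph, a, b)
            row.append(float("inf") if path is None else float(len(path) - 1))
        matrix.append(row)
    return matrix


# Toy map loosely following Figure 4: the agent starts at node 1, sees
# candidates 2 and 5, moves to 2, and sees node 3 (hypothetical) from there.
topo_map: Graph = {}
add_edge(topo_map, 1, 2)
add_edge(topo_map, 1, 5)
add_edge(topo_map, 2, 3)

# Global action: while standing at node 2, the selected target is node 5.
route = shortest_path(topo_map, start=2, goal=5)
distances = hop_distance_matrix(topo_map, nodes=[1, 2, 3, 5])
```

Here `route` is `[2, 1, 5]`, i.e. the backtrack-then-advance move from the caption. Selecting among all observed nodes rather than only the current neighbours is what lets a single decision undo an earlier mistake.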