NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation

Youzhi Liu, Fanglong Yao, Yuanchang Yue, Guangluan Xu, Xian Sun, Kun Fu

TL;DR

NavAgent is the first urban UAV embodied navigation model driven by a large Vision-Language Model. It builds a dynamically growing scene topology map that integrates environmental information, encodes global environmental data with Graph Convolutional Networks, and outperforms strong baseline models.

Abstract

Vision-and-Language Navigation (VLN), a widely discussed research direction in embodied intelligence, aims to enable embodied agents to navigate complicated visual environments by following natural language commands. Most existing VLN methods focus on indoor ground robot scenarios. When applied to UAV VLN in outdoor urban scenes, however, these methods face two significant challenges. First, urban scenes contain numerous objects, which makes it difficult to match fine-grained landmarks in images with the complex textual descriptions of those landmarks. Second, the overall environmental information spans multiple modalities, and the diversity of representations significantly increases the complexity of the encoding process. To address these challenges, we propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model. NavAgent performs navigation tasks by synthesizing multi-scale environmental information: topological maps (global), panoramas (medium), and fine-grained landmarks (local). Specifically, we use GLIP to build a landmark visual recognizer capable of identifying fine-grained landmarks and describing them in language. We then develop a dynamically growing scene topology map that integrates environmental information, and we employ Graph Convolutional Networks to encode global environmental data. In addition, to train the landmark visual recognizer, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes. In experiments conducted on the Touchdown and Map2seq datasets, NavAgent outperforms strong baseline models. The code and dataset will be released to the community to facilitate the exploration and development of outdoor VLN.
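To make the idea of a growing topology map encoded by graph convolutions more concrete, the following is a minimal sketch and not the authors' implementation. The `TopologyMap` class, its method names, the feature dimensions, and the random weight matrix are all hypothetical, and the layer uses the standard symmetric-normalized GCN propagation rule, which is an assumption about the encoder rather than something the abstract specifies.

```python
import numpy as np


class TopologyMap:
    """Hypothetical scene graph that grows as the agent visits new nodes."""

    def __init__(self, feat_dim):
        self.feat_dim = feat_dim
        self.features = []   # one 1-D feature vector per node
        self.edges = set()   # undirected edges stored as (i, j) with i < j

    def add_node(self, feature, neighbors=()):
        """Append a node and connect it to already-existing neighbor nodes."""
        feature = np.asarray(feature, dtype=float)
        if feature.shape != (self.feat_dim,):
            raise ValueError("feature has the wrong dimension")
        idx = len(self.features)
        for n in neighbors:
            if not 0 <= n < idx:
                raise IndexError("neighbor must be an existing node")
            self.edges.add((n, idx))
        self.features.append(feature)
        return idx

    def adjacency(self):
        """Dense symmetric adjacency matrix of the current graph."""
        n = len(self.features)
        adj = np.zeros((n, n))
        for i, j in self.edges:
            adj[i, j] = adj[j, i] = 1.0
        return adj


def gcn_layer(adj, x, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ x @ weight, 0.0)


# Grow a three-node chain, as if the agent moved along a street.
topo = TopologyMap(feat_dim=3)
a = topo.add_node([1.0, 0.0, 0.0])
b = topo.add_node([0.0, 1.0, 0.0], neighbors=[a])
c = topo.add_node([0.0, 0.0, 1.0], neighbors=[b])

rng = np.random.default_rng(0)
weight = rng.normal(size=(3, 2))                # placeholder for learned weights
node_feats = gcn_layer(topo.adjacency(), np.stack(topo.features), weight)
print(node_feats.shape)
```

Because the adjacency matrix is rebuilt from the current node and edge sets, the same layer can be re-applied each time the map grows; a trained model would use learned weights in place of the random matrix.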

Paper Structure

This paper contains 24 sections, 19 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: Schematic diagram of the VLN model augmented by multi-scale environment fusion. The yellow box contains the environment topology map, which holds the overall information about the environment; the red box contains the agent's observation image at the current point; the green box contains the fine-grained landmarks extracted from the observation image; and the black box contains the navigation text.
  • Figure 2: Examples of the Touchdown and Map2seq datasets. In Fig. (a), an example of the Touchdown dataset is presented, featuring the navigated gold route on the left, several nodes along the route with their corresponding observation images displayed above, and the navigation text at the bottom. An example of the Map2seq dataset is shown in Fig. (b), maintaining the same layout as in Fig. (a).
  • Figure 3: The construction process and a specific example of the NavAgent-Landmark2K dataset.
  • Figure 4: The overall pipeline. At step $t$, the region features $O$ extracted from the observation image $I_t$ are compared with the text features $B$ that the landmark text extractor produces from the landmark text, yielding a matching score; the matched landmarks are then verbalized by the Verbalizer to obtain the landmark information $X$. The environmental topology map $S$ is encoded by the topology map encoder to extract node features $P$. The node features $P$ and the current observation image features $I$ are combined through a cross-attention mechanism to compute the global feature $M_t$. Finally, the global feature $M_t$ and the landmark information $X$ are input into the LLM, which generates the action instructions.
  • Figure 5: Figure (a) shows the distribution of landmark text lengths in the dataset, and Figure (b) shows the distribution of landmark types.
  • ...and 6 more figures
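
The two numerical steps named in the Figure 4 caption, the region-to-text matching score and the cross-attention that produces the global feature $M_t$, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: cosine similarity stands in for the matching score, the image feature is assumed to act as the attention query over the node features, and all shapes and the random inputs are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def matching_scores(regions, texts):
    """Cosine similarity between every region feature and every text feature.

    Assumption: cosine similarity is used here for illustration only.
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    t = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    return r @ t.T


def cross_attention(query, keys, values):
    """Scaled dot-product attention of `query` over `keys` and `values`."""
    d = query.shape[-1]
    weights = softmax(query @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights


rng = np.random.default_rng(0)
O = rng.normal(size=(4, 8))   # 4 region features from the observation image
B = rng.normal(size=(2, 8))   # 2 landmark text features
P = rng.normal(size=(5, 8))   # 5 node features from the topology map encoder
I = rng.normal(size=(1, 8))   # current observation image feature

scores = matching_scores(O, B)             # shape (4, 2)
best_region = scores.argmax(axis=0)        # best-matching region per landmark
M_t, attn = cross_attention(I, P, P)       # assumed: image queries the nodes
print(scores.shape, M_t.shape)
```

In this sketch each row of `attn` sums to one, so `M_t` is a weighted average of the node features; a real model would add learned query, key, and value projections before the attention step.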