Aligning Knowledge Graph with Visual Perception for Object-goal Navigation

Nuo Xu, Wen Wang, Rong Yang, Mengjie Qin, Zheyuan Lin, Wei Song, Chunlong Zhang, Jason Gu, Chao Li

TL;DR

The paper tackles object-goal navigation from egocentric vision, where the discrete scene graphs built by traditional knowledge-graph (KG) navigators are misaligned with visual observations. It proposes AKGVP, which combines a continuous knowledge-graph representation of scenes with visual-language pre-training to align language descriptions with visual perception, enabling zero-shot navigation. A high-level controller based on Graph Convolutional Networks plans sub-goals on the continuous KG, while a low-level controller fuses multimodal features and learns action policies via A3C. Experiments on AI2-THOR show AKGVP outperforming state-of-the-art baselines, with strong zero-shot generalization and efficient navigation; code is released.

Abstract

Object-goal navigation is a challenging task that requires guiding an agent to specific objects based on first-person visual observations. The agent's ability to comprehend its surroundings plays a crucial role in finding objects successfully. However, existing knowledge-graph-based navigators often rely on discrete categorical one-hot vectors and a vote-counting strategy to construct graph representations of scenes, which results in misalignment with visual images. To provide more accurate and coherent scene descriptions and address this misalignment issue, we propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation. Technically, our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language descriptions with visual perception. The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability. We extensively evaluate our method using the AI2-THOR simulator and conduct a series of experiments to demonstrate the effectiveness and efficiency of our navigator. Code available: https://github.com/nuoxu/AKGVP.
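
The contrast the abstract draws between discrete one-hot node vectors and continuous, language-aligned features can be illustrated with a small sketch. This is not the authors' implementation: the three-dimensional vectors are made-up numbers standing in for the output of a pretrained visual-language encoder, and `cosine`, `best_match`, and `image_feature` are hypothetical names introduced only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Discrete one-hot codes: every pair of distinct categories is equally unrelated.
one_hot = {
    "mug": [1.0, 0.0, 0.0],
    "cup": [0.0, 1.0, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}

# Assumed continuous embeddings (illustrative values, not real encoder output).
continuous = {
    "mug": [0.9, 0.1, 0.2],
    "cup": [0.8, 0.2, 0.1],
    "sofa": [0.1, 0.9, 0.3],
}

# Assumed image feature living in the same space as the text embeddings.
image_feature = [0.88, 0.08, 0.22]

def best_match(feature, table):
    """Return the category whose embedding is most similar to the feature."""
    return max(table, key=lambda name: cosine(feature, table[name]))

one_hot_similarity = cosine(one_hot["mug"], one_hot["cup"])
related_similarity = cosine(continuous["mug"], continuous["cup"])
unrelated_similarity = cosine(continuous["mug"], continuous["sofa"])
matched = best_match(image_feature, continuous)
```

With one-hot codes, "mug" and "cup" have zero similarity, so the graph carries no notion that they are related. In a shared continuous space, related categories sit close together and an image feature can be compared directly against node features, which is the kind of alignment the abstract attributes to visual-language pre-training.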

Paper Structure (11 sections, 4 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The core idea of our AKGVP method. In order to effectively describe the same environment, we can leverage two modalities of data: knowledge graphs derived from natural language descriptions and observation images characterized by visual descriptions. Our primary objective is to align these two modalities within a shared feature space, facilitated by visual-language pre-training. Ultimately, these modalities are fused for decision-making.
  • Figure 2: Pipeline of our AKGVP method. AKGVP is composed of three essential components: an aligned encoder, a high-level controller, and a low-level controller. The encoder encodes images and natural language separately, and multimodal pre-training aligns the resulting features. The high-level controller leverages knowledge graph modeling to plan sub-goals for the navigator, directing the movement of the agent across different zones. The low-level controller uses the fused multimodal information to make action decisions, enabling the agent to interact with the environment and control its movements.
  • Figure 3: Qualitative results (zoom in for detailed viewing). The visualization of navigation results for the four navigators in three rooms is presented from left to right, along with the corresponding observations from the final frame of navigation. The red explosion icon denotes instances where the agent becomes disoriented and exhibits erratic behavior, such as spinning in circles, getting stuck on obstacles, or making repetitive small-scale rotations. The blue circle represents the starting position, which is consistent across all four navigators. The histogram shows the action probabilities, that is, the likelihood of the agent selecting each of the six actions. For additional instances of the comparison between these two methods, please refer to the accompanying video.
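
The Figure 2 caption describes a high-level controller that models the scene as a knowledge graph, and the TL;DR states that it is based on Graph Convolutional Networks. The sketch below shows one standard graph-convolution step, ReLU(D^(-1/2) (A + I) D^(-1/2) H W), on a toy graph of three zones. It is a generic illustration under assumptions, not the paper's architecture: the zone graph, feature values, and identity weight matrix are all hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def normalize_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj, features, weight):
    """One graph-convolution step: propagate, transform, apply ReLU."""
    propagated = matmul(normalize_adjacency(adj), features)
    transformed = matmul(propagated, weight)
    return [[max(0.0, x) for x in row] for row in transformed]

# Hypothetical zone graph: zone 0 -- zone 1 -- zone 2 (a simple path).
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

# Assumed continuous node features (two dimensions per zone).
node_features = [
    [1.0, 0.0],
    [0.0, 0.0],
    [0.0, 1.0],
]

# Identity weights keep the example easy to follow; real weights are learned.
weights = [
    [1.0, 0.0],
    [0.0, 1.0],
]

updated = gcn_layer(adjacency, node_features, weights)
```

After one step, the middle zone, which started with an all-zero feature, holds a mix of both neighbours' features. Repeating such steps lets information about distant zones reach each node, which is what makes graph convolutions a natural fit for planning sub-goals over a scene graph.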