Image-Goal Navigation Using Refined Feature Guidance and Scene Graph Enhancement

Zhicheng Feng, Xieyuanli Chen, Chenghao Shi, Lun Luo, Zhichao Chen, Yun-Hui Liu, Huimin Lu

TL;DR

The paper tackles image-goal navigation under limited visual input by introducing RFSG, a lightweight framework that fuses goal and observation features with refined guidance and scene context. It combines a dual-branch encoder, Spatial-Channel Attention (SCA) with a Weight Decoupling Module (WDM), and a parameter-free self-distillation mechanism, plus an Image Scene Graph that integrates image and instance features via a GCN to produce informative environmental features $F_{\text{env}}$. The navigation policy is learned end-to-end with PPO in an Actor-Critic setup, using $F_{\text{backbone}}$ and $F_{\text{env}}$ to decide actions $a_t$ while conditioning on prior actions and the hidden state $h_{t-1}$. Empirically, RFSG achieves state-of-the-art results on Gibson and HM3D with cross-scene generalization and real-time inference (up to $53.5$ FPS on an RTX 3080), and ablation studies confirm the additive benefits of feature fusion, SCA, WDM, self-distillation, and the scene graph.

Abstract

In this paper, we introduce a novel image-goal navigation approach, named RFSG. Our focus lies in leveraging the fine-grained connections between goals, observations, and the environment within limited image data, all the while keeping the navigation architecture simple and lightweight. To this end, we propose the spatial-channel attention mechanism, enabling the network to learn the importance of multi-dimensional features to fuse the goal and observation features. In addition, a self-distillation mechanism is incorporated to further enhance the feature representation capabilities. Given that the navigation task needs surrounding environmental information for more efficient navigation, we propose an image scene graph to establish feature associations at both the image and object levels, effectively encoding the surrounding scene information. Cross-scene performance validation was conducted on the Gibson and HM3D datasets, and the proposed method achieved state-of-the-art results among mainstream methods, with a speed of up to 53.5 frames per second on an RTX 3080. This contributes to the realization of end-to-end image-goal navigation in real-world scenarios. The implementation and model of our method have been released at: https://github.com/nubot-nudt/RFSG.
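
As a rough illustration of how a scene graph can associate image-level and object-level features, the sketch below runs one graph-convolution step over a tiny graph with one image node and two instance nodes. It is a minimal sketch under stated assumptions, not the released RFSG implementation: the symmetric-normalized propagation rule is the standard GCN layer, while the graph layout, the feature sizes, the function name `gcn_layer`, and the mean-pool readout standing in for the environmental feature are all hypothetical choices made for this example.

```python
import numpy as np


def gcn_layer(node_feats, adjacency, weight):
    """One standard GCN step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    node_feats: (N, F_in) node features.
    adjacency:  (N, N) symmetric 0/1 matrix without self-loops.
    weight:     (F_in, F_out) learnable projection (random here).
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1)                          # node degrees, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ node_feats @ weight, 0.0)


# Hypothetical graph: node 0 is an image node, nodes 1 and 2 are instance
# (object) nodes detected in that image, each linked to the image node.
adjacency = np.array([[0.0, 1.0, 1.0],
                      [1.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
node_feats = rng.normal(size=(3, 8))   # assumed 8-dim features per node
weight = rng.normal(size=(8, 4))       # assumed 4-dim output features

updated = gcn_layer(node_feats, adjacency, weight)
env_feat = updated.mean(axis=0)        # hypothetical readout, not from the paper
print(updated.shape, env_feat.shape)
```

After one step, each node's feature mixes its own content with that of its neighbours, so the image node carries information about the objects it contains and vice versa.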

Paper Structure

This paper contains 15 sections, 16 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: A robot relies only on the goal and observed images to accomplish navigation in the environment. We focus on building fine-grained feature guidance from the goal to the observed images, and on capturing important environmental information from multiple scene images through a scene graph.
  • Figure 2: Overall structure of the proposed RFSG. First, features are extracted by the ResNet9 backbone with self-distillation, and a feature fusion strategy Ⓕ is introduced to capture important goal and observation features at a fine-grained level. Next, the instance targets in the image are combined to construct a scene graph that captures the environment information. Finally, the scene graph feature $F_\text{env}$ is embedded into the backbone feature $F_\text{backbone}$, and the next action of the robot is generated by the Actor-Critic architecture. The robot's actions are FORWARD, BACKWARD, LEFT TURN, RIGHT TURN, and STOP.
  • Figure 3: The proposed spatial-channel attention. The module attends to important features along the spatial and channel dimensions separately, and then uses cross-multiplication $\otimes$ to obtain the output feature. Here, $\odot$ denotes matrix multiplication.
  • Figure 4: The proposed weight decoupling module. Three branches generate the affine transform factors $A$, $x$, and $b$, and the fusion weight $W$ is then generated using the multiplication operation $\otimes$ and the summation operation $\oplus$.
  • Figure 5: The self-distillation mechanism. The spatial features of the shallow feature are first moved into the channel dimension by a space-to-depth operation, then the parameter-free attention mechanism SimAM (Yang et al., 2021) is introduced to generate the 3D weights, and finally the model is supervised by the similarity between the 3D weights.
  • ...and 2 more figures
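
The self-distillation pipeline described in the Figure 5 caption (space-to-depth, then SimAM 3D weights, then a similarity-based supervision signal) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' code: the SimAM weights follow the published parameter-free formulation with a regularizer of `1e-4`, but the tensor shapes, the block size, the function names, and the use of cosine distance as the similarity loss are assumptions made for illustration, since the caption does not specify them.

```python
import numpy as np


def space_to_depth(x, block):
    """Move each block x block spatial patch into channels.

    x: (C, H, W) feature map; returns (C * block**2, H // block, W // block).
    """
    c, h, w = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)   # (C, block, block, H/block, W/block)
    return x.reshape(c * block * block, h // block, w // block)


def simam_weights(x, e_lambda=1e-4):
    """Parameter-free SimAM 3D attention weights for a (C, H, W) feature map."""
    n = x.shape[1] * x.shape[2] - 1                      # needs H * W > 1
    d = (x - x.mean(axis=(1, 2), keepdims=True)) ** 2    # squared deviation
    var = d.sum(axis=(1, 2), keepdims=True) / n          # per-channel variance
    energy = d / (4.0 * (var + e_lambda)) + 0.5
    return 1.0 / (1.0 + np.exp(-energy))                 # sigmoid, one weight per element


def distill_loss(shallow, deep, block):
    """Hypothetical loss: cosine distance between shallow and deep 3D weights."""
    w_s = simam_weights(space_to_depth(shallow, block)).ravel()
    w_d = simam_weights(deep).ravel()
    cos = w_s @ w_d / (np.linalg.norm(w_s) * np.linalg.norm(w_d))
    return 1.0 - cos


rng = np.random.default_rng(0)
shallow = rng.normal(size=(4, 8, 8))   # assumed shallow feature (C=4, 8x8)
deep = rng.normal(size=(16, 4, 4))     # assumed deep feature matching 4*2*2 channels
print(distill_loss(shallow, deep, block=2))
```

Space-to-depth is what makes the comparison possible here: it brings the higher-resolution shallow feature to the same shape as the deep one without discarding any values, so the two sets of 3D weights can be compared element by element.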