TDANet: Target-Directed Attention Network For Object-Goal Visual Navigation With Zero-Shot Ability

Shiwei Lian, Feitian Zhang

TL;DR

TDANet addresses the challenge of generalizing object-goal visual navigation to unseen scenes and unseen target objects by introducing a Target Attention (TA) module and a Siamese architecture that learn a domain-independent representation. The TA module learns spatial and semantic correspondences between observed objects and the target, while the Siamese branches compare current and target states to enable zero-shot navigation within an end-to-end A3C framework. Empirical results in AI2-THOR show TDANet outperforms state-of-the-art baselines in both seen and unseen scenarios, with substantial improvements in SR and SPL, and its real-world deployment on a TurtleBot4 confirms practical generalization. The combination of adaptive object-focused attention and state-difference learning yields robust navigation policies capable of operating with unseen objects in real environments.
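
The idea of attending to the observed objects most relevant to the target can be illustrated with a generic scaled dot-product attention. This is only a minimal sketch under stated assumptions, not the TA module as implemented in the paper: the function name `target_attention`, the plain-list feature vectors, and the dot-product scoring are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_attention(object_feats, target_feat):
    """Weight observed-object features by their similarity to the target.

    Hypothetical helper: `object_feats` is a list of equal-length feature
    vectors (one per observed object) and `target_feat` is the target's
    feature vector. Scoring is a scaled dot product, an assumption made
    for this sketch rather than the paper's formulation.
    """
    if not object_feats:
        raise ValueError("at least one observed object is required")
    dim = len(target_feat)
    scale = math.sqrt(dim)
    # One relevance score per observed object.
    scores = [
        sum(o * t for o, t in zip(feat, target_feat)) / scale
        for feat in object_feats
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the object features.
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, object_feats))
        for d in range(dim)
    ]
    return weights, attended


# Toy example: three observed objects, two-dimensional features.
objects = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
target = [1.0, 0.0]
weights, attended = target_attention(objects, target)
print(weights, attended)
```

In this toy setting, the object whose feature vector aligns best with the target receives the largest weight, which mirrors the "focus on the most relevant objects" behavior described above.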

Abstract

Generalizing end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge, since object classes and placements vary in new test environments. Learning a domain-independent visual representation is critical for enabling the trained DRL agent to generalize to unseen scenes and objects. In this letter, a target-directed attention network (TDANet) is proposed to learn an end-to-end object-goal visual navigation policy with zero-shot ability. TDANet features a novel target attention (TA) module that learns both the spatial and semantic relationships among objects to help TDANet focus on the observed objects most relevant to the target. With the Siamese architecture (SA) design, TDANet distinguishes the difference between the current and target states and generates a domain-independent visual representation. To evaluate the navigation performance of TDANet, extensive experiments are conducted in the AI2-THOR embodied AI environment. The simulation results demonstrate strong generalization of TDANet to unseen scenes and target objects, with higher navigation success rate (SR) and success weighted by path length (SPL) than other state-of-the-art models. Finally, TDANet is deployed on a wheeled robot in real scenes, demonstrating satisfactory generalization to the real world.
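
The two evaluation metrics named above can be made concrete with a short sketch. It assumes the commonly used definitions: SR is the fraction of successful episodes, and SPL is $\frac{1}{N}\sum_{i=1}^{N} S_i \, \frac{l_i}{\max(p_i, l_i)}$, where $S_i$ is the success indicator, $l_i$ the shortest-path length, and $p_i$ the length of the path the agent actually took. The dictionary keys (`success`, `shortest`, `taken`) are hypothetical names chosen for illustration, not taken from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for ep in episodes if ep["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes.

    Failed episodes contribute zero; successful ones contribute the ratio
    of the shortest-path length to the (at least as long) path taken.
    """
    total = 0.0
    for ep in episodes:
        if ep["success"]:
            total += ep["shortest"] / max(ep["taken"], ep["shortest"])
    return total / len(episodes)


# Toy data: two successes (one with a detour) and one failure.
episodes = [
    {"success": True, "shortest": 10.0, "taken": 20.0},
    {"success": True, "shortest": 5.0, "taken": 5.0},
    {"success": False, "shortest": 8.0, "taken": 30.0},
]
print(success_rate(episodes), spl(episodes))
```

Because SPL discounts successes reached via long detours, a model can have a high SR but a noticeably lower SPL, which is why both numbers are usually reported together.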

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Overview of TDANet. The input is the fused data of bounding boxes and word embeddings of the observed objects and the target object $t$. The target attention module learns the correspondence between the observed objects and the target object $t$ and selects features of the objects most relevant to the target. The Siamese architecture compares the current state with the target state and generates the visual representation. The A3C DRL model is adopted to learn the navigation policy and is trained with rewards from the environment.
  • Figure 2: The SR and SPL in the test set evaluated at each training episode of selected comparison models.
  • Figure 3: The visualization of the TA module. Only the bounding boxes of objects with higher correspondence values are labeled. The darker red color of a bounding box indicates a higher correspondence value. The target object is marked with the blue bounding box. The results demonstrate that the TA successfully learns the correspondence between the objects and the target.
  • Figure 4: The sampled trajectories of MJOLNIR and our TDANet along with the number of navigation steps. Red and green trajectories represent success and failure, respectively. The white triangle indicates the field of view of the agent. The target object is marked with a blue bounding box.
  • Figure 5: The comparison of TDANet and the SA-only network for unseen object goal navigation. The target object Pillow is marked with the blue bounding box. (a) TDANet predicts the right action by focusing on objects related to the target. (b) Without the help of the TA module, the SA-only network is distracted by unrelated objects and predicts the wrong action.
  • ...and 1 more figure