TDANet: Target-Directed Attention Network For Object-Goal Visual Navigation With Zero-Shot Ability
Shiwei Lian, Feitian Zhang
TL;DR
TDANet addresses the challenge of generalizing object-goal visual navigation to unseen scenes and unseen target objects by introducing a Target Attention (TA) module and a Siamese architecture that together learn a domain-independent representation. The TA module learns spatial and semantic correspondences between observed objects and the target, while the Siamese branches compare the current and target states to enable zero-shot navigation within an end-to-end A3C framework. Experiments in AI2-THOR show that TDANet outperforms state-of-the-art baselines in both seen and unseen scenarios, with higher success rate (SR) and success weighted by path length (SPL), and a real-world deployment on a TurtleBot4 indicates that the learned policy transfers beyond simulation. The combination of adaptive object-focused attention and state-difference learning yields navigation policies that can handle unseen objects in real environments.
Abstract
Generalizing end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge, since object classes and placements vary in new test environments. Learning a domain-independent visual representation is critical for enabling the trained DRL agent to generalize to unseen scenes and objects. In this letter, a target-directed attention network (TDANet) is proposed to learn an end-to-end object-goal visual navigation policy with zero-shot ability. TDANet features a novel target attention (TA) module that learns both the spatial and semantic relationships among objects, helping TDANet focus on the observed objects most relevant to the target. With its Siamese architecture (SA) design, TDANet captures the difference between the current and target states and generates a domain-independent visual representation. To evaluate the navigation performance of TDANet, extensive experiments are conducted in the AI2-THOR embodied AI environment. The simulation results demonstrate that TDANet generalizes strongly to unseen scenes and target objects, with higher navigation success rate (SR) and success weighted by path length (SPL) than other state-of-the-art models. Finally, TDANet is deployed on a wheeled robot in real scenes, demonstrating satisfactory generalization to the real world.
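The two metrics named above can be made concrete with a short sketch. It assumes the standard definitions used in embodied navigation: SR is the fraction of successful episodes, and SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the binary success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took in episode i. The function names and the sample numbers are hypothetical and are not taken from the paper's evaluation code or results.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target (SR)."""
    if not successes:
        raise ValueError("at least one episode is required")
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    agent_lengths: Sequence[float],
) -> float:
    """Success weighted by path length (SPL), standard definition.

    Each successful episode contributes l / max(p, l); a failed episode
    contributes 0. The sum is averaged over all episodes.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if len(shortest_lengths) != n or len(agent_lengths) != n:
        raise ValueError("all inputs must have one entry per episode")

    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, agent_lengths):
        if l <= 0:
            # Assumption of this sketch: start and goal never coincide.
            raise ValueError("shortest-path lengths must be positive")
        if s:
            total += l / max(p, l)
    return total / n


# Hypothetical episodes: two successes (one optimal, one twice as long
# as the shortest path) and one failure.
demo_successes = [True, True, False]
demo_shortest = [2.0, 3.0, 4.0]
demo_agent = [2.0, 6.0, 5.0]

demo_sr = success_rate(demo_successes)
demo_spl = spl(demo_successes, demo_shortest, demo_agent)
```

Because each term is capped at 1 and failures contribute 0, SPL never exceeds SR. An agent that succeeds often but wanders scores a high SR and a noticeably lower SPL, which is why the two are reported together.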
