DiaLoc: An Iterative Approach to Embodied Dialog Localization
Chao Zhang, Mohan Li, Ignas Budvytis, Stephan Liwicki
TL;DR
DiaLoc introduces an iterative, dialog-driven localization framework that continually fuses vision and language across turns to localize an observer on a top-down map. Built on Transformer encoders with explicit and implicit fusion variants, it outputs per-turn location predictions and optimizes with a multi-shot loss plus an auxiliary diversity term. The approach achieves state-of-the-art results on the WAY dataset in both single-shot and multi-shot settings, notably improving Acc5 on unseen environments, and demonstrates improved generalization and early stopping potential. By aligning localization with human-like iterative querying, DiaLoc narrows the sim-to-real gap and enables more efficient, robust collaborative localization for search-and-rescue and related tasks.
Abstract
Multimodal learning has advanced performance on many vision-language tasks. However, most existing work in embodied dialog research focuses on navigation and leaves the localization task understudied. The few existing dialog-based localization approaches assume that the entire dialog is available prior to localization, which is impractical for deployed dialog-based localization. In this paper, we propose DiaLoc, a new dialog-based localization framework that aligns with real human operator behavior. Specifically, we produce an iterative refinement of location predictions, which allows the current pose beliefs to be visualized after each dialog turn. DiaLoc effectively utilizes multimodal data for multi-shot localization, where a fusion encoder fuses vision and dialog information iteratively. We achieve state-of-the-art results on the embodied dialog-based localization task in both single-shot (+7.08% in Acc5@valUnseen) and multi-shot (+10.85% in Acc5@valUnseen) settings. DiaLoc narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.
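To make the per-turn interface concrete, the following is a minimal toy sketch of iterative refinement over a top-down grid map. It is not DiaLoc's learned Transformer fusion encoder: a simple multiplicative belief update stands in for the fusion step, and the per-turn score maps are assumed to be given rather than predicted from vision and dialog. The names `refine_belief`, `argmax_cell`, and `localize_iteratively` are hypothetical and chosen only for illustration.

```python
def refine_belief(prior, turn_scores):
    """Fuse the prior belief with one turn's score map.

    Toy stand-in for a learned fusion step: an element-wise product
    followed by normalization. Assumes non-negative scores.
    """
    fused = [[p * s for p, s in zip(prior_row, score_row)]
             for prior_row, score_row in zip(prior, turn_scores)]
    total = sum(sum(row) for row in fused)
    if total == 0:
        # The turn carried no usable evidence; keep the previous belief.
        return prior
    return [[value / total for value in row] for row in fused]


def argmax_cell(belief):
    """Return the (row, col) of the highest-belief cell (first one on ties)."""
    cells = ((value, (r, c))
             for r, row in enumerate(belief)
             for c, value in enumerate(row))
    return max(cells, key=lambda item: item[0])[1]


def localize_iteratively(turn_score_maps, rows, cols):
    """Emit one location prediction after every dialog turn."""
    # Start from a uniform belief over all map cells.
    belief = [[1.0 / (rows * cols)] * cols for _ in range(rows)]
    predictions = []
    for turn_scores in turn_score_maps:
        belief = refine_belief(belief, turn_scores)
        predictions.append(argmax_cell(belief))
    return predictions, belief


# Hypothetical 3x3 map with two dialog turns.
turn_1 = [[0.50, 0.01, 0.01],
          [0.01, 0.01, 0.01],
          [0.01, 0.01, 0.40]]
turn_2 = [[0.10, 0.01, 0.01],
          [0.01, 0.01, 0.01],
          [0.01, 0.01, 0.90]]
predictions, final_belief = localize_iteratively([turn_1, turn_2], 3, 3)
print(predictions)
```

Because a prediction is available after every turn, a caller could stop the dialog as soon as the belief is sufficiently peaked, which is the kind of early stopping the multi-shot setting makes possible.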
