DiaLoc: An Iterative Approach to Embodied Dialog Localization

Chao Zhang, Mohan Li, Ignas Budvytis, Stephan Liwicki

TL;DR

DiaLoc introduces an iterative, dialog-driven localization framework that continually fuses vision and language across turns to localize an observer on a top-down map. Built on Transformer encoders with explicit and implicit fusion variants, it outputs per-turn location predictions and optimizes with a multi-shot loss plus an auxiliary diversity term. The approach achieves state-of-the-art results on the WAY dataset in both single-shot and multi-shot settings, notably improving Acc5 on unseen environments, and demonstrates improved generalization and early stopping potential. By aligning localization with human-like iterative querying, DiaLoc narrows the sim-to-real gap and enables more efficient, robust collaborative localization for search-and-rescue and related tasks.

Abstract

Multimodal learning has advanced performance on many vision-language tasks. However, most existing work in embodied dialog research focuses on navigation and leaves the localization task understudied. The few existing dialog-based localization approaches assume that the entire dialog is available prior to localization, which is impractical for deployed dialog-based localization. In this paper, we propose DiaLoc, a new dialog-based localization framework that aligns with the behavior of a real human operator. Specifically, we produce an iterative refinement of location predictions, which can visualize the current pose beliefs after each dialog turn. DiaLoc effectively utilizes multimodal data for multi-shot localization, where a fusion encoder fuses vision and dialog information iteratively. We achieve state-of-the-art results on the embodied dialog-based localization task, in both single-shot (+7.08% in Acc5@valUnseen) and multi-shot (+10.85% in Acc5@valUnseen) settings. DiaLoc narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.

Paper Structure (23 sections, 4 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Illustration of iterative embodied dialog localization. The locator with top-down map and the observer with egocentric view engage in a cooperative dialog to assist in determining the observer's location. The locator iteratively forms predictions to enhance the estimations. The preceding dialog exchanges and cumulative predictions also influence the manner in which the locator poses new questions. As depicted, the locator predicts 3 possible locations based on the first turn. The second question is asked to disambiguate the predictions and the correct location is predicted given the answer from the observer.
  • Figure 2: DiaLoc-e: the proposed multi-shot multimodal architecture for embodied dialog localization. The approach employs an image encoder and a frozen text encoder to derive visual and linguistic unimodal embeddings. The fusion encoder integrates the unimodal inputs to update the hidden state. Multi-shot predictions are produced from the hidden state at different timesteps. The fusion encoder comprises $N$ Transformer encoder blocks, each with a cross-attention layer.
  • Figure 3: DiaLoc-i: the proposed localizer variant with implicit fusion design. The hidden state is continuously updated with dialog information at each timestep.
  • Figure 4: Qualitative results of single-shot and multi-shot location predictions. The first column shows the visual map with its ground truth (GT) location. The second column shows the single-shot predictions, with LingUNet results above and DiaLoc results below. The last three columns show the multi-shot predictions. In valSeen case 67, DiaLoc corrects the prediction after the second turn, whereas LingUNet-ms produces noisy distributions. In valUnseen case 245, both LingUNet and DiaLoc fail in the single-shot mode. In the multi-shot mode, however, DiaLoc succeeds in refining the prediction, while LingUNet-ms converges towards an incorrect area. Localization Error (LE) is displayed for the predictions.
  • Figure 5: CMC curves on the WAY dataset. We plot the CMC curves for DiaLoc and LingUNet in single-shot and multi-shot settings. DiaLoc consistently outperforms the baseline. The x-axis denotes the error threshold for LE and the y-axis denotes the success rate.
  • ...and 5 more figures
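
The Figure 5 caption describes a curve of success rate against a Localization Error (LE) threshold, and the abstract reports Acc5. The relationship between the two can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes that LE is measured in meters, that a prediction counts as a success when its LE is at or below the threshold, and that Acc5 corresponds to the 5 m point on that curve. The function names and the sample errors are hypothetical.

```python
import numpy as np


def success_rate(errors_m, threshold_m):
    """Fraction of predictions whose localization error is within the threshold.

    Assumption: a prediction succeeds when LE <= threshold (inclusive).
    """
    errors = np.asarray(errors_m, dtype=float)
    return float(np.mean(errors <= threshold_m))


def cmc_curve(errors_m, thresholds_m):
    """Success rate at each error threshold (the y-values of a CMC-style curve)."""
    return [success_rate(errors_m, t) for t in thresholds_m]


# Hypothetical localization errors in meters, for illustration only.
example_errors = [1.2, 4.8, 7.5, 0.3, 12.0]
example_thresholds = [0, 5, 10, 15]

example_curve = cmc_curve(example_errors, example_thresholds)
example_acc5 = success_rate(example_errors, 5)
```

Under these assumptions, an accuracy-at-threshold number is a single point read off the curve, so a model whose curve lies above another's at every threshold also has a higher accuracy at any chosen threshold.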