Interpretable Perception and Reasoning for Audiovisual Geolocation

Yiyang Su, Xiaoming Liu

TL;DR

The results indicate that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.

Abstract

While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that uses a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results indicate that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
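
To make the third stage more concrete, the following is a minimal sketch of the geodesic interpolation that Riemannian Flow Matching commonly uses on the sphere $S^2$. It is not taken from the paper: the function names, the use of a great-circle conditional path, and the time convention (base sample at $t=0$, ground-truth location at $t=1$) are assumptions made for illustration. A velocity network would typically be trained to regress the tangent vector returned here.

```python
import math

def latlon_to_unit(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a unit vector on S^2."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def unit_to_latlon(p):
    """Inverse of latlon_to_unit; returns (lat_deg, lon_deg)."""
    x, y, z = p
    lat = math.asin(max(-1.0, min(1.0, z)))
    return math.degrees(lat), math.degrees(math.atan2(y, x))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def geodesic_point_and_velocity(x0, x1, t):
    """Point x_t on the great-circle arc from x0 to x1, and its time derivative.

    Assumed convention (not from the paper): x0 is a base sample, x1 is the
    target location, and t runs from 0 to 1.
    """
    omega = math.acos(max(-1.0, min(1.0, dot(x0, x1))))  # arc length in radians
    if omega < 1e-9:
        # Coincident points: the path is constant.
        return x0, (0.0, 0.0, 0.0)
    s = math.sin(omega)
    if s < 1e-9:
        # Antipodal points: the geodesic is not unique.
        raise ValueError("geodesic between antipodal points is not unique")
    a = math.sin((1.0 - t) * omega) / s
    b = math.sin(t * omega) / s
    da = -omega * math.cos((1.0 - t) * omega) / s
    db = omega * math.cos(t * omega) / s
    x_t = tuple(a * p + b * q for p, q in zip(x0, x1))
    v_t = tuple(da * p + db * q for p, q in zip(x0, x1))
    return x_t, v_t
```

The returned velocity is tangent to the sphere at `x_t` and has constant norm equal to the arc length, which is what makes it a convenient regression target under these assumptions.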

Paper Structure

This paper contains 34 sections, 5 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: In the visual-only approach, ambiguous features such as trees and bridges can lead to multiple location candidates. In the audio-only approach, overlapping urban sounds create complex signals that are hard to decipher. By combining cues from both modalities, our proposed framework disambiguates the candidates to pinpoint the correct location.
  • Figure 2: We process audiovisual input in three steps. (1) Perception: visual and audio encoders extract interpretable elements from foundation models. (2) Reasoning: an MLLM, fine-tuned via GRPO, analyzes these attributes to generate geographically rich embeddings. (3) Prediction: a Riemannian Flow Matching model generates a probability density function on the Earth's surface, conditioned on the reasoning output.
  • Figure 3: (a) We randomly sample clips from AudioSet and generate synthetic mixtures as a weighted sum, where the gains are randomly assigned but monotonically decreasing ($g_1 > g_2 > \dots$). (b) We employ an autoregressive pipeline to iteratively decompose the audio. In each iteration, we select the kernel from the dictionary with the highest activation and subtract the reconstructed audio from the mixture. A cross-entropy loss encourages correct semantic labels, and a reconstruction loss minimizes the difference between the reconstructed audio and the original audio.
  • Figure 4: Qualitative result of interpretable perception and reasoning.
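
The mixing and decomposition loop described in the Figure 3 caption can be illustrated with a small sketch. It is a simplified stand-in, not the paper's model: a fixed dictionary of unit-norm kernels and inner-product activations (in the style of matching pursuit) replace the learned sparse autoencoder, the losses are omitted, and the function names and the gain range are hypothetical.

```python
import random

def mix_with_decreasing_gains(clips, rng: random.Random):
    """Weighted sum of equal-length clips with random gains g_1 > g_2 > ...

    The gain range (0.1, 1.0) is an illustrative assumption.
    """
    gains = sorted((rng.uniform(0.1, 1.0) for _ in clips), reverse=True)
    n = len(clips[0])
    mixture = [sum(g * clip[i] for g, clip in zip(gains, clips)) for i in range(n)]
    return mixture, gains

def greedy_decompose(mixture, dictionary, steps):
    """Iteratively peel off the kernel with the highest activation.

    Assumes every kernel in `dictionary` has unit norm, so the inner product
    with the residual serves as the activation. Returns the selected
    (kernel_index, activation) pairs in order, plus the final residual.
    """
    residual = list(mixture)
    picked = []
    for _ in range(steps):
        activations = [sum(r * k for r, k in zip(residual, kernel))
                       for kernel in dictionary]
        best = max(range(len(dictionary)), key=lambda j: activations[j])
        picked.append((best, activations[best]))
        # Subtract the reconstruction of the selected kernel from the residual.
        residual = [r - activations[best] * k
                    for r, k in zip(residual, dictionary[best])]
    return picked, residual
```

Because the gains decrease monotonically, a greedy pass of this kind tends to recover the sources from loudest to quietest, which gives the autoregressive decomposition a well-defined target order. With an orthonormal dictionary the recovery is exact; with overlapping real-world kernels it would only be approximate.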