WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization

Jiahao Wen, Hang Yu, Zhedong Zheng

TL;DR

WeatherPrompt tackles drone visual geo-localization under varied and unseen weather by introducing a training-free weather reasoning pipeline powered by large vision-language models with Chain-of-Thought prompting. It then couples this textual weather knowledge with a text-driven gating mechanism that fuses visual and textual features, producing weather-invariant representations for cross-view localization. The model is trained with image-text contrastive (ITC), image-text matching (ITM), localized alignment, and cross-entropy (CE) losses, enabling robust geo-localization without online fine-tuning. Experiments on University-1652 and SUES-200 show substantial gains in recall and AP under night, fog, and snow conditions while maintaining strong performance across overall weather scenarios.
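To make the text-driven gating idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the gate is a sigmoid over a linear projection of the text embedding and that fusion is a per-dimension convex combination of the visual and text features. The function name `text_gated_fusion`, the parameters `gate_weights` and `gate_bias`, and the fusion formula itself are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def text_gated_fusion(visual, text, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors using a gate computed from text only.

    Assumed form (not taken from the paper):
        gate_j  = sigmoid(sum_k gate_weights[j][k] * text[k] + gate_bias[j])
        fused_j = gate_j * visual_j + (1 - gate_j) * text_j
    """
    # One gate value per output dimension, driven purely by the text embedding.
    gates = [
        sigmoid(sum(w * t for w, t in zip(row, text)) + b)
        for row, b in zip(gate_weights, gate_bias)
    ]
    # Convex combination: a gate near 1 keeps the visual feature,
    # a gate near 0 falls back on the text feature.
    fused = [g * v + (1.0 - g) * t for g, v, t in zip(gates, visual, text)]
    return fused, gates


if __name__ == "__main__":
    visual = [1.0, 3.0]
    text = [3.0, 1.0]
    zero_weights = [[0.0, 0.0], [0.0, 0.0]]
    fused, gates = text_gated_fusion(visual, text, zero_weights, [0.0, 0.0])
    print(gates, fused)
```

With zero weights and zero bias every gate equals 0.5, so the fused vector is the plain average of the two inputs; a learned projection would instead shift the balance per dimension depending on the weather description.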

Abstract

Visual geo-localization for drones faces critical degradation under weather perturbations, e.g., rain and fog, where existing methods struggle with two inherent limitations: 1) heavy reliance on limited weather categories, which constrains generalization, and 2) suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. We present WeatherPrompt, a multi-modality learning paradigm that establishes weather-invariant representations by fusing the image embedding with the text context. Our framework introduces two key contributions. First, a Training-free Weather Reasoning mechanism employs off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It improves scalability to unseen or complex weather and can reflect different weather intensities. Second, to better disentangle scene and weather features, we propose a multi-modality framework with a dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized by cross-modal objectives, including image-text contrastive learning and image-text matching, which map the same scene under different weather conditions closer together in the representation space. Extensive experiments validate that, under diverse weather conditions, our method achieves competitive recall rates compared to state-of-the-art drone geo-localization methods. Notably, it improves Recall@1 by +13.37% under night conditions and by +18.69% under fog and snow conditions.
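The image-text contrastive objective mentioned in the abstract is commonly written as a symmetric InfoNCE loss over a batch of matched image and text embeddings. The sketch below shows that standard form in plain Python. Whether the paper uses exactly this formulation is an assumption, as are the temperature value of 0.07 and the helper names.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        # Numerically stable log-sum-exp of the row.
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch; pair i of images and texts is the positive."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    # Cosine similarity scaled by the temperature: rows are images, columns are texts.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_diag_cross_entropy(sim) + _diag_cross_entropy(sim_transposed))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(itc_loss(images, aligned_texts), itc_loss(images, swapped_texts))
```

Minimizing this loss pulls each image toward its own description and pushes it away from the other descriptions in the batch, which is one way to bring the same scene under different weather conditions closer together in the representation space.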

Paper Structure

This paper contains 10 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example of the proposed Chain-of-Thought description and matching. Our framework generates structured weather and spatial text descriptions via stepwise reasoning. We leverage an off-the-shelf Visual Grounding Model (VGM), i.e., XVLM, to extract local region cues, which are integrated to further refine the matching process. Finally, we match images using the weather description, global scene layout, and local region semantics to retrieve the corresponding satellite-view image.
  • Figure 2: The proposed training-free weather reasoning mechanism. We synthesize drone-view images with diverse weather conditions based on the University-1652 and SUES-200 datasets, covering complex scenarios such as fog, rain, snow, and nighttime. For each synthesized image, we first employ stepwise Chain-of-Thought prompting to generate open-set weather descriptions, including global assessment, local detail analysis, and weather inference. Guided by the inferred weather prior, we then sequentially reason about the scene’s macro layout, structural elements, and topological relationships, ultimately producing high-quality, structured image–text pairs.
  • Figure 3: The proposed multimodal alignment framework. Our model extracts global and local features from drone images and multi-step weather captions, performs multi-granular image-text alignment, and dynamically fuses modalities via weather-driven gating for robust geo-localization.
  • Figure 4: Qualitative comparison under varying weather. While existing methods perform reliably under clear weather, their accuracy drops markedly in adverse conditions. Our approach maintains superior localization performance, especially when drone images are severely affected by weather. Green boxes indicate correct matches, while red boxes indicate incorrect matches.
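The stepwise prompting described in the Figure 2 caption can be sketched as a simple prompt builder. The six step names come from the caption; the prompt wording, the function name `build_cot_prompt`, and the idea of sending a single combined prompt to the model are hypothetical, since the source does not give the actual prompt text.

```python
# Step names follow the Figure 2 caption; everything else is illustrative.
WEATHER_STEPS = [
    "global assessment",
    "local detail analysis",
    "weather inference",
]
SCENE_STEPS = [
    "macro layout",
    "structural elements",
    "topological relationships",
]


def build_cot_prompt(weather_steps=WEATHER_STEPS, scene_steps=SCENE_STEPS) -> str:
    """Assemble a Chain-of-Thought prompt: weather reasoning first, scene reasoning second."""
    lines = ["Describe the drone-view image by reasoning in order."]
    number = 1
    for name in weather_steps:
        lines.append(f"Step {number} (weather): {name}.")
        number += 1
    # The scene steps are conditioned on the weather inferred in the earlier steps.
    lines.append("Using the inferred weather as a prior, continue:")
    for name in scene_steps:
        lines.append(f"Step {number} (scene): {name}.")
        number += 1
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cot_prompt())
```

Keeping the weather steps ahead of the scene steps mirrors the caption's ordering, in which the inferred weather prior guides the later reasoning about layout, structure, and topology.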