AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

Shixiong Xu, Chenghao Zhang, Lubin Fan, Gaofeng Meng, Shiming Xiang, Jieping Ye

TL;DR

AddressCLIP tackles Image Address Localization (IAL): predicting a human-readable address for an image taken anywhere in a city, without a retrieval gallery or a two-stage coordinates-to-address pipeline. It combines image-text alignment, which uses scene captions as additional supervision alongside the address text, with a geography-aware constraint on image features, trained end to end with a joint loss and a semantic address partition strategy. The authors introduce the Pitts-IAL, SF-IAL-Base, and SF-IAL-Large datasets with street-level annotations and report Top-1 accuracy above 80% across datasets, up to 85.92% on SF-IAL-Large, outperforming CLIP-transfer baselines by roughly 3–6%. They also provide extensive ablations, qualitative visualizations, and a comparison against the Image-GPS-Address pipeline, and discuss how multimodal LLMs could support future improvements and interactive geographic reasoning.

Abstract

In this study, we introduce a new problem raised by social media and photojournalism, named Image Address Localization (IAL), which aims to predict the readable textual address where an image was taken. Existing two-stage approaches involve predicting geographical coordinates and converting them into human-readable addresses, which can lead to ambiguity and be resource-intensive. In contrast, we propose an end-to-end framework named AddressCLIP to solve the problem with more semantics, consisting of two key ingredients: i) image-text alignment to align images with addresses and scene captions by contrastive learning, and ii) image-geography matching to constrain image features with the spatial distance in terms of manifold learning. Additionally, we have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem. Experiments demonstrate that our approach achieves compelling performance on the proposed datasets and outperforms representative transfer learning methods for vision-language models. Furthermore, extensive ablations and visualizations exhibit the effectiveness of the proposed method. The datasets and source code are available at https://github.com/xsx1001/AddressCLIP.

Paper Structure (37 sections, 8 equations, 14 figures, 14 tables)

Figures (14)

  • Figure 1: Comparison of image-based geo-localization and address localization tasks. The objective of the proposed task is to predict the semantic text address of a given image instead of a digital GPS coordinate without the need for a retrieval gallery.
  • Figure 2: The problem statement of the image address localization task consists of (a) examples of administrative address and hierarchy, (b) semantic address partition strategy, and (c) address predicting using visual-language models.
  • Figure 3: Overview of the proposed AddressCLIP framework. (a) During training, the alignment of image and address is learned by the image-address contrastive loss, image-caption contrastive loss, and image-geography matching loss. (b) At inference, the address with the highest similarity to the query image’s embedding is chosen.
  • Figure 4: Visualizations of the introduced datasets. Distinct semantic street partitions are displayed using varying colors.
  • Figure 5: Performance of different backbones on the proposed datasets.
  • ...and 9 more figures
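
The inference step described in the Figure 3(b) caption, choosing the address whose embedding is most similar to the query image's embedding, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the similarity measure, and the function names, street names, and 3-dimensional embeddings are hypothetical placeholders for the outputs of trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def predict_address(image_embedding, address_embeddings):
    """Return the (address, score) pair with the highest similarity to the image.

    address_embeddings maps each candidate address string to its text embedding.
    """
    if not address_embeddings:
        raise ValueError("at least one candidate address is required")
    scores = {
        address: cosine_similarity(image_embedding, embedding)
        for address, embedding in address_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy embeddings; real ones would come from the trained encoders.
candidates = {
    "Forbes Avenue": [0.9, 0.1, 0.0],
    "Fifth Avenue": [0.1, 0.9, 0.1],
    "Market Street": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

best_address, best_score = predict_address(query, candidates)
print(best_address, round(best_score, 3))
```

Because the candidate addresses are embedded as text, the set of candidates can be precomputed once and reused for every query image, which is what removes the need for an image retrieval gallery in this sketch.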