
TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval

David G. Shatwell, Sirnam Swetha, Mubarak Shah

Abstract

Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, location, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and propose TIGeR, a unified framework for Time, Images and Geo-location Retrieval. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By preserving the underlying location identity despite large appearance changes, TIGeR enables retrieval based on where and when a scene was captured, rather than purely on visual similarity. To support this task, we design a multistage data curation pipeline and propose a new, diverse dataset of 4.5M image-location-time triplets for training and 86k high-quality triplets for evaluation. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time-aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.
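
To make the retrieval formulation concrete, here is a minimal sketch of geo-time-aware retrieval as nearest-neighbor search in a shared embedding space. The callables `embed_image`, `embed_time`, and `fuse` are hypothetical stand-ins for TIGeR's encoders and fusion step, not names taken from the paper or its code.

```python
# Hedged sketch: rank a gallery by cosine similarity to a fused
# (query image, target time) embedding. `embed_image`, `embed_time`,
# and `fuse` are hypothetical placeholders for TIGeR's components.
import numpy as np

def retrieve(query_image, target_time, gallery_embeddings,
             embed_image, embed_time, fuse, k=5):
    """Return indices of the top-k gallery images closest to the fused query."""
    q = fuse(embed_image(query_image), embed_time(target_time))
    q = q / np.linalg.norm(q)  # unit-normalize so dot product = cosine similarity
    g = gallery_embeddings / np.linalg.norm(gallery_embeddings, axis=1, keepdims=True)
    scores = g @ q             # similarity of every gallery item to the query
    return np.argsort(-scores)[:k]
```

Under this view, single-modality queries (image only, or time only) would use the same scoring with the corresponding unimodal embedding in place of the fused one.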

Figures (12)

  • Figure 1: TIGeR unifies time, image, and location understanding, enabling geo-localization, time prediction, and geo-time-aware image retrieval: given a query image and a target time, it retrieves an image captured at the same location at the specified time.
  • Figure 2: Architecture of TIGeR. Given an image $I$, GPS coordinates $l$, and timestamp $t$, modality-specific encoders produce visual ($V$), location ($L$), and temporal ($T$) embeddings. Individual and pairwise modality tokens are passed through a shared multi-modal transformer $\mathcal{F}(\cdot)$, which employs self-attention to learn geo-temporal interactions. Pooled embeddings are projected into a shared embedding space, where a contrastive loss aligns unimodal and fused representations across all modality pairings (e.g., $V$--$L$, $V$--$T$, $L$--$T$); a sketch of this pairwise alignment follows this list. In parallel, classification losses with metric targets on discretized location and time classes encourage the model to learn hierarchical cross-modal representations that vary smoothly across the embedding space.
  • Figure 3: Benchmark dataset curation pipeline. Starting from the AMOS corpus, we randomly sample 1,255 static cameras with broad global coverage and identify diverse corruption modes. We manually label 4,000 images for quality, train a semi-supervised image quality classifier on frozen visual features, and apply it to 8M unlabeled frames to predict a quality score $P(H|I)$. Using thresholds $T_H{=}0.7$ and $T_L{=}0.4$, images are partitioned into high-, medium-, and low-quality subsets, and low-quality frames are discarded (a toy sketch of this thresholding follows this list). High-quality images are then used to construct a geographically balanced test set by sampling cameras across $10^\circ \times 10^\circ$ latitude-longitude bins and retaining only cameras with at least 500 frames spanning a year, while the remaining high- and medium-quality images form the training set. The resulting benchmark contains 4.5M training images and 86k test images, with no camera overlap between splits.
  • Figure 4: Dataset statistics and distributions. Top: Geographical distribution of camera locations across the world in the proposed dataset, showing wide coverage across multiple continents. Left: Time-of-year (month) distribution of the captured images. Right: Time-of-day (hour) distribution of the captured images.
  • Figure 5: Qualitative results on geo-time aware image retrieval with images from unseen locations. Given a query image $I^Q$ and a target time $t^Q$ (shown above each column), the goal is to retrieve an image captured at the same location as the query but at the specified target time. For visualization, the actual time-of-capture for each image is overlaid on top of it. Each row shows top-1 retrieval results from different methods. While prior approaches (Zhai et al. [zhai2019learning], GT-Loc [Shatwell_2025_ICCV]) often retrieve images captured at incorrect times, our TIGeR model accurately retrieves images corresponding to the target time while maintaining location consistency. Best viewed in color.
  • ...and 7 more figures
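
As referenced in the Figure 2 caption, below is a minimal sketch of the pairwise contrastive alignment across the $V$, $L$, and $T$ embeddings. The use of a symmetric InfoNCE objective and the temperature value are assumptions on our part; the paper's exact loss formulation may differ.

```python
# Illustrative sketch (not the authors' code) of the pairwise contrastive
# alignment in Figure 2, assuming a symmetric InfoNCE loss with temperature tau.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings, each of shape [B, D]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                            # [B, B] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def pairwise_alignment_loss(V, L, T):
    """Sum the contrastive loss over the V-L, V-T, and L-T pairings."""
    return info_nce(V, L) + info_nce(V, T) + info_nce(L, T)
```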
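
Likewise, as referenced in the Figure 3 caption, here is a toy sketch of the quality-based partitioning. The thresholds $T_H{=}0.7$ and $T_L{=}0.4$ come from the caption; the `quality_score` interface is a hypothetical stand-in for the semi-supervised classifier's $P(H|I)$ output.

```python
# Toy sketch of the quality partitioning in Figure 3. Thresholds are from the
# caption; `quality_score(frame)` is a hypothetical stand-in for P(H|I).
T_H, T_L = 0.7, 0.4

def partition(frames, quality_score):
    """Split frames into high/medium/low quality; low-quality frames are discarded."""
    high   = [f for f in frames if quality_score(f) >= T_H]
    medium = [f for f in frames if T_L <= quality_score(f) < T_H]
    low    = [f for f in frames if quality_score(f) < T_L]  # dropped from the dataset
    return high, medium, low
```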