What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

Dongheng Lin, Han Hu, Jianbo Jiao

TL;DR

This work tackles the problem of teaching neural networks to acquire time awareness from static images. It introduces the Time-Oriented Collection (TOC), a large dataset of reliably timestamped images, and Time-Image Contrastive Learning (TICL), a cross-modal framework that aligns image representations with learnable time embeddings. On the image side, TICL uses a frozen CLIP backbone followed by an Image-Time Adaptor; on the time side, a Time Encoder embeds the time labels. TICL achieves state-of-the-art timestamp estimation on TOC and demonstrates that its time-aware embeddings benefit downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. The findings indicate that time-related visual cues can be learned from static images and that these embeddings provide practical priors for broader vision tasks, offering a foundation for future exploration of time-aware visual context.

Abstract

Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: *what time tells us?* To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning. We find that TICL 1) achieves state-of-the-art performance on the timestamp estimation task across various benchmark metrics, and 2) interestingly, although it sees only static images, learns time-aware embeddings that show strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context. Project page: https://rathgrith.github.io/timetells_release/
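The TICL setup summarized above (a frozen image backbone plus an Image-Time Adaptor on one side, a Time Encoder over one-hot time labels on the other, and a contrastive loss aligning matching pairs) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-linear-layer adaptor and time encoder, the 24 hourly classes, the 512/128 dimensions, the 0.07 temperature, and every name below (`W_adapt`, `W_time`, `image_tower`, `time_tower`, `predict_hour`) are hypothetical choices for illustration, and random vectors stand in for frozen CLIP features.

```python
import numpy as np

NUM_HOURS = 24   # assumption: time labels are the 24 hours of the day, one-hot encoded
FEAT_DIM = 512   # assumption: width of the frozen CLIP image feature
EMB_DIM = 128    # hypothetical width of the shared image-time space

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Hypothetical trainable parameters (training loop omitted); the real modules may be deeper.
W_adapt = rng.normal(scale=0.02, size=(FEAT_DIM, EMB_DIM))  # stand-in Image-Time Adaptor
W_time = rng.normal(scale=0.02, size=(NUM_HOURS, EMB_DIM))  # stand-in Time Encoder

def image_tower(clip_feats):
    # clip_feats: (B, FEAT_DIM) features from a frozen image encoder (random stand-ins here)
    return l2_normalize(clip_feats @ W_adapt)

def time_tower(hours):
    one_hot = np.eye(NUM_HOURS)[hours]        # (B, NUM_HOURS) one-hot hour labels
    return l2_normalize(one_hot @ W_time)     # (B, EMB_DIM)

def log_softmax(logits, axis):
    m = logits.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    return logits - m - np.log(np.exp(logits - m).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, time_emb, temperature=0.07):
    # Symmetric InfoNCE (CLIP-style): image i should match time embedding i.
    logits = img_emb @ time_emb.T / temperature           # (B, B) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> time direction
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # time -> image direction
    return (loss_i2t + loss_t2i) / 2

def predict_hour(clip_feats):
    # Estimate each image's hour as the time embedding with the highest cosine similarity.
    all_hours = time_tower(np.arange(NUM_HOURS))          # (24, EMB_DIM)
    return (image_tower(clip_feats) @ all_hours.T).argmax(axis=1)

clip_feats = rng.normal(size=(8, FEAT_DIM))   # stand-in for frozen CLIP features
hours = rng.integers(0, NUM_HOURS, size=8)
print(contrastive_loss(image_tower(clip_feats), time_tower(hours)))
print(predict_hour(clip_feats))
```

Because the labels are one-hot hours, two images from the same hour in one batch get identical time embeddings; this simple version still treats them as negatives for each other, which a fuller implementation might handle with multi-positive targets.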

Paper Structure

This paper contains 60 sections, 11 equations, 35 figures, 13 tables.

Figures (35)

  • Figure 1: An overview of our study, in which we present a new high-quality dataset for time-of-day estimation (a), based on which we propose a new approach that achieves state-of-the-art performance (b). We further explore the implications of the learned time-aware embeddings (c), showing their effectiveness on several time-related downstream tasks.
  • Figure 2: Overview of TICL. Given static images and one-hot time labels, two encoders (Time Encoder and image encoder + ITA) project inputs into a shared feature space; a contrastive loss aligns the corresponding pairs.
  • Figure 3: Sample images and metadata from the TOC dataset, located by their GPS coordinates. The metadata contains several fields indicating timestamps and geolocations. The samples span all continents and follow the natural distribution of internet images; the southern hemisphere has relatively fewer samples because fewer photos are captured there.
  • Figure 4: Confusion matrices, giving a more detailed comparison across all 24 hours on our TOC test set (top) and the AMOS test set (bottom).
  • Figure 5: Recall@k for time-based image retrieval.
  • ...and 30 more figures
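The Recall@k metric used for time-based image retrieval can be sketched as follows. This assumes the common "hit rate" convention: a query counts as a hit if any of its k most similar gallery items (by cosine similarity) carries the query's hour label, and Recall@k is the fraction of hits. The function name `recall_at_k` and the toy data are hypothetical, and the paper's exact protocol (e.g. query/gallery split, whether neighbouring hours count as matches) may differ.

```python
import numpy as np

def recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k):
    """Fraction of queries whose top-k gallery neighbours include the same hour label."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                                    # (num_queries, num_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]           # indices of the k most similar items
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 2-D embeddings with hour labels.
gallery_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
gallery_labels = np.array([6, 12, 18])
query_emb = np.array([[1.0, 0.1], [0.1, 1.0]])
query_labels = np.array([6, 18])

print([float(recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k))
       for k in (1, 2, 3)])  # → [0.5, 0.5, 1.0]
```

The first query finds its hour at rank 1; the second only reaches a same-hour image at rank 3, so Recall@k rises with k.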