TimeSenCLIP: A Time Series Vision-Language Model for Remote Sensing Using Single-Pixel

Pallavi Jain, Diego Marcos, Dino Ienco, Roberto Interdonato, Tristan Berchoux

TL;DR

TimeSenCLIP addresses the need for open-vocabulary, scalable remote sensing (RS) analysis by aligning Sentinel-2 multispectral time series with geo-tagged ground-level imagery through cross-view contrastive learning, avoiding caption supervision. The model uses a frozen ground-level CLIP encoder with attention pooling and a trainable transformer for the spectral-temporal satellite stream, trained with a memory-bank-based InfoNCE objective. Evaluated on land-use/land-cover, habitat, crop type, bioregion, and scenicness tasks using the LUCAS/Sen4Map datasets, TimeSenCLIP consistently outperforms prior CLIP-based RS models, with single-pixel time series often matching or exceeding the performance of larger patches. The findings indicate that temporal dynamics and spectral information can drive semantic understanding in medium-resolution RS, offering a computationally efficient path to open-vocabulary RS tasks and scalable monitoring, while highlighting the value of temporal augmentation and prompt design in zero-shot settings.
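
To make the memory-bank-based InfoNCE objective more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the one-directional (satellite-to-ground) form of the loss, the temperature of 0.07, the queue length, and the function name `info_nce_with_queue` are all assumptions made for illustration. The paired ground-level embedding acts as the positive, while the other ground embeddings in the batch and the entries of the memory queue act as negatives.

```python
import math
from collections import deque


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_with_queue(sat_embs, ground_embs, queue, temperature=0.07):
    """Mean InfoNCE loss for a batch of (satellite, ground) pairs.

    Hypothetical helper: each satellite embedding is scored against its
    paired ground embedding (the positive), the other ground embeddings in
    the batch, and every embedding in the memory queue (extra negatives).
    """
    sat = [l2_normalize(v) for v in sat_embs]
    ground = [l2_normalize(v) for v in ground_embs]
    candidates = ground + [l2_normalize(v) for v in queue]

    losses = []
    for i, s in enumerate(sat):
        logits = [dot(s, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-softmax of the positive, which sits at index i.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 3-d embeddings: two aligned pairs plus a queue of older ground features.
queue = deque([[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]], maxlen=4)
sat_batch = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]]
ground_batch = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0]]

aligned = info_nce_with_queue(sat_batch, ground_batch, queue)
swapped = info_nce_with_queue(sat_batch, ground_batch[::-1], queue)

# After a step, the current ground features are pushed into the queue;
# the oldest entries drop out once maxlen is reached.
queue.extend(ground_batch)
```

In this toy setup the loss for correctly paired embeddings is much lower than the loss when the pairs are swapped. Because the ground-level encoder is frozen, queued ground features do not go stale during training, which is one plausible reason a queue of past features is a cheap source of extra negatives.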

Abstract

Vision-language models (VLMs) have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face several key challenges, such as their dependence on caption-based supervision, which is often unavailable or covers only limited semantics, and their adaptation from generic VLM architectures suited to very-high-resolution images. Consequently, these models tend to prioritize spatial context over spectral and temporal information, limiting their effectiveness for medium-resolution remote sensing imagery. In this work, we present TimeSenCLIP, a lightweight VLM for remote sensing time series that uses a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery, without requiring textual annotations. Unlike prior VLMs, TimeSenCLIP emphasizes temporal and spectral signals over spatial context, investigating whether single-pixel time series contain sufficient information to solve a variety of tasks.

Paper Structure

This paper contains 54 sections, 9 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: TimeSenCLIP Pipeline Illustration: Sentinel-2 single-pixel multispectral time series are aligned with geo-tagged ground-level images through cross-view learning, enabling the model to capture fine-grained ecological semantics without relying on large spatial context or text supervision.
  • Figure 2: TimeSenCLIP Model Training: Spectral-temporal patches from Sentinel-2 are aligned with ground-level CLIP features using contrastive learning. The satellite encoder learns from minimal spatial input via a transformer, while a memory queue enables efficient negative sampling.
  • Figure 3: Land cover class distribution across the EU in the evaluation dataset.
  • Figure 4: The infographic presents the comprehensive evaluation workflow of the TimeSenCLIP model. The pipeline integrates three key components: (1) zero-shot classification, where the trained TimeSenCLIP (time series encoder) and the CLIP text encoder (text prompt encoder) are used to perform inference across multiple downstream tasks; (2) image-to-image retrieval, where satellite and ground-level image embeddings are generated using the CLIP image encoder, and similarity scores are computed in both Satellite-to-Ground (S2G) and Ground-to-Satellite (G2S) directions to assess class-consistent retrieval performance; and (3) scenicness assessment, which evaluates perceptual and aesthetic qualities of landscapes using the learned visual embeddings.
  • Figure 5: Normalized confusion matrices for TimeSenCLIP, showing the top 8 predicted classes for LULC and the top 10 predicted classes for Crops and Habitat. Classes beyond these top predictions are grouped as "Other". All evaluations are performed in a zero-shot setting on monthly, single-pixel TimeSenCLIP inputs using descriptive text prompts.
  • ...and 5 more figures
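
The zero-shot classification and retrieval steps described in the Figure 4 caption both reduce to ranking embeddings by similarity. The sketch below illustrates that idea with toy 3-d vectors. Cosine similarity is assumed as the score, and the embeddings, class names, gallery ids, and the helpers `zero_shot_classify` and `retrieve_top_k` are hypothetical stand-ins, not outputs or functions of the actual model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def zero_shot_classify(sat_emb, prompt_embs):
    """Return the class whose text-prompt embedding is closest to sat_emb."""
    return max(prompt_embs, key=lambda name: cosine(sat_emb, prompt_embs[name]))


def retrieve_top_k(query_emb, gallery, k=1):
    """Return the ids of the k gallery embeddings most similar to the query."""
    ranked = sorted(
        gallery,
        key=lambda gid: cosine(query_emb, gallery[gid]),
        reverse=True,
    )
    return ranked[:k]


# Stand-ins for text-encoder outputs, one embedding per class prompt.
prompt_embs = {
    "cropland": [1.0, 0.0, 0.0],
    "woodland": [0.0, 1.0, 0.0],
    "water": [0.0, 0.0, 1.0],
}

# Stand-in for the time series encoder output of one satellite pixel.
sat_emb = [0.2, 0.9, 0.1]

# Stand-ins for ground-level image embeddings (satellite-to-ground gallery).
ground_gallery = {
    "g1": [0.9, 0.1, 0.0],
    "g2": [0.1, 0.95, 0.05],
    "g3": [0.0, 0.0, 1.0],
}

predicted = zero_shot_classify(sat_emb, prompt_embs)
s2g_top2 = retrieve_top_k(sat_emb, ground_gallery, k=2)
```

Ground-to-satellite retrieval would use the same `retrieve_top_k` call with a ground-level embedding as the query and a gallery of satellite embeddings. Class-consistent retrieval can then be scored by checking whether the retrieved items share the query's label.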