UniTS: Unified Time Series Generative Model for Remote Sensing

Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia

TL;DR

This work addresses the lack of a unified model for diverse remote-sensing time-series tasks by introducing UniTS, a flow-matching-based generative framework built on a diffusion transformer with spatiotemporal blocks. It jointly models time-series reconstruction, cloud removal, semantic change detection, and forecasting, using an Adaptive Condition Injector (ACor) to handle multimodal conditioning and a Spatiotemporal-aware Modulator (STM) to capture spatiotemporal dependencies. The authors also present TS-S12 and TS-S12CR, two multimodal time-series datasets that serve as benchmarks for cloud removal and forecasting. Across all four tasks, UniTS outperforms task-specific and foundation models, showing strong generative and cognitive capabilities even under severe cloud contamination or modality absence, which highlights its potential for practical Earth observation and future world-model development.

Abstract

One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
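
The flow matching idea in the abstract, a deterministic path from noise to target followed under a condition, can be illustrated with a minimal NumPy sketch. It assumes the common straight-line (rectified-flow style) path and plain Euler integration; the abstract does not state which path or solver UniTS uses. The names `flow_matching_pair`, `euler_sample`, and `oracle_velocity`, as well as the tensor shapes, are hypothetical, and the oracle stands in for the trained network.

```python
import numpy as np

def flow_matching_pair(noise, target, t):
    """Training pair for a straight-line path (an assumption, not confirmed by the paper).

    Returns the point x_t on the path from noise (t=0) to target (t=1)
    and the constant velocity a network would be regressed onto.
    """
    x_t = (1.0 - t) * noise + t * target
    velocity = target - noise
    return x_t, velocity

def euler_sample(velocity_fn, noise, cond, num_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

rng = np.random.default_rng(0)
# Hypothetical shapes: (frames, bands, height, width).
target = rng.normal(size=(4, 3, 8, 8))
noise = rng.normal(size=target.shape)
cond = None  # placeholder for task-specific conditions (e.g. cloudy inputs)

def oracle_velocity(x, t, cond):
    """Stand-in for a trained velocity network: the exact field for one known target."""
    return (target - x) / (1.0 - t)

sample = euler_sample(oracle_velocity, noise, cond, num_steps=10)
```

With the oracle field, the Euler trajectory stays on the straight path and lands on `target`; a trained model would replace `oracle_velocity` with a network that sees only `x`, `t`, and the condition.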

Paper Structure

This paper contains 21 sections, 8 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: The framework of UniTS applied to four remote sensing time series tasks. UniTS takes the conditions and noise from different time series tasks as input, and through a trained complete velocity field, gradually samples from the noise distribution to the target data distribution, producing time series outputs including cloud-free time series, semantic change maps, future predictions, and more.
  • Figure 2: Left: TS-S12 dataset; middle: Geographical distribution of ROIs; right: TS-S12CR dataset.
  • Figure 3: The framework of UniTS, taking the time series cloud removal task as an example. (a) UniTS architecture, (b) Adaptive Condition Injector (ACor), (c) Spatiotemporal-aware Modulator (STM).
  • Figure 4: UniTS inference workflows for different time series tasks: (a) Multi-frame prediction for time series reconstruction and time series cloud removal, (b) Multi-frame prediction for time series semantic change detection, where class-specific maps are generated for each output frame, (c) Autoregressive multi-frame prediction for time series forecasting. Historical sequences and random noise are jointly fed into UniTS to predict the initial future frame. The predicted frames are then recursively used as conditions for subsequent time steps, progressively generating the full future sequence. To align with the temporal evolution characteristics of forecasting tasks, the input and spatio-temporal block are correspondingly adapted during both training and inference.
  • Figure 5: Qualitative comparison of time series reconstruction on the TS-S12 dataset, showing the RGB bands of Sentinel-2. The SSIM value of each frame in the time series is marked.
  • ...and 6 more figures
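
The autoregressive forecasting workflow described in the Figure 4(c) caption, where each predicted frame is fed back as a condition for the next step, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `autoregressive_forecast` and `toy_predict_next` are hypothetical names, the toy predictor is a linear extrapolation standing in for a full UniTS sampling pass, and keeping the entire growing context (rather than a sliding window) is an assumption.

```python
import numpy as np

def autoregressive_forecast(predict_next, history, horizon, rng):
    """Roll a one-step predictor forward, feeding predictions back as conditions.

    predict_next(context, noise) -> next frame; context has shape (T, C, H, W).
    """
    context = list(history)
    predictions = []
    for _ in range(horizon):
        # Fresh noise per step, as in a generative sampler.
        noise = rng.normal(size=history[0].shape)
        frame = predict_next(np.stack(context), noise)
        predictions.append(frame)
        context.append(frame)  # the prediction becomes part of the condition
    return np.stack(predictions)

def toy_predict_next(context, noise):
    """Hypothetical stand-in for the model: linear extrapolation, ignores the noise."""
    return 2.0 * context[-1] - context[-2]

rng = np.random.default_rng(0)
# Two historical frames with constant values 0 and 1, shape (C, H, W) = (1, 2, 2).
history = [np.zeros((1, 2, 2)), np.ones((1, 2, 2))]
preds = autoregressive_forecast(toy_predict_next, history, horizon=3, rng=rng)
```

Because every step conditions on earlier predictions, errors can accumulate over the horizon, which is the usual trade-off of autoregressive rollouts compared with predicting all future frames at once.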