UniTS: Unified Time Series Generative Model for Remote Sensing
Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia
TL;DR
This work addresses the lack of a unified model for diverse remote-sensing time-series tasks by introducing UniTS, a flow-matching-based generative framework built on a diffusion transformer with spatiotemporal blocks. It jointly models time-series reconstruction, cloud removal, semantic change detection, and forecasting, using an Adaptive Condition Injector (ACor) to handle multimodal conditioning and a Spatiotemporal-aware Modulator (STM) to capture spatiotemporal dependencies. The authors also present TS-S12 and TS-S12CR, two high-quality multimodal benchmarks for time-series cloud removal and forecasting in realistic settings. On all four tasks, UniTS outperforms task-specific and foundation models and shows strong generative and cognitive capabilities even under heavy cloud cover or missing modalities, highlighting its potential for practical Earth observation and future world-model development.
Abstract
One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free image time series, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to individual tasks and lack unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatiotemporal blocks, in which we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of the spatiotemporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap in benchmark datasets for time series cloud removal and forecasting. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly under severe cloud contamination, modality absence, and phenological variation in forecasting.
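
To make the phrase "a deterministic evolution path from noise to targets" concrete, the following is a minimal sketch of conditional flow matching on toy vectors. It is not the UniTS implementation: the abstract does not specify the interpolation path, so a straight-line path between noise and target is assumed here, and every name (interpolate, flow_matching_loss, euler_sample, oracle_velocity) is hypothetical. In place of a trained diffusion transformer, the sketch uses a toy oracle velocity field whose condition is the target itself, which is only possible in this illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight line from noise x0 to target x1.

    Assumption: a linear path; the abstract does not state which path UniTS uses.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, which is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, cond, rng):
    """One-sample estimate of the flow matching regression loss.

    predict_velocity(x_t, t, cond) stands in for the conditional network.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time in [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t, cond)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def euler_sample(predict_velocity, x0, cond, steps=10):
    """Deterministically integrate dx/dt = v(x, t, cond) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x, t, cond):
    """Toy stand-in for a trained model: the condition is the target itself,
    so the exact straight-line velocity (cond - x) / (1 - t) is available."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.2, 0.5, 0.9]  # e.g. three cloud-free pixel values (made up)
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    print(euler_sample(oracle_velocity, noise, target, steps=8))
    print(flow_matching_loss(oracle_velocity, target, target, rng))
```

In an actual model, the condition would be the task-specific input (for example cloudy observations or auxiliary modalities) rather than the target, and predict_velocity would be a learned network trained by minimizing this loss over many samples. Sampling then requires only the deterministic integration loop, with randomness confined to the initial noise.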
