UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models
Yujie Li, Wenjia Xu, Guangzuo Li, Zijian Yu, Zhiwei Wei, Jiuniu Wang, Mugen Peng
TL;DR
UniRS tackles the problem of fragmented generalization in remote sensing by unifying multi-temporal tasks—single image, dual-time image pair, and video—into a single vision-language framework. It introduces a unified visual embedding, a Change Extraction module for dual-time inputs, a prompt augmentation mechanism that leverages a base VLM to guide reasoning, and joint instruction tuning on a mixed RS dataset. The approach achieves state-of-the-art performance across RSVQA, LEVIR-CC change captioning, and ERA video classification, while demonstrating the value of cross-task knowledge sharing and temporal specialization. The work promises practical impact by enabling versatile, instructions-driven remote sensing analysis with a single model and a shared knowledge base.
Abstract
The domain gap between remote sensing imagery and natural images has recently received widespread attention and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce \textbf{UniRS}, the first vision-language model \textbf{uni}fying multi-temporal \textbf{r}emote \textbf{s}ensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.
