UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models

Yujie Li; Wenjia Xu; Guangzuo Li; Zijian Yu; Zhiwei Wei; Jiuniu Wang; Mugen Peng

TL;DR

UniRS addresses the fragmented handling of visual input types in remote sensing by unifying multi-temporal tasks (single image, dual-time image pair, and video) in a single vision-language framework. It introduces a unified visual embedding, a Change Extraction module for dual-time inputs, a prompt augmentation mechanism that uses a base VLM to supply clues for reasoning, and joint instruction tuning on a mixed RS dataset. The approach achieves state-of-the-art performance on RSVQA, LEVIR-CC change captioning, and ERA video classification, while demonstrating the value of cross-task knowledge sharing and temporal specialization. In practice, this enables instruction-driven remote sensing analysis with a single model and a shared knowledge base.

Abstract

The domain gap between remote sensing imagery and natural images has recently received widespread attention, and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce UniRS, the first vision-language model unifying multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.

Paper Structure (24 sections, 12 equations, 5 figures, 10 tables)

Introduction
Related Work
Vision-Language Models
VLMs in Remote Sensing
Multi-task Learning
Methodology
Overview
UniRS Framework
UniRS Architecture
Prompt Augmentation Mechanism
Joint Instruction Tuning
Experiments
Implementation Details
Remote Sensing Visual Question Answering
Comparison on RSVQA-LR
...and 9 more sections

Figures (5)

Figure 1: Our UniRS is a framework unifying multi-temporal remote sensing tasks of various visual inputs within a single model. It can analyze three critical types of remote sensing visual inputs, i.e., single image, dual-time image pair, and video, under task instructions. Our research focuses on typical remote sensing tasks for each input type, including visual question answering, change captioning, and video classification.
Figure 2: The architecture of our UniRS. The left part of this figure includes the prompt augmentation mechanism and UniRS main architecture. UniRS is primarily composed of four components, i.e., visual encoder $\mathcal{E}_v$, multimodal projector $\mathcal{M}_{m}$, language module $\mathcal{M}_{l}$, and change extraction module $\mathcal{M}_{c}$. Here change extraction module $\mathcal{M}_{c}$ is designed for the dual-time image pair input to extract and enhance spatiotemporal relationship features between image pairs. During inference, all visual inputs $\bm{I}$ are encoded into visual features $F$ by the visual encoder $\mathcal{E}_v$. In the prompt augmentation mechanism, initial visual clues $P_c$ are obtained after parsing and merged with the task instruction $P_t$ to form the full prompt $P$. In UniRS, the multimodal projector $\mathcal{M}_{m}$ projects visual feature $F$ into the text feature space as visual embedding $E_{I}$, which is then combined with the text embedding $E_{P}$ and fed into the language module $\mathcal{M}_{l}$ to get the final answer $a$. The right part of this figure is the structure of the change extraction module $\mathcal{M}_{c}$.
Figure 3: The inference process of UniRS using the prompt augmentation mechanism. During the execution of remote sensing tasks, visual inputs are first processed by the base model, where clues $P_{c}$ are generated under the fixed prompts $P_g$ customized for each input type. These clues, special markers, task tags, and the task instruction $P_{t}$ are then merged to form the prompt $P$ input into UniRS. The model then generates the corresponding response $a$.
Figure 4: Qualitative results of our UniRS on visual question answering and change captioning. We compare our UniRS with other remote sensing VLMs on randomly selected samples. The incorrect responses are highlighted in red.
Figure 5: Qualitative results of our UniRS on video scene classification.
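
The prompt augmentation flow described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the marker strings, task tags, fixed-prompt wording, merge order, and the function names `generate_clues` and `build_prompt` are all assumptions made for illustration, and the base VLM is replaced by a stub that returns a canned clue.

```python
# Fixed prompts P_g per input type (hypothetical wording, not from the paper).
GUIDE_PROMPTS = {
    "image": "Describe the scene in this remote sensing image.",
    "pair": "Describe what changed between these two images.",
    "video": "Describe the event shown in this aerial video.",
}
# Special markers and task tags (hypothetical strings, not from the paper).
VISUAL_MARKERS = {"image": "<image>", "pair": "<image_pair>", "video": "<video>"}
TASK_TAGS = {"image": "[vqa]", "pair": "[change_caption]", "video": "[video_cls]"}


def generate_clues(base_model, visual_input, input_type):
    """Query the base VLM with the fixed prompt P_g to obtain clues P_c."""
    return base_model(visual_input, GUIDE_PROMPTS[input_type]).strip()


def build_prompt(input_type, clues, instruction):
    """Merge marker, task tag, clues P_c, and instruction P_t into prompt P.

    The merge order used here is an assumption.
    """
    if input_type not in VISUAL_MARKERS:
        raise ValueError(f"unknown input type: {input_type!r}")
    parts = [VISUAL_MARKERS[input_type], TASK_TAGS[input_type]]
    if clues:  # skip the clue line when the base model returns nothing
        parts.append(f"Clues: {clues}")
    parts.append(instruction)
    return "\n".join(parts)


def fake_base_model(visual_input, guide_prompt):
    """Stand-in for a general-purpose VLM; always returns the same clue."""
    return " a new road appears in the upper left "


clues = generate_clues(fake_base_model, None, "pair")
prompt = build_prompt("pair", clues, "Describe the changes between the two images.")
print(prompt)
```

In this sketch the resulting string plays the role of $P$, which would be embedded as $E_{P}$ and passed to the language module together with the visual embedding $E_{I}$.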
