Towards Temporal Change Explanations from Bi-Temporal Satellite Images
Ryo Tsujimoto, Hiroki Ouchi, Hidetaka Kamigaito, Taro Watanabe
TL;DR
Explores generating temporal-change explanations from bi-temporal satellite images using LVLMs, which accept only a single image as input. Proposes three prompting strategies (All-at-Once, Step-by-Step, and Hybrid) and evaluates them on the LEVIR-CC dataset with an automatic noun-coverage metric and manual ratings of truthfulness and informativeness. Finds that Step-by-Step prompting yields the strongest explanations overall, while All-at-Once prompting with GPT-4V excels in truthfulness and informativeness and Hybrid prompting offers strong coverage. Highlights over-explanation as a challenge when changes are minimal, and suggests future work on refined evaluation, task-tailored preprocessing, and multi-image LVLMs for richer temporal explanations.
Abstract
Explaining temporal changes between satellite images taken at different times is important for urban planning and environmental monitoring. However, manually constructing datasets for this task is costly, so human-AI collaboration is promising. As a step in this direction, we investigate the ability of Large-scale Vision-Language Models (LVLMs) to explain temporal changes between satellite images. While LVLMs are known to generate good image captions, they receive only a single image as input. To handle a pair of satellite images as input, we propose three prompting methods. Human evaluation shows that our step-by-step reasoning-based prompting is effective.
