Towards Temporal Change Explanations from Bi-Temporal Satellite Images

Ryo Tsujimoto, Hiroki Ouchi, Hidetaka Kamigaito, Taro Watanabe

TL;DR

Explores generating temporal-change explanations from bi-temporal satellite images with LVLMs, which accept only a single image as input. Proposes three prompting strategies (All-at-Once, Step-by-Step, and Hybrid) and evaluates them on the Levir-CC dataset using automatic noun coverage and manual truthfulness and informativeness scores. Finds that Step-by-Step prompting yields the strongest overall explanations, while All-at-Once with GPT-4V excels in truthfulness and informativeness and Hybrid prompting offers strong coverage. Highlights over-explanation as a challenge when changes are minimal, and suggests future work on refined evaluation, task-tailored preprocessing, and multi-image LVLMs for richer temporal explanations.

Abstract

Explaining temporal changes between satellite images taken at different times is important for urban planning and environmental monitoring. However, manual dataset construction for the task is costly, so human-AI collaboration is promising. Toward this direction, in this paper, we investigate the ability of Large-scale Vision-Language Models (LVLMs) to explain temporal changes between satellite images. While LVLMs are known to generate good image captions, they receive only a single image as input. To deal with a pair of satellite images as input, we propose three prompting methods. Through human evaluation, we found our step-by-step reasoning-based prompting to be effective.

Paper Structure (25 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Example of bi-temporal satellite images with their captions in Levir-CC; the left is the one before the change and the right is the one after the change.
  • Figure 2: Explaining temporal changes from bi-temporal SI using two types of prompting
  • Figure 3: Example of Truthfulness score of 1
  • Figure 4: Example of Informativeness score of 1
  • Figure 5: Example output for informativeness score of 1
  • ...and 1 more figures