Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

Sai Ma; Zhuang Li; John A Taylor

TL;DR

This paper introduces Landsat30-AU, the first large-scale vision-language dataset built entirely from 30-meter Landsat imagery, covering Australia across four missions (Landsat 5, 7, 8, and 9) from 1988 to 2024. A bootstrapped, multi-stage pipeline combines generic VLMs, fine-tuning, and human verification to produce Landsat-aligned image captions and VQA items across eight domains, yielding two sub-datasets: Landsat30-AU-Cap with 196,262 image-caption pairs and Landsat30-AU-VQA with 17,725 VQA samples. Benchmark results show that off-the-shelf models underperform on Landsat data, while lightweight fine-tuning of Qwen2.5-VL-7B substantially improves both tasks: SPIDEr rises from 0.11 to 0.31 and overall VQA accuracy from 0.74 to 0.87. The results also indicate that specialized remote-sensing VLMs generalize poorly to this imagery (EarthDial reaches only 0.07 SPIDEr and 0.48 VQA accuracy), and that sensor diversity and temporal depth are critical for robust long-term Earth monitoring. Overall, Landsat30-AU provides a foundation for budget-friendly, bias-robust Earth observation with VLMs and highlights directions for future model development on low-resolution, multi-decadal satellite imagery.

Abstract

Vision-language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both components are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.
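
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how such scores are typically aggregated. It is not the authors' evaluation code. SPIDEr is commonly defined as the equal-weight mean of CIDEr and SPICE; whether the benchmark uses exactly this weighting is an assumption here. The VQA scorer assumes case-insensitive exact match on answer strings, and the record fields (`domain`, `prediction`, `answer`) and the placeholder domain names are hypothetical.

```python
from collections import defaultdict


def spider(cider: float, spice: float) -> float:
    """Combine caption metrics as the equal-weight mean of CIDEr and SPICE.

    Assumption: equal weighting, as in the common definition of SPIDEr.
    """
    return 0.5 * (cider + spice)


def vqa_accuracy(records):
    """Return (overall accuracy, per-domain accuracy) for VQA records.

    Each record is a dict with hypothetical keys 'domain', 'prediction',
    and 'answer'. A prediction counts as correct when it matches the
    reference answer after stripping whitespace and lowercasing.
    """
    if not records:
        raise ValueError("records must not be empty")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["domain"]] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            hits[rec["domain"]] += 1
    per_domain = {d: hits[d] / totals[d] for d in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_domain


if __name__ == "__main__":
    # Toy records with placeholder domains; not taken from the dataset.
    toy = [
        {"domain": "domain_a", "prediction": "A", "answer": "A"},
        {"domain": "domain_a", "prediction": "B", "answer": "C"},
        {"domain": "domain_b", "prediction": "D", "answer": "D"},
    ]
    overall, per_domain = vqa_accuracy(toy)
    print(f"overall={overall:.2f}", per_domain)
```

Note that the overall figure here is a micro-average over all samples, so domains with more questions weigh more; a macro-average over the per-domain values would be the alternative if domains should count equally.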
