Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

Sai Ma, Zhuang Li, John A Taylor

TL;DR

This paper introduces Landsat30-AU, the first large-scale vision-language dataset built entirely from 30-meter Landsat imagery across four missions (5, 7, 8, 9) spanning 1988–2024 in Australia. It presents a bootstrapped, multi-stage pipeline that leverages generic VLMs, fine-tuning, and human verification to produce Landsat-aligned image captions and eight-domain VQA items, yielding two sub-datasets: Landsat30-AU-Cap with 196,262 captions and Landsat30-AU-VQA with 17,725 VQA samples. Benchmark results reveal that off-the-shelf models underperform on Landsat data, but lightweight fine-tuning (e.g., Qwen2.5-VL-7B) substantially improves captioning and VQA performance (SPIDEr up to ~0.31 and overall VQA accuracy up to ~0.87). The benchmark also shows that specialized remote-sensing VLMs generalize poorly to Landsat imagery, and the paper argues that sensor diversity and temporal depth are critical for robust long-term Earth monitoring. Overall, Landsat30-AU provides a solid foundation for budget-friendly, bias-robust Earth observation with VLMs and highlights key areas for future model development in low-resolution, multi-decadal satellite imagery.

Abstract

Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.
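
To put the fine-tuning numbers in the abstract on a common footing, the short sketch below works out the absolute and relative gains for Qwen2.5-VL-7B. The four scores are the ones quoted above; the `gains` helper and the `reported` dictionary are illustrative names introduced here, not part of the paper's released code.

```python
# Scores quoted in the abstract for Qwen2.5-VL-7B: (before, after) fine-tuning.
reported = {
    "captioning SPIDEr": (0.11, 0.31),
    "VQA accuracy": (0.74, 0.87),
}


def gains(before, after):
    """Return (absolute gain, gain as a fraction of the baseline score)."""
    absolute = after - before
    return absolute, absolute / before


for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.2f} absolute, +{relative:.0%} relative")
# captioning SPIDEr: +0.20 absolute, +182% relative
# VQA accuracy: +0.13 absolute, +18% relative
```

Read this way, captioning improves far more in relative terms than VQA does, partly because its baseline score is so low; the two metrics are on different scales, so the percentages are not directly comparable.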

Paper Structure

This paper contains 74 sections, 14 figures, 17 tables.

Figures (14)

  • Figure 1: Overview of the Landsat30-AU dataset construction pipeline. Stage 1: Sources Landsat imagery and collects metadata. Stage 2: Adapts VLMs into specialized modules for region classification, caption generation, and review. Stage 3: Produces large-scale annotations via iterative VLM refinement and human verification.
  • Figure 2: Examples of the human verification process. (a) A correct caption is kept. (b) An incorrect answer is fixed.
  • Figure 3: Landsat30-AU-VQA categories. Representative 30-meter Landsat imagery illustrating the visual characteristics of each of the eight VQA domains.
  • Figure 4: More caption and VQA human-verification examples.
  • Figure 5: Caption Review.
  • ...and 9 more figures