Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li, Feng Gu, Yanglangxing He, Pengfeng Xiao, Xueliang Zhang

TL;DR

This work tackles the remote sensing (RS) data bottleneck for vision-language modeling by introducing ScoreRS, a learned quality-scoring model trained on large-scale RS-specific preference data across five quality dimensions. ScoreRS enables automated curation of high-quality image-text pairs: filtering training data with it yields better results for both CLIP fine-tuning and large VLM fine-tuning than full-data or CLIP-score baselines. The authors also demonstrate ScoreRS’s versatility as a reward model for reinforcement learning and as a Best-of-N selector at test time, with improvements on challenging RS benchmarks such as VG-DIOR and LHRS-Bench. The study highlights the importance of domain-specific data quality and provides open-source data, models, and prompts to foster RS-focused VLM development.

Abstract

Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic understanding. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks. Our code, model, and dataset are publicly available.
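The "top 30% of data ranked by our score model" step can be illustrated with a minimal sketch. This is not the authors' implementation: `select_top_fraction` and the `score_fn` callback are hypothetical names, and in practice `score_fn` would wrap a learned scorer such as ScoreRS, whose actual interface is not described here.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    samples: Sequence[T],
    score_fn: Callable[[T], float],
    fraction: float = 0.3,
) -> List[T]:
    """Keep the highest-scoring `fraction` of samples, best first.

    `score_fn` is a hypothetical stand-in for a quality scorer; any callable
    that maps a sample to a float works.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not samples:
        return []
    # Round to the nearest count, but always keep at least one sample.
    keep = max(1, round(len(samples) * fraction))
    ranked = sorted(samples, key=score_fn, reverse=True)
    return ranked[:keep]


# Toy usage: each sample carries a precomputed (made-up) quality score.
toy_samples = [
    {"id": i, "score": s}
    for i, s in enumerate([0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6, 0.5, 0.0])
]
curated = select_top_fraction(toy_samples, lambda s: s["score"], fraction=0.3)
```

The curated subset would then be used for fine-tuning in place of the full dataset; the same function can rank by CLIP score instead by swapping `score_fn`, which is how the two selection strategies can be compared at equal data budgets.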

Paper Structure

This paper contains 84 sections, 3 equations, 15 figures, 24 tables.

Figures (15)

  • Figure 1: Examples from RS vision-language datasets showing quality issues across five dimensions. Green represents reasonably good expression, while red represents low-quality expression
  • Figure 2: Pipeline for generating pairwise preference datasets and the training/application of our ScoreRS model. $I_i \in \mathcal{I}$ represents an RS image, and $T_i \in \mathcal{T}$ represents an image caption, question, or conversation associated with the image
  • Figure 3: Classification and retrieval results using different percentages of data selected by CLIP-Score and ScoreRS. Top-1 (@1) results shown as average scores across all datasets
  • Figure 4: Comparison with different BoN selectors.
  • Figure 5: Comparison of RL-trained models. Qwen2VL-7B-RS-Zero: Directly apply RL to Qwen2VL-7B-RS. Qwen2VL-7B-RS-SFT: Qwen2VL-7B-RS fine-tuned with our manually generated reasoning data. Qwen2VL-7B-RS-R1: RL applied to Qwen2VL-7B-RS-SFT. "w/o ScoreRS": variant trained without ScoreRS-based rewards
  • ...and 10 more figures
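
The Best-of-N (BoN) selection compared in Figure 4 can be sketched as follows, as an illustration only. The names `best_of_n`, `generate_fn`, and `score_fn` are hypothetical; in the paper's setting `generate_fn` would sample an answer from a VLM and `score_fn` would be a selector such as ScoreRS, but neither interface is specified in this section.

```python
from typing import Callable


def best_of_n(
    question: str,
    generate_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample `n` candidate answers and return the one the scorer ranks highest.

    Both callbacks are hypothetical placeholders for a generator model and a
    quality/reward scorer.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate_fn(question) for _ in range(n)]
    return max(candidates, key=lambda answer: score_fn(question, answer))


# Toy usage: a fixed list of "samples" and a scorer that prefers longer answers.
_toy_answers = iter(["a", "bbb", "bb", "b"])
picked = best_of_n(
    "How many ships are visible?",
    generate_fn=lambda q: next(_toy_answers),
    score_fn=lambda q, a: float(len(a)),
    n=4,
)
```

Because the scorer is only applied at inference time, this kind of selection needs no additional training of the generator; its cost grows linearly with `n`.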