Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents

Zhenyu Liu, Yunxin Li, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang

TL;DR

ViSA introduces a multi-agent visual data selection framework to improve the quality of visual instruction data for multimodal large language models. It jointly quantifies image informativeness via Visual Elements and Diversity Perspectives, and instruction quality via Prior Token Perplexity and Image-Text Mutual Information, combining assessments with a Shapley-value weighting based on Pearson correlations. Using this approach, the authors curate 80K high-quality instruction samples, achieving competitive or superior results on seven benchmarks with only 2.5% of the original data, and show further gains for very large models with targeted high-quality data. This work demonstrates that data quality can dramatically boost MLLM training efficiency and performance, offering a practical, scalable data curation strategy with publicly released code.
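
The summary above does not spell out how the Shapley-value weighting is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It treats each score as a "player" and assumes a coalition's value is the absolute Pearson correlation between the coalition's averaged score and a reference signal. The value function, the `reference` signal, and all function names are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def coalition_value(scores, subset, reference):
    """ASSUMED value function: |Pearson| between the averaged scores in `subset` and `reference`."""
    if not subset:
        return 0.0
    n = len(reference)
    combined = [sum(scores[m][i] for m in subset) / len(subset) for i in range(n)]
    return abs(pearson(combined, reference))


def shapley_weights(scores, reference):
    """Exact Shapley value of each score under the assumed value function, normalized to sum to 1."""
    names = list(scores)
    k = len(names)
    phi = {}
    for m in names:
        others = [x for x in names if x != m]
        total = 0.0
        for r in range(k):
            for subset in combinations(others, r):
                # Standard Shapley coefficient |S|! (k - |S| - 1)! / k!
                coeff = factorial(r) * factorial(k - r - 1) / factorial(k)
                gain = (coalition_value(scores, subset + (m,), reference)
                        - coalition_value(scores, subset, reference))
                total += coeff * gain
        phi[m] = total
    norm = sum(phi.values())
    if norm == 0:
        # Degenerate case: no coalition correlates with the reference, fall back to uniform weights.
        return {m: 1.0 / k for m in names}
    return {m: v / norm for m, v in phi.items()}


if __name__ == "__main__":
    # Toy numbers only; the keys borrow score abbreviations from the paper's Figure 2 caption.
    toy_scores = {
        "SC": [0.2, 0.4, 0.5, 0.7, 0.9],
        "DP": [0.3, 0.1, 0.6, 0.5, 0.8],
        "PT": [0.9, 0.2, 0.4, 0.1, 0.6],
    }
    toy_reference = [0.1, 0.3, 0.5, 0.6, 1.0]  # hypothetical quality proxy
    print(shapley_weights(toy_scores, toy_reference))
```

Exact enumeration visits every subset of scores, which is cheap for a handful of scores but grows exponentially with their number. Individual Shapley values can be negative under this value function, so a real pipeline would need to decide how to handle negative weights.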

Abstract

To improve Multimodal Large Language Models' (MLLMs) ability to process images and complex instructions, researchers predominantly curate large-scale visual instruction tuning datasets, which are either sourced from existing vision tasks or synthetically generated using LLMs and image descriptions. However, they often suffer from critical flaws, including misaligned instruction-image pairs and low-quality images. Such issues hinder training efficiency and limit performance improvements, as models waste resources on noisy or irrelevant data with minimal benefit to overall capability. To address this issue, we propose a Visual-Centric Selection approach via Agents Collaboration (ViSA), which centers on image quality assessment and image-instruction relevance evaluation. Specifically, our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images. Finally, we reorganize 80K instruction data from large open-source datasets. Extensive experiments demonstrate that ViSA outperforms or is comparable to current state-of-the-art models on seven benchmarks, using only 2.5% of the original data, highlighting the efficiency of our data selection approach. Moreover, we conduct ablation studies to validate the effectiveness of each component of our method. The code is available at https://github.com/HITsz-TMG/ViSA.
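
The two-step structure described in the abstract (first keep informative images, then keep high-quality instructions attached to those images) can be sketched as a simple two-stage filter. This is an illustrative sketch under stated assumptions, not the released ViSA code: the `Sample` fields, the keep ratios, and the function name are all hypothetical, and the scores are assumed to be precomputed.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    image_id: str
    instruction: str
    image_score: float        # hypothetical precomputed image-informativeness score
    instruction_score: float  # hypothetical precomputed instruction-quality score


def select_visual_centric(samples: List[Sample],
                          image_keep_ratio: float,
                          instruction_keep_ratio: float) -> List[Sample]:
    """Two-stage selection: filter by image score first, then by instruction score."""
    # Stage 1: rank distinct images by their score and keep the top fraction.
    image_scores = {}
    for s in samples:
        best = image_scores.get(s.image_id, float("-inf"))
        image_scores[s.image_id] = max(best, s.image_score)
    ranked = sorted(image_scores, key=image_scores.get, reverse=True)
    n_images = max(1, round(len(ranked) * image_keep_ratio))
    kept_images = set(ranked[:n_images])

    # Stage 2: among instructions tied to kept images, keep the best-scoring fraction.
    candidates = [s for s in samples if s.image_id in kept_images]
    candidates.sort(key=lambda s: s.instruction_score, reverse=True)
    n_keep = max(1, round(len(candidates) * instruction_keep_ratio))
    return candidates[:n_keep]


if __name__ == "__main__":
    toy = [
        Sample("img_a", "q1", 0.9, 0.2),
        Sample("img_b", "q2", 0.8, 0.7),
        Sample("img_c", "q3", 0.1, 0.99),
        Sample("img_d", "q4", 0.2, 0.5),
    ]
    for s in select_visual_centric(toy, 0.5, 0.5):
        print(s.image_id, s.instruction)
```

Filtering on the image first is what makes the selection visual-centric in this sketch: in the toy data, the instruction with the highest instruction score is still dropped because its image scores low.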

Paper Structure

This paper contains 29 sections, 8 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Two common issues in visual instruction tuning datasets. (Left) Low-Complexity Image: The image lacks meaningful visual information. (Right) Low-Quality Text: The instruction is weakly aligned with the image.
  • Figure 2: Overview of our agent collaboration for visual data selection. $Score(SC)$ denotes the Segmentation Complexity Score, $Score(OA)$ the Object Alignment Score, $Score(DP)$ the Diversity Perspective Score, $Score(PT)$ the Prior Token Perplexity Score, and $Score(IM)$ the Image-Text Mutual Information Score.
  • Figure 3: Score distributions across different datasets: (a) segmentation complexity, (b) diversity perspectives and (c) prior token perplexity.
  • Figure 4: Results of the medium-scale model on six curated complex image understanding and perception datasets.
  • Figure 5: Score distributions across different datasets: (a) object alignment, and (b) image-text mutual information.
  • ...and 1 more figure