TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types

Jiankang Chen; Tianke Zhang; Changyi Liu; Haojie Ding; Yaya Shi; Feng Cheng; Huihui Xiao; Bin Wen; Fan Yang; Tingting Gao; Di Zhang

TL;DR

TaskGalaxy addresses the limited task diversity of multimodal instruction fine-tuning data with a largely automated pipeline that expands a small seed set of manually defined tasks into 19,227 hierarchical vision-language task types and 413,648 quality-filtered image–Q&A samples. The pipeline uses GPT-4o for task expansion and Q&A generation, CLIP for initial image–task matching, and three open-source models acting as referees for quality filtering, which keeps manual effort low as the data scales. Fine-tuning LLaVA and InternVL-Chat with TaskGalaxy yields consistent improvements across 16 benchmarks, including a 68-point gain on MME for LLaVA-v1.5-13B, which supports the claim that diverse task types improve generalization. The work contributes both the nearly fully automated data-generation pipeline and evidence for the practical value of task-type diversity; the dataset is publicly released to support further research.

Abstract

Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy uses GPT-4o to enrich task diversity by expanding a small set of manually defined tasks. CLIP and GPT-4o then select the task types that best match open-source images, and relevant question-answer pairs are generated for each matched image and task type. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality while reducing manual intervention. Incorporating TaskGalaxy into the LLaVA-v1.5 and InternVL-Chat-v1.0 models yields substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.
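
The image–task matching and multi-model quality check described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it assumes image and task-type embeddings have already been computed (for example by CLIP), ranks task types by cosine similarity, and keeps a sample when a majority of referee models accept it. The function names, the toy two-dimensional vectors, the task-type strings, the value of k, and the majority-vote rule are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_tasks(image_vec: Sequence[float],
                task_vecs: Dict[str, Sequence[float]],
                k: int) -> List[str]:
    """Return the k task types whose embeddings are most similar to the image."""
    ranked = sorted(task_vecs,
                    key=lambda name: cosine(image_vec, task_vecs[name]),
                    reverse=True)
    return ranked[:k]


def keep_sample(votes: Sequence[bool], min_yes: int = 2) -> bool:
    """Keep a sample if at least min_yes referee models accept it (assumed rule)."""
    return sum(votes) >= min_yes


# Toy embeddings standing in for real CLIP outputs (hypothetical values).
image_vec = [1.0, 0.0]
task_vecs = {
    "OCR/Text Recognition": [0.9, 0.1],
    "Detection/Object Counting": [0.1, 0.9],
    "Reasoning/Chart Analysis": [0.7, 0.7],
}

candidates = top_k_tasks(image_vec, task_vecs, k=2)
print(candidates)

# Three referee judgments for one generated Q&A sample (hypothetical).
print(keep_sample([True, True, False]))
```

In a real pipeline the candidate list would be passed on for further screening and question-answer generation, and each boolean vote would come from a separate model judging whether the question and answer fit the image and task type.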
