TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types

Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang

TL;DR

TaskGalaxy addresses the limited task diversity in multimodal instruction fine-tuning by introducing a largely automated pipeline that expands a small seed set into 19,227 hierarchical vision-language task types and ~413k image–Q&A samples. The pipeline uses GPT-4o for task expansion and Q&A generation, CLIP for initial image–task matching, and three open-source models as referees for quality filtering. Fine-tuning LLaVA and InternVL-Chat with TaskGalaxy yields consistent improvements across 16 benchmarks, including a 68-point gain on MME for LLaVA-v1.5-13B, which suggests that diverse task types improve generalization. The work contributes a nearly fully automated pipeline for multimodal data generation and shows the practical value of task-type diversity; the dataset is publicly released to support adoption and further research.

Abstract

Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality, reducing manual intervention. Incorporating TaskGalaxy into LLaVA-v1.5 and InternVL-Chat-v1.0 models shows substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.
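
The matching and filtering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes image and task-type embeddings are already available as plain float vectors (in TaskGalaxy these would come from CLIP), and it assumes a simple majority vote among the referee models, which the abstract does not specify. The names `top_k_tasks`, `passes_referees`, and `min_votes`, the task-type labels, and the toy referee functions are all hypothetical.

```python
import math
from typing import Callable, Mapping, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_tasks(image_vec: Vector, task_vecs: Mapping[str, Vector], k: int) -> list:
    """Return the k task-type names whose embeddings are closest to the image."""
    ranked = sorted(
        task_vecs.items(),
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


def passes_referees(
    sample: dict,
    referees: Sequence[Callable[[dict], bool]],
    min_votes: int = 2,  # assumption: majority of three referees
) -> bool:
    """Keep a Q&A sample only if at least min_votes referees accept it."""
    votes = sum(1 for referee in referees if referee(sample))
    return votes >= min_votes


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for CLIP features (hypothetical values).
    image_vec = [1.0, 0.0]
    task_vecs = {
        "OCR~text recognition": [0.9, 0.1],
        "counting~object counting": [0.0, 1.0],
        "scene~scene classification": [0.5, 0.5],
    }
    print(top_k_tasks(image_vec, task_vecs, k=2))

    # Toy referees standing in for the three open-source judge models.
    referees = [
        lambda s: bool(s["answer"]),
        lambda s: s["question"].endswith("?"),
        lambda s: len(s["answer"]) > 10,
    ]
    sample = {"question": "What text is on the sign?", "answer": "STOP"}
    print(passes_referees(sample, referees))
```

In this sketch the embedding similarity only narrows the candidate task types per image; per the abstract, GPT-4o then further filters the candidates and generates the question-answer pairs before the referee step.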

Paper Structure

This paper contains 23 sections, 1 equation, 68 figures, 9 tables.

Figures (68)

  • Figure 1: An illustration of the benefits of high task type coverage in TaskGalaxy for the SFT stage. We present the performance of the LLaVA-v1.5-13B and InternVL-Chat-v1.0-7B models before and after integrating TaskGalaxy into the fine-tuning dataset.
  • Figure 2: An overview of the pipeline that generates task types and high-quality question-answer pairs for TaskGalaxy. We first define the first-level visual task types, along with a small number of second- and third-level task types, and then instruct GPT-4o to extend these to a broader range of task types. Next, we collect images from existing publicly available datasets, match task types to images, filter the matches, generate question-answer pairs for the matched task types, and use three referee models to obtain the final high-quality question-answer pairs, each tied to a task type that is strongly related to its image.
  • Figure 3: The prompt template used in GPT-4o API for first-level task type generation.
  • Figure 4: Sample images, task types, and Q&A in TaskGalaxy. The Task Type refers to the visual task related to the image. Question and Answers are generated by GPT-4o and subsequently filtered by three refereeing models.
  • Figure 5: Distribution of the number of images across the 19,227 task types in TaskGalaxy. Ranges such as 1-10 and 21-40 indicate the number of samples associated with a task type. The corresponding ratios give the proportion of task types that fall within each sample range.
  • ...and 63 more figures