Adaptive Task Balancing for Visual Instruction Tuning via Inter-Task Contribution and Intra-Task Difficulty

Yanqi Dai; Yong Wang; Zebin You; Dong Jing; Xiangxiang Chu; Zhiwu Lu

TL;DR

This paper addresses the suboptimal and imbalanced performance that arises when a large multimodal model learns multiple visual tasks simultaneously. It introduces VisATB, a framework with three components: token-level Visual Instruction Task Weighting (VITW), Inter-Task Contribution Balancing, and Intra-Task Difficulty Balancing. The task weights produced by these components are combined into a single weight that steers training across tasks. By measuring how much tasks contribute to one another and how hard each task is to learn, VisATB emphasizes tasks that raise overall performance while mitigating imbalance, yielding superior and more balanced results on the M$^3$IT, Academic, and Chat benchmarks. The approach is reported to be robust across model sizes and to carry a practical time cost for visual instruction tuning of large multimodal models.

Abstract

Visual instruction tuning is a key training stage of large multimodal models. However, when learning multiple visual tasks simultaneously, this approach often results in suboptimal and imbalanced overall performance due to latent knowledge conflicts across tasks. To mitigate this issue, we propose a novel Adaptive Task Balancing approach tailored for visual instruction tuning (VisATB). Specifically, we measure two critical dimensions for visual task balancing based on validation performance: (1) Inter-Task Contribution, the mechanism where learning one task enhances the performance on others owing to shared knowledge across tasks, and (2) Intra-Task Difficulty, which denotes the inherent learning difficulty of a single task. Furthermore, we propose prioritizing three categories of tasks with greater weight: those that offer substantial contributions to others, those that receive minimal contributions from others, and those that present high learning difficulties. Among these three task weighting strategies, the first and third focus on improving overall performance, and the second targets the mitigation of performance imbalance. Extensive experiments on three benchmarks demonstrate that our VisATB approach consistently achieves superior and more balanced overall performance in visual instruction tuning. The data, code, and models are available at https://github.com/YanqiDai/VisATB.
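The three weighting strategies described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the contribution matrix layout (`contribution[i][j]` as the gain on task `j` from learning task `i`), the softmax normalization, the equal mixing coefficients, and the function names `softmax` and `task_weights` are all assumptions made only for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Map arbitrary scores to positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def task_weights(contribution, difficulty, mix=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical combination of three task-weighting signals.

    contribution[i][j]: assumed gain on task j from learning task i.
    difficulty[i]: assumed learning difficulty of task i (higher = harder).
    mix: assumed mixing coefficients; they should sum to 1.
    """
    n = len(difficulty)
    # Average contribution each task gives to the other tasks.
    given = [
        sum(contribution[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Average contribution each task receives from the other tasks.
    received = [
        sum(contribution[j][i] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    w_give = softmax(given)                   # favor tasks that help others
    w_recv = softmax([-r for r in received])  # favor tasks that get little help
    w_diff = softmax(difficulty)              # favor tasks that are hard to learn
    combined = [
        mix[0] * a + mix[1] * b + mix[2] * c
        for a, b, c in zip(w_give, w_recv, w_diff)
    ]
    # Rescale so the average task weight is 1 (assumes mix sums to 1).
    return [n * w for w in combined]


if __name__ == "__main__":
    # Toy numbers, invented for this example only.
    contribution = [
        [0.0, 0.3, 0.2],
        [0.1, 0.0, 0.1],
        [0.0, 0.1, 0.0],
    ]
    difficulty = [0.9, 0.4, 0.5]
    print(task_weights(contribution, difficulty))
```

In this toy setup, the first task contributes the most to others, receives the least, and is the hardest, so it ends up with the largest weight. In a real pipeline, such weights would scale each task's loss during the final training stage.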

Paper Structure (21 sections, 12 equations, 4 figures, 14 tables)


Introduction
Method
Visual Instruction Task Weighting
Inter-Task Contribution Balancing
Intra-Task Difficulty Balancing
VisATB: Adaptive Task Balancing
Experiments
Experimental Setup
Evaluation on the M$^3$IT Benchmark
Evaluation on the Academic Benchmark
Evaluation on the Chat Benchmark
Ablation Studies
Related Work
Conclusion
Task Information and Data Preparation
...and 6 more sections

Figures (4)

Figure 1: Schematic illustrations of inter-task contributions and intra-task difficulties. (a) The red words reveal that different tasks have overlapping knowledge domains, enabling inter-task contributions. (b) The different performance improvement trajectories w.r.t. training data amount reflect distinct degrees of intra-task difficulties.
Figure 2: Overview of VisATB. In the preparation stage, we train models on the mini subset of all tasks and the dataset of each task, and validate their performance across all tasks to measure inter-task contribution and intra-task difficulty. In the task weight calculation stage, we compute three types of task weights and integrate them into the task weight $\bm{\lambda_{\textbf{VisATB}}}$. In the final training stage, we utilize the entire dataset of all tasks and $\bm{\lambda_{\textbf{VisATB}}}$ to obtain the final model under the VITW paradigm.
Figure 3: Visualizations of inter-task contributions and intra-task difficulties calculated in VisATB on the Academic Benchmark.
Figure 4: Training loss of EW in the Academic Benchmark.
