Adaptive Task Balancing for Visual Instruction Tuning via Inter-Task Contribution and Intra-Task Difficulty
Yanqi Dai, Yong Wang, Zebin You, Dong Jing, Xiangxiang Chu, Zhiwu Lu
TL;DR
This paper addresses the imbalanced and conflicting improvements that arise when a large multimodal model learns multiple visual tasks at once. It introduces VisATB, a three-part framework comprising token-level Visual Instruction Task Weighting (VITW), Inter-Task Contribution Balancing, and Intra-Task Difficulty Balancing, which are integrated via a weighted combination to steer training across tasks. By quantifying how much tasks contribute to one another and how hard each task is to learn, VisATB gives more weight to the tasks that most improve overall performance while mitigating imbalance, achieving superior and more balanced results on the M³IT, Academic, and Chat benchmarks. The approach is reported to be robust across model sizes, and its training time cost is also examined, which supports its use for visual instruction tuning of large multimodal models.
Abstract
Visual instruction tuning is a key training stage of large multimodal models. However, when learning multiple visual tasks simultaneously, this approach often results in suboptimal and imbalanced overall performance due to latent knowledge conflicts across tasks. To mitigate this issue, we propose a novel Adaptive Task Balancing approach tailored for visual instruction tuning (VisATB). Specifically, we measure two critical dimensions for visual task balancing based on validation performance: (1) Inter-Task Contribution, the mechanism where learning one task enhances the performance on others owing to shared knowledge across tasks, and (2) Intra-Task Difficulty, which denotes the inherent learning difficulty of a single task. Furthermore, we propose prioritizing three categories of tasks with greater weight: those that offer substantial contributions to others, those that receive minimal contributions from others, and those that present high learning difficulties. Among these three task weighting strategies, the first and third focus on improving overall performance, and the second targets the mitigation of performance imbalance. Extensive experiments on three benchmarks demonstrate that our VisATB approach consistently achieves superior and more balanced overall performance in visual instruction tuning. The data, code, and models are available at https://github.com/YanqiDai/VisATB.
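The three prioritization rules in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the contribution matrix, the difficulty scores, the softmax normalization, the equal mixing of the three terms, and the names `softmax` and `task_weights` are all assumptions made for illustration. The abstract states only that both quantities are measured from validation performance.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def task_weights(contribution, difficulty, temperature=1.0):
    """Hypothetical task weighting that follows the three rules in the abstract.

    contribution[i][j]: assumed estimate of how much learning task i
        improves validation performance on task j (diagonal ignored).
    difficulty[i]: assumed learning-difficulty score of task i.
    Returns one weight per task, scaled so the weights average to 1.
    Requires at least two tasks.
    """
    n = len(difficulty)
    # Average contribution each task gives to the other tasks.
    given = [
        sum(contribution[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Average contribution each task receives from the other tasks.
    received = [
        sum(contribution[j][i] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    high_giver = softmax(given, temperature)                     # rule 1
    low_receiver = softmax([-r for r in received], temperature)  # rule 2
    hard_task = softmax(difficulty, temperature)                 # rule 3
    # Equal mixing of the three terms is an assumption of this sketch.
    raw = [a + b + c for a, b, c in zip(high_giver, low_receiver, hard_task)]
    scale = n / sum(raw)
    return [r * scale for r in raw]


# Toy example: task 0 helps the others most, is helped least, and is hardest.
C = [[0.0, 0.4, 0.4],
     [0.1, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
d = [0.8, 0.2, 0.2]
weights = task_weights(C, d)
```

In this toy setting, task 0 receives the largest weight because it satisfies all three criteria at once, while tasks 1 and 2 receive equal, smaller weights. Such weights could then scale each task's loss during training.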
