Investigating the Scaling Effect of Instruction Templates for Training Multimodal Language Model
Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Jiarui Jin, Ao Luo, Yuan Lu, Li Yao, Cunjian Chen, Julian McAuley, Wentao Zhang, Hanqian Wu
TL;DR
This work investigates how the scale of instruction templates affects multimodal language model (MLM) training. It introduces a programmatic template generator that produces about 15,000 templates from 24 meta templates using a sentence pattern tree and weighted sampling. Empirically, MLM performance peaks at a medium template scale that depends on model size (e.g., 5K templates for 7B, 100 templates for 13B), and larger scales do not yield consistent gains. Augmenting training data with templates at the optimal scale improves performance without requiring orders of magnitude more data and reduces variability across templates, which makes it a data-efficient path for visual instruction tuning. Together, the results show that both template diversity and the choice of template scale matter for robust MLM performance, and they offer a practical way to improve MLMs with limited data.
Abstract
Current multimodal language model (MLM) training approaches overlook the influence of instruction templates. Previous research addresses this problem with hand-crafted or model-generated templates, but does not investigate the scaling effect of instruction templates on MLM training. In this work, we propose a programmatic instruction template generator capable of producing over 15K unique instruction templates by filling randomly sampled positional synonyms into meta templates drawn by weighted sampling, enabling us to comprehensively explore MLM performance across various template scales during training. Our investigation into scaling instruction templates for MLM training demonstrates that MLM capabilities do not consistently improve with increasing template scale. Instead, optimal performance is achieved at a medium template scale. Models trained with data augmented at the optimal template scale achieve performance gains of up to 10% over those trained on the original data, and achieve the best overall performance among similar-scale MLMs, including those tuned on datasets up to 75 times larger than our augmented dataset. The code will be publicly available at https://github.com/shijian2001/TemplateScaling.
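To make the generation procedure concrete, the following is a minimal sketch of the idea described above: draw a meta template by weighted sampling, then fill each positional slot with a randomly sampled synonym. It is not the released implementation. The meta templates, weights, synonym lists, and the names `META_TEMPLATES`, `SYNONYMS`, `sample_template`, and `count_templates` are all hypothetical and chosen only for illustration.

```python
import random
import re

# Hypothetical meta templates as (sampling weight, pattern) pairs.
# <slot> marks a position to fill with a synonym; {question} is left
# untouched so the actual instruction can be inserted later.
META_TEMPLATES = [
    (3.0, "<verb> the <noun>: {question}"),
    (1.0, "<polite>, <verb> the following <noun>. {question}"),
]

# Hypothetical positional synonyms for each slot.
SYNONYMS = {
    "verb": ["answer", "respond to", "address"],
    "noun": ["question", "query"],
    "polite": ["please", "kindly"],
}

SLOT = re.compile(r"<(\w+)>")


def sample_template(rng: random.Random) -> str:
    """Draw one meta template by weight, then fill its slots at random."""
    weights = [w for w, _ in META_TEMPLATES]
    patterns = [p for _, p in META_TEMPLATES]
    pattern = rng.choices(patterns, weights=weights, k=1)[0]
    filled = SLOT.sub(lambda m: rng.choice(SYNONYMS[m.group(1)]), pattern)
    # Capitalize only the first character; the rest is kept as written.
    return filled[0].upper() + filled[1:]


def count_templates() -> int:
    """Size of the template space: sum over meta templates of the
    product of synonym counts for each slot."""
    total = 0
    for _, pattern in META_TEMPLATES:
        n = 1
        for slot in SLOT.findall(pattern):
            n *= len(SYNONYMS[slot])
        total += n
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    template = sample_template(rng)
    print(template.format(question="What is shown in the image?"))
    print("template space size:", count_templates())
```

In this toy setting the template space holds 3 × 2 + 2 × 3 × 2 = 18 templates. The count grows multiplicatively with the number of slots and synonyms per slot, which is how a small set of meta templates can expand into a template space of the size reported above.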
