Investigating the Scaling Effect of Instruction Templates for Training Multimodal Language Model

Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Jiarui Jin, Ao Luo, Yuan Lu, Li Yao, Cunjian Chen, Julian McAuley, Wentao Zhang, Hanqian Wu

TL;DR

This work investigates how the scale of instruction templates affects multimodal language model (MLM) training and introduces a programmatic template generator that produces over 15K templates from 24 meta templates using a sentence pattern tree and weighted sampling. Empirically, MLM performance peaks at a medium template scale that depends on model size (5K templates for LLaVA-1.5-7B, 100 templates for LLaVA-1.5-13B), and larger scales do not yield consistent gains. Augmenting the training data with templates at the optimal scale improves performance without increasing data size by orders of magnitude and reduces performance variance across held-out templates, which makes it a data-efficient path for visual instruction tuning. Together, the results highlight the value of template diversity and scale management for robust MLM performance and offer a practical, scalable way to improve MLMs with limited data.

Abstract

Current multimodal language model (MLM) training approaches overlook the influence of instruction templates. Previous research addresses this problem with hand-crafted or model-generated templates but does not investigate the scaling effect of instruction templates on MLM training. In this work, we propose a programmatic instruction template generator capable of producing over 15K unique instruction templates by filling randomly sampled positional synonyms into meta templates chosen by weighted sampling, enabling us to comprehensively explore MLM performance across various template scales during training. Our investigation into scaling instruction templates for MLM training demonstrates that MLM capabilities do not consistently improve with increasing template scale. Instead, optimal performance is achieved at a medium template scale. Models trained with data augmented at the optimal template scale achieve performance gains of up to 10% over those trained on the original data, and achieve the best overall performance among similar-scale MLMs tuned on datasets up to 75 times the size of our augmented dataset. The code will be publicly available at https://github.com/shijian2001/TemplateScaling.
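
The generation procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two meta templates, the `<...>` placeholder syntax, the synonym pools, and the function names `capacity`, `sample_template`, and `generate_templates` are all hypothetical. The sketch assumes that each meta template is weighted by the number of distinct templates it can generate, which is consistent with the node weights described for Figure 5 but may differ from the released code.

```python
import random
from math import prod

# Hypothetical meta templates. Each <placeholder> marks a position whose
# wording is drawn from a position-specific synonym pool.
META_TEMPLATES = [
    {
        "pattern": "<verb> the following question <manner>: {question}",
        "synonyms": {
            "<verb>": ["Answer", "Respond to", "Address"],
            "<manner>": ["briefly", "concisely"],
        },
    },
    {
        "pattern": "<opener> {question}",
        "synonyms": {
            "<opener>": ["Question:", "Here is a question:", "Q:", "Please answer:"],
        },
    },
]


def capacity(meta):
    """Number of distinct templates one meta template can generate."""
    return prod(len(pool) for pool in meta["synonyms"].values())


def sample_template(rng):
    """Pick a meta template by weighted sampling, then fill each placeholder."""
    # Assumption: the sampling weight equals the meta template's capacity.
    weights = [capacity(meta) for meta in META_TEMPLATES]
    meta = rng.choices(META_TEMPLATES, weights=weights, k=1)[0]
    template = meta["pattern"]
    for placeholder, pool in meta["synonyms"].items():
        template = template.replace(placeholder, rng.choice(pool))
    return template


def generate_templates(n, seed=0):
    """Collect n unique instruction templates by repeated sampling."""
    total = sum(capacity(meta) for meta in META_TEMPLATES)
    if n > total:
        raise ValueError(f"only {total} unique templates are available")
    rng = random.Random(seed)
    unique = set()
    while len(unique) < n:
        unique.add(sample_template(rng))
    return sorted(unique)


if __name__ == "__main__":
    for template in generate_templates(5):
        # Each template keeps a {question} slot for the actual instruction.
        print(template.format(question="What color is the car?"))
```

Under this weighting, a meta template is chosen with probability proportional to its capacity and each of its fillings is then chosen uniformly, so every unique template in the sketch is equally likely to be drawn. Varying `n` plays the role of the template scale studied in the paper.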

Paper Structure

This paper contains 14 sections, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Training with the optimal template scale significantly improves MLM's performance and reduces the performance variance. LLaVA-1.5-7B trained with 5K templates and LLaVA-1.5-13B trained with 100 templates achieve the highest average performance and the lowest performance variance among similar-scale MLMs on the SeedBench dataset, evaluated across 25 held-out instruction templates that are not included in training.
  • Figure 2: Example of the instruction template generation through a meta template.
  • Figure 3: Scaling trends of MLM performance with increasing template scale on each benchmark dataset. We also show the performance spread across models and datasets. The optimal template scale varies across datasets.
  • Figure 4: Scaling trend of MLM performance with increasing template scale on the average performance across five benchmarks. There exists an optimal template scale for MLM's general capabilities, with stronger models requiring a smaller template scale.
  • Figure 5: Sentence pattern trees with meta templates. Each tree uses distinct colors to denote different levels. Placeholders are marked in red, while static segments are marked in black. We further mark the weight of each node (# generated templates).
  • ...and 1 more figure