Transferring Textual Preferences to Vision-Language Understanding through Model Merging
Chen-An Li, Tzu-Han Lin, Yun-Nung Chen, Hung-yi Lee
TL;DR
The paper asks whether preferences learned from text can be transferred to vision-language understanding without additional training. It introduces a training-free vision-language reward model (VLRM) built by merging a text-based reward model (RM) with a large vision-language model (LVLM), using MergeKit-based merging strategies: Weighted Averaging, Task Arithmetic, TIES, and DARE. Across VL-RewardBench, TextVQA, and MMMU-Pro, the merged VLRMs consistently outperform both LVLM scoring and text-based RM baselines, and in some cases approach or match large open-source or commercial models, at lower computational cost than training a VLRM. The approach offers a practical, resource-efficient way to infuse textual preferences into LVLMs, and it could complement or reduce reliance on multimodal preference data collection.
Abstract
Large vision-language models (LVLMs) perform strongly across a wide range of multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative: merging text-based reward models (RMs) with LVLMs to create VLRMs. Our results show that the merged models outperform both LVLM scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.
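To make the idea of merging concrete, here is a minimal sketch of the two simplest strategies named in the TL;DR, Weighted Averaging and Task Arithmetic, in their commonly used general forms. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy parameter dictionaries, and the coefficient `lam` are hypothetical, and the sketch assumes the LVLM's language backbone and the text RM were fine-tuned from the same pretrained base, so their parameters share names and shapes. This section does not specify which modules are merged or which coefficients are used.

```python
from typing import Dict, List

# Toy stand-in for a model's weights: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def weighted_average(lvlm: Params, rm: Params, lam: float) -> Params:
    """Interpolate shared parameters: lam * lvlm + (1 - lam) * rm."""
    return {
        name: [lam * a + (1.0 - lam) * b for a, b in zip(lvlm[name], rm[name])]
        for name in lvlm
        if name in rm  # only parameters present in both models are merged
    }


def task_arithmetic(base: Params, lvlm: Params, rm: Params, lam: float) -> Params:
    """Add scaled task vectors (fine-tuned minus base) back onto the base."""
    merged: Params = {}
    for name, base_values in base.items():
        if name not in lvlm or name not in rm:
            continue  # skip parameters that are not shared by all three models
        merged[name] = [
            p + lam * (a - p) + lam * (b - p)
            for p, a, b in zip(base_values, lvlm[name], rm[name])
        ]
    return merged


# Hypothetical two-element "layer" used only to exercise the functions.
base = {"w": [0.0, 1.0]}
lvlm = {"w": [1.0, 1.0]}
rm = {"w": [0.0, 3.0]}

averaged = weighted_average(lvlm, rm, lam=0.5)
combined = task_arithmetic(base, lvlm, rm, lam=1.0)
```

TIES and DARE build on the same task-vector view but first sparsify or randomly drop and rescale the task vectors before combining them; they are omitted here to keep the sketch short.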
