Transferring Textual Preferences to Vision-Language Understanding through Model Merging

Chen-An Li; Tzu-Han Lin; Yun-Nung Chen; Hung-yi Lee

TL;DR

The paper addresses how to transfer textual preferences to vision-language understanding without additional training. It introduces a training-free vision-language reward model (VLRM) built by merging a text-based reward model (RM) with a large vision-language model (LVLM), using MergeKit-based strategies such as Weighted Averaging, Task Arithmetic, TIES, and DARE. Across VL-RewardBench, TextVQA, and MMMU-Pro, the merged VLRMs consistently outperform both LVLM scoring and text-based RM baselines, and in some cases approach or match large open-source or commercial models, at lower computational cost. The approach offers a practical, resource-efficient way to infuse textual preferences into LVLMs and could complement or reduce reliance on multimodal preference data collection.

Abstract

Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.
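
To make the idea of merging concrete, the following is a minimal sketch of two of the strategies named in the TL;DR, Weighted Averaging and Task Arithmetic, in their generic textbook form. It is not the authors' implementation and does not use MergeKit. It assumes that the RM and the LVLM were fine-tuned from the same base checkpoint, so their shared parameters have matching names and shapes. The function names, the toy parameter dictionaries, and the single coefficient `lam` are all hypothetical. How components that exist in only one model are handled is outside this sketch.

```python
# Toy sketch only: parameters are dicts mapping a name to a list of floats.
# All names here are hypothetical; nothing is taken from the paper or MergeKit.

def weighted_average(rm, lvlm, lam):
    """Linear interpolation per parameter: lam * rm + (1 - lam) * lvlm."""
    return {
        name: [lam * r + (1.0 - lam) * v for r, v in zip(rm[name], lvlm[name])]
        for name in rm
    }


def task_arithmetic(base, rm, lvlm, lam):
    """Add scaled task vectors (fine-tuned minus base) back onto the base."""
    merged = {}
    for name in base:
        merged[name] = [
            b + lam * ((r - b) + (v - b))
            for b, r, v in zip(base[name], rm[name], lvlm[name])
        ]
    return merged


# Assumed toy weights for one shared parameter called "w".
base = {"w": [0.0, 1.0]}
rm = {"w": [2.0, 1.0]}    # text reward model: changed the first entry
lvlm = {"w": [0.0, 3.0]}  # vision-language model: changed the second entry

averaged = weighted_average(rm, lvlm, lam=0.5)
combined = task_arithmetic(base, rm, lvlm, lam=0.5)
```

In this toy case the two models changed different entries of `w`, so both rules keep a scaled share of each change. TIES and DARE build on the same task vectors but first sparsify them (by trimming or randomly dropping entries) before combining; they are omitted here to keep the sketch short.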
