Transferring Textual Preferences to Vision-Language Understanding through Model Merging

Chen-An Li, Tzu-Han Lin, Yun-Nung Chen, Hung-yi Lee

TL;DR

The paper tackles transferring textual preferences to vision-language understanding without additional training. It introduces a training-free VLRM built by merging a text-based RM with an LVLM, leveraging MergeKit-based strategies such as Weighted Averaging, Task Arithmetic, TIES, and DARE. Across VL-RewardBench, TextVQA, and MMMU-Pro, the merged VLRMs consistently outperform LVLM scoring and text RM baselines, and in some cases approach or match large open-source or commercial models, all at lower computational cost. This approach offers a practical, resource-efficient means of infusing textual preferences into LVLMs and could complement or reduce reliance on multimodal preference data collection.
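To make two of the named strategies concrete, here is a minimal sketch of Weighted Averaging and Task Arithmetic on toy parameters. It assumes the LVLM and the text-based RM were both fine-tuned from the same base language model, so their parameters line up name by name. The function names, the parameter name `w`, the coefficient `lam`, and all values are hypothetical. The sketch does not use MergeKit, and it ignores components that exist in only one model, such as a vision encoder or a reward head.

```python
# Toy sketch: each "model" is a dict mapping a parameter name to a list of floats.
# All names and numbers are hypothetical and chosen only for illustration.

def weighted_average(lvlm, rm, lam):
    """Linear merge: lam * rm + (1 - lam) * lvlm, applied per parameter."""
    return {
        name: [lam * r + (1.0 - lam) * v for v, r in zip(lvlm[name], rm[name])]
        for name in lvlm
    }

def task_arithmetic(base, lvlm, rm, lam):
    """Add the scaled task vectors (fine-tuned minus base) onto the base."""
    merged = {}
    for name in base:
        merged[name] = [
            b + lam * (v - b) + lam * (r - b)
            for b, v, r in zip(base[name], lvlm[name], rm[name])
        ]
    return merged

# Shared base model and two models fine-tuned from it (assumed setup).
base = {"w": [0.0, 1.0]}
lvlm = {"w": [1.0, 1.0]}   # changed only the first weight
rm = {"w": [0.0, 3.0]}     # changed only the second weight

linear = weighted_average(lvlm, rm, lam=0.5)
arith = task_arithmetic(base, lvlm, rm, lam=1.0)

print(linear)  # {'w': [0.5, 2.0]}
print(arith)   # {'w': [1.0, 3.0]}
```

In this toy case the two fine-tuned models changed different weights, so Task Arithmetic keeps both changes in full, while Weighted Averaging halves each of them. TIES and DARE build on the same task vectors by additionally pruning or randomly dropping entries before the vectors are combined.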

Abstract

Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.

Paper Structure

This paper contains 48 sections, 11 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: Framework for merging a text-based RM with an LVLM. LVLMs excel at visual tasks, while text-based RMs struggle to provide accurate rewards without visual cues. We transfer textual preferences to vision-language understanding, resulting in a VLRM. All icons used in this figure are sourced from https://www.flaticon.com/
  • Figure 2: Effect of DARE + Task Vec. merging hyperparameters with Tulu-2.5-RM as the text-based RM.
  • Figure 3: Full results of merging Llama-3.2-Vision and Tulu-2.5-RM (Linear)
  • Figure 4: Full results of merging Llama-3.2-Vision and Tulu-2.5-RM (Task Vec.)
  • Figure 5: Full results of merging Llama-3.2-Vision and Tulu-2.5-RM (TIES)
  • ...and 7 more figures