Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Zhixiang Wang, Jingxuan Xu, Dajun Chen, Yunfang Wu, Wei Jiang, Yong Li

TL;DR

Fusing a text-based search agent with a base VLM composes multi-modal search capabilities without any additional multi-modal training data. The paper also introduces Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters by their impact on model loss, using only a small set of calibration samples.

Abstract

Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
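
The abstract describes OBM only at a high level: parameters are ranked by their impact on model loss, estimated from a few calibration samples. The following is a minimal sketch of one way such a saliency-aware merge could look, not the paper's algorithm. It assumes a second-order, Optimal Brain Damage-style score `0.5 * h_i * delta_i^2`, where `delta` is the difference between the search agent's weights and its base weights and `h` is a diagonal Fisher estimate from per-sample gradients. The function names, the `keep_ratio` parameter, and the scoring rule are all illustrative assumptions.

```python
from typing import List


def diagonal_fisher(grads_per_sample: List[List[float]]) -> List[float]:
    """Mean squared gradient per parameter over the calibration samples."""
    n = len(grads_per_sample)
    dim = len(grads_per_sample[0])
    return [sum(g[i] ** 2 for g in grads_per_sample) / n for i in range(dim)]


def saliency_mask(delta: List[float], fisher: List[float],
                  keep_ratio: float) -> List[bool]:
    """Keep the top `keep_ratio` fraction of entries by assumed saliency.

    Assumption: saliency_i = 0.5 * fisher_i * delta_i**2, a diagonal
    second-order estimate of the loss change from dropping delta_i.
    """
    scores = [0.5 * h * d * d for h, d in zip(fisher, delta)]
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


def merge(base: List[float], delta: List[float], mask: List[bool],
          scale: float = 1.0) -> List[float]:
    """Add only the salient entries of `delta` to the base weights."""
    return [b + scale * d if m else b for b, d, m in zip(base, delta, mask)]
```

In this sketch, entries of `delta` with a low score are dropped before merging, which is one plausible way to reduce parameter interference; the actual OBM criterion may differ.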

Paper Structure (33 sections, 6 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Performance of the merging-based paradigm versus standard VLM training. The DA Baseline denotes Direct Answering. Compared to the standard base VLM (squares), model merging (diamonds) establishes a higher performance floor in zero-shot settings (blue) and further raises the ceiling after reinforcement learning (red).
  • Figure 2: Overview of the merging paradigm for constructing a multi-modal search agent. The merged model retains the frozen vision encoder and projector from the VLM, while its language module is obtained via parameter-level merging between the VLM and the LLM. As a result, the merged agent is able to answer multi-modal queries by autonomously searching external information.
  • Figure 3: Comparison of training accuracy in the early training phase across different base models.
  • Figure 4: Average accuracy and search rate on benchmarks at each model’s best checkpoint.
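
The layout described in the Figure 2 caption (vision encoder and projector kept from the VLM, language module merged with the text-based agent) can be sketched as below. This is a hedged illustration, not the paper's implementation: the `language_model.` key prefix, the shared parameter names between the two models, and the use of plain linear interpolation with weight `alpha` are assumptions. Weights are represented as flat lists of floats to keep the example self-contained.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def merge_language_module(vlm_state: StateDict, agent_state: StateDict,
                          alpha: float = 0.5,
                          lm_prefix: str = "language_model.") -> StateDict:
    """Merge only language-module weights; copy everything else from the VLM.

    Assumption: language parameters share names and shapes in both models,
    and the merge rule is (1 - alpha) * vlm + alpha * agent.
    """
    merged: StateDict = {}
    for name, weights in vlm_state.items():
        if name.startswith(lm_prefix) and name in agent_state:
            other = agent_state[name]
            merged[name] = [(1 - alpha) * w + alpha * o
                            for w, o in zip(weights, other)]
        else:
            # Vision encoder and projector stay exactly as in the VLM.
            merged[name] = list(weights)
    return merged
```

Any other parameter-level merge rule, including a saliency-masked one, could replace the interpolation line without changing which modules are left untouched.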