Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models

Shitian Zhao, Renrui Zhang, Xu Luo, Yan Wang, Shanghang Zhang, Peng Gao

TL;DR

This work introduces likelihood composition, a post-hoc, training-free framework to fuse heterogeneous multi-modal language models by manipulating and combining candidate-answer log-likelihoods. It decomposes fusion into self-composition (debias, highlight) and mutual-composition (ensemble, majority-vote) and further integrates them into mix-composition, enabling flexible, on-the-fly model fusion. Empirical results across 9 VQA benchmarks and 10 MLMs show that self-composition yields strong gains for weaker models, while mix-composition consistently outperforms standard ensemble or majority-vote baselines, with notable gains on MMVP and MME. The findings suggest that model quality outweighs sheer quantity in fusion and offer a practical, architecture-agnostic approach to boost multi-modal visual question answering without retraining.

Abstract

Model fusion has long been an important topic, especially now that large language models (LLMs) and multi-modal language models (MLMs) with different architectures, parameter sizes, and training pipelines are released continually. In this work, we propose a post-hoc framework for fusing heterogeneous off-the-shelf models, which we call \textit{likelihood composition}. The basic idea is to compose the likelihood distributions of multiple models on a multi-choice visual question answering task. Here the core concept, \textit{likelihood}, is the log-probability of a candidate answer. Within \textit{likelihood composition}, we introduce four basic operations: \textit{debias}, \textit{highlight}, \textit{majority-vote}, and \textit{ensemble}. Combining (composing) these basic operations yields mixed composition methods, which we call \textit{mix-composition}. Through comprehensive experiments on 9 VQA datasets and 10 MLMs, we demonstrate the effectiveness of \textit{mix-composition} compared with simple \textit{ensemble} or \textit{majority-vote} methods. The framework is extensible: new basic composition methods can be proposed and combined into new mixed composition methods. We hope that \textit{likelihood composition} provides a new perspective on fusing heterogeneous models and inspires further exploration within this framework.
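To make the operations named in the abstract concrete, here is a minimal Python sketch over per-candidate log-likelihoods. It is an illustration under stated assumptions, not the authors' implementation. All function names and toy numbers are hypothetical. `ensemble` and `majority_vote` follow the usual meaning of those terms. The `debias` form (subtracting an $\alpha$-scaled log-likelihood obtained from a separate bias-probing prompt) is an assumption, since the abstract does not state the formula. `mix_composition` shows just one possible combination: debias each model, then ensemble.

```python
import math
from collections import Counter


def ensemble(model_lls):
    """Mutual-composition: sum each candidate's log-likelihood over models, pick the best."""
    n = len(model_lls[0])
    totals = [sum(lls[i] for lls in model_lls) for i in range(n)]
    return max(range(n), key=totals.__getitem__)


def majority_vote(model_lls):
    """Mutual-composition: each model votes for its top candidate; ties go to the lowest index."""
    votes = [max(range(len(lls)), key=lls.__getitem__) for lls in model_lls]
    counts = Counter(votes)
    return min(counts, key=lambda c: (-counts[c], c))


def debias(lls, bias_lls, alpha=1.0):
    """Self-composition (ASSUMED form): subtract alpha-scaled log-likelihoods
    obtained from a hypothetical bias-probing prompt."""
    return [l - alpha * b for l, b in zip(lls, bias_lls)]


def mix_composition(model_lls, model_bias_lls, alpha=1.0):
    """One possible mix (an assumption): debias every model, then ensemble."""
    adjusted = [debias(l, b, alpha) for l, b in zip(model_lls, model_bias_lls)]
    return ensemble(adjusted)


# Toy data: 3 hypothetical models scoring 4 candidate answers.
LLS = [
    [math.log(p) for p in (0.50, 0.30, 0.10, 0.10)],
    [math.log(p) for p in (0.10, 0.60, 0.20, 0.10)],
    [math.log(p) for p in (0.45, 0.35, 0.10, 0.10)],
]
# Hypothetical bias-prompt log-likelihoods, skewed toward candidate 0.
BIAS = [[math.log(p) for p in (0.4, 0.2, 0.2, 0.2)] for _ in LLS]

print(majority_vote(LLS), ensemble(LLS), mix_composition(LLS, BIAS))
```

On this toy input, majority vote selects candidate 0 (two of three models rank it first), while the ensemble selects candidate 1 because the second model's strong preference dominates the summed log-likelihoods. The two mutual-composition rules can therefore disagree on the same inputs.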

Paper Structure (18 sections, 7 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Two categories of likelihood composition: self-composition and mutual-composition.
  • Figure 2: Prompt design of debias.
  • Figure 3: Prompt design of highlight.
  • Figure 4: (a) Applying debias on LLaVA series models with $\alpha$ ranging from 1.0 to 0.1. (b) Applying highlight on LLaVA series models with $\alpha$ ranging from 1.0 to 0.1.
  • Figure 5: Results of applying debias and highlight to model A and model B, where A and B are different models. More details can be found in Sec. \ref{sec:heatmap}. The x-axis represents model B and the y-axis represents model A.
  • ...and 2 more figures