Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
Shitian Zhao, Renrui Zhang, Xu Luo, Yan Wang, Shanghang Zhang, Peng Gao
TL;DR
This work introduces likelihood composition, a post-hoc, training-free framework to fuse heterogeneous multi-modal language models by manipulating and combining candidate-answer log-likelihoods. It decomposes fusion into self-composition (debias, highlight) and mutual-composition (ensemble, majority-vote) and further integrates them into mix-composition, enabling flexible, on-the-fly model fusion. Empirical results across 9 VQA benchmarks and 10 MLMs show that self-composition yields strong gains for weaker models, while mix-composition consistently outperforms standard ensemble or majority-vote baselines, with notable gains on MMVP and MME. The findings suggest that model quality outweighs sheer quantity in fusion and offer a practical, architecture-agnostic approach to boost multi-modal visual question answering without retraining.
Abstract
Model fusion has long been an important topic, especially in an era where large language models (LLMs) and multi-modal language models (MLMs) with different architectures, parameter sizes and training pipelines are constantly being created. In this work, we propose a post-hoc framework for fusing heterogeneous models off-the-shelf, which we call \textit{likelihood composition}. The basic idea is to compose the likelihood distributions of multiple models on a multi-choice visual-question-answering task. Here the core concept, \textit{likelihood}, is the log-probability of a candidate answer. Within \textit{likelihood composition}, we introduce four basic operations: \textit{debias}, \textit{highlight}, \textit{majority-vote} and \textit{ensemble}. By combining (composing) these basic operations, we obtain mixed composition methods, which we call \textit{mix-composition}. Through comprehensive experiments on 9 VQA datasets and 10 MLMs, we demonstrate the effectiveness of \textit{mix-composition} compared with simple \textit{ensemble} or \textit{majority-vote} methods. Within this framework, new basic composition methods can be proposed and combined to obtain new mixed composition methods. We hope that \textit{likelihood composition} provides a new perspective on fusing heterogeneous models and inspires further exploration under this framework.
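To make the two mutual-composition baselines concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes each model has already produced one log-likelihood per candidate answer, that ensemble means averaging those scores across models, and that majority-vote means each model votes for its own top candidate. The function names, the tie-breaking rule and the toy scores are hypothetical. The debias and highlight operations are not sketched because the abstract does not define them.

```python
from collections import Counter


def ensemble(loglik_per_model):
    """Average each candidate's log-likelihood across models; return the best index.

    Averaging is an assumption made for this sketch.
    """
    n_models = len(loglik_per_model)
    n_candidates = len(loglik_per_model[0])
    avg = [
        sum(scores[i] for scores in loglik_per_model) / n_models
        for i in range(n_candidates)
    ]
    return max(range(n_candidates), key=lambda i: avg[i])


def majority_vote(loglik_per_model):
    """Each model votes for its highest-scoring candidate; return the most-voted index.

    Ties go to the lowest candidate index (a hypothetical tie-breaking rule).
    """
    votes = [
        max(range(len(scores)), key=lambda i: scores[i])
        for scores in loglik_per_model
    ]
    counts = Counter(votes)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


# Toy scores (made up for illustration): 3 models x 3 candidate answers.
scores = [
    [-0.1, -3.0, -4.0],  # model A is very confident in candidate 0
    [-1.0, -0.9, -4.0],  # model B slightly prefers candidate 1
    [-1.1, -1.0, -4.0],  # model C slightly prefers candidate 1
]

ensemble_choice = ensemble(scores)
vote_choice = majority_vote(scores)
```

On these toy scores the two rules disagree: one confident model dominates the average, while the two mildly confident models win the vote. This is the kind of difference a composed method can exploit.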
