Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference
Yuta Matsui, Ryosuke Yamaki, Ryo Ueda, Seitaro Shinagawa, Tadahiro Taniguchi
TL;DR
The paper presents the Metropolis-Hastings Captioning Game (MHCG), a decentralized Bayesian framework for fusing knowledge across vision-language models (VLMs) by having agents alternately propose and judge captions for images. By casting fusion as a Metropolis-Hastings-based communication game within an Inter-ProbVLM setup, MHCG enables knowledge transfer while mitigating catastrophic forgetting through accept/reject judgments and replay-based continual updates. The first experiment, with VLMs pretrained on distinct datasets (CC3M and COCO), shows that MHCG improves reference-free captioning metrics and enhances cross-dataset vocabulary transfer without incurring the high inference costs of ensembles or the forgetting seen in naive fine-tuning. The second experiment, with category-level COCO splits, shows that MHCG preserves each agent's own-domain vocabulary while progressively learning the counterpart's categories, achieving superior category-level F1 scores and competitive generation time. Overall, MHCG offers a scalable, decentralized approach to knowledge fusion across VLMs, with potential extensions to multiple agents and multilingual scenarios.
Abstract
We propose the Metropolis-Hastings Captioning Game (MHCG), a method that fuses the knowledge of multiple vision-language models (VLMs) by having them learn from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents that alternately caption images and learn from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing the VLMs' category-level vocabulary by observing the occurrence of that vocabulary in the generated captions.
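The propose-and-judge step described above can be illustrated with a generic Metropolis-Hastings acceptance test. The following is a minimal sketch, not the authors' implementation. It assumes that the listener scores each caption with a log-likelihood under its own model, and that the speaker's proposal is accepted with probability min(1, exp(loglik_proposed - loglik_current)). The names `mh_accept_prob`, `mh_step`, and `listener_loglik` are hypothetical and do not come from the paper.

```python
import math
import random


def mh_accept_prob(loglik_proposed, loglik_current):
    """Acceptance probability min(1, exp(loglik_proposed - loglik_current)).

    Assumption: both values are log-likelihoods assigned by the listener's
    own model to the speaker's caption and to the listener's current caption.
    """
    # Clamping the exponent at 0 gives min(1, ratio) and avoids overflow.
    return math.exp(min(0.0, loglik_proposed - loglik_current))


def mh_step(proposed, current, listener_loglik, rng):
    """One judge step: keep the listener's caption or adopt the proposal.

    listener_loglik is a hypothetical callable mapping a caption to a
    log-likelihood under the listener's model.
    """
    prob = mh_accept_prob(listener_loglik(proposed), listener_loglik(current))
    accepted = rng.random() < prob
    return (proposed if accepted else current), accepted


if __name__ == "__main__":
    # Toy listener that prefers captions mentioning "dog" (illustrative only).
    def toy_loglik(caption):
        return 0.0 if "dog" in caption else -5.0

    rng = random.Random(0)
    caption, accepted = mh_step("a dog on grass", "an animal", toy_loglik, rng)
    print(caption, accepted)
```

In a game of this kind, the two agents would swap speaker and listener roles after each round, and accepted captions would serve as training targets for the listener. How those updates are performed, including the replay-based continual learning mentioned above, is outside the scope of this sketch.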
