xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Le Xue; Manli Shu; Anas Awadalla; Jun Wang; An Yan; Senthil Purushwalkam; Honglu Zhou; Viraj Prabhu; Yutong Dai; Michael S Ryoo; Shrikant Kendre; Jieyu Zhang; Shaoyen Tseng; Gustavo A Lujan-Moreno; Matthew L Olson; Musashi Hinck; David Cobbley; Vasudev Lal; Can Qin; Shu Zhang; Chia-Chih Chen; Ning Yu; Juntao Tan; Tulika Manoj Awalgaonkar; Shelby Heinecke; Huan Wang; Yejin Choi; Ludwig Schmidt; Zeyuan Chen; Silvio Savarese; Juan Carlos Niebles; Caiming Xiong; Ran Xu

Abstract

This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base models and the instruction fine-tuned ones. The models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. They demonstrate competitive performance among open-source LMMs of similar size and can comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.