mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš

TL;DR

mBLIP is an efficient, modular approach to building multilingual Vision-LLMs: it re-aligns the Q-Former of an English BLIP-2 model to a multilingual LLM, using a compact, machine-translated training mix that covers 95 languages together with parameter-efficient training. Only a small portion of the parameters is trained (the LLM is adapted via LoRA), and 8-bit quantization lets training run on consumer hardware; the method needs roughly 2.5 million images and 124 million trainable parameters. On the IGLUE benchmark and XM3600, mBLIP is competitive with state-of-the-art multilingual models and clearly outperforms English-only Vision-LLMs in non-English settings, which indicates strong cross-lingual transfer. The work offers an accessible path to multilingual Vision-LLMs and releases the model, code, and training data to support further research and applications.
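
To make the parameter-efficiency claim concrete, the following is a minimal sketch of the general LoRA idea: a frozen weight matrix W receives a trainable low-rank update B·A, so only rank × (d_in + d_out) parameters are trained per adapted matrix. This is plain Python for illustration only; the helper names (`matmul`, `lora_param_count`, `lora_forward`), the toy sizes, and the 4096-dimensional example are assumptions and do not reflect mBLIP's actual configuration or code.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x) for one input vector x."""
    col = [[v] for v in x]
    base = matmul(W, col)                # frozen path
    update = matmul(B, matmul(A, col))   # trainable low-rank path
    return [b[0] + scale * u[0] for b, u in zip(base, update)]


# Toy sizes chosen for illustration (hypothetical, not taken from the paper).
d_out, d_in, rank = 4, 6, 2
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[0.0] * rank for _ in range(d_out)]  # trainable, zero-initialized: update starts at 0
x = [rng.uniform(-1, 1) for _ in range(d_in)]

y = lora_forward(x, W, A, B)

# Hypothetical 4096 x 4096 projection adapted with rank 16.
full_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, 16)
print(f"full: {full_params}, LoRA: {lora_params}")
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to learn the small correction.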

Abstract

Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to 'understand' the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like LLaVA 1.5. We release our model, code, and training data at https://github.com/gregor-ge/mBLIP.

Paper Structure (28 sections, 3 figures, 14 tables)

Figures (3)

  • Figure 1: The mBLIP architecture: A Q-Former encodes the image in learned query tokens which are projected to the LLM space. We initialize the Q-Former from a BLIP-2 model and re-align it to the multilingual LLM with a multilingual task mix. The image encoder and LLM (aside from LoRA weights) are frozen during training.
  • Figure 2: Cross-lingual transfer of models fine-tuned on English. The smaller gap of mBLIP mT0 between high- and low-resource languages suggests better transfer capabilities. (CCLM 4M results are taken from v1 of Zeng et al., 2022, on arXiv.)
  • Figure 3: Multilingual examples (with translations from Google Translate in parentheses). While the first row shows that the model can handle captioning and QA in diverse languages, the second row shows some failure cases. We use beam search (5 beams) with a repetition penalty of 1.5.
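
The Figure 3 caption mentions decoding with a repetition penalty of 1.5. As a rough illustration of what such a penalty does, the sketch below applies the common CTRL-style rule to a list of next-token logits: tokens that were already generated have positive logits divided by the penalty and negative logits multiplied by it, which makes them less likely to be picked again. The function name `apply_repetition_penalty` and the toy logits are hypothetical, and whether the authors' decoding uses exactly this rule is an assumption.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    """Return a copy of `logits` with already-generated tokens penalized.

    CTRL-style rule: a positive logit is divided by `penalty`, a negative
    logit is multiplied by it, so the token becomes less likely either way.
    """
    out = list(logits)
    for tok in set(generated_ids):  # penalize each seen token once
        out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out


# Toy vocabulary of four tokens; tokens 0 and 1 were already generated.
logits = [3.0, -1.0, 0.6, 2.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0])
print(penalized)  # [2.0, -1.5, 0.6, 2.0]
```

In a beam search with 5 beams, a penalty of this kind would be applied to each beam's logits at every step, before the candidate continuations are scored.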