X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment

Dongjae Shin; Hyeonseok Lim; Inho Won; Changsu Choi; Minjun Kim; Seungwoo Song; Hangyeol Yoo; Sangmin Kim; Kyungtae Lim

TL;DR

X-LLaVA addresses the high cost of multilingual vision-language data by combining vocabulary expansion, cross-lingual pretraining, and a pipeline for generating multilingual visual instruction-following (VIF) data. The authors construct a 91K English–Korean–Chinese multimodal corpus and train a bilingual LMM that shows notable gains in Korean while remaining competitive in English. The work suggests that multilingual VIF data, paired with targeted vocabulary enrichment and staged pretraining, can yield robust cross-lingual vision-language alignment at modest cost, offering a practical path toward scalable multilingual LMMs.

Abstract

The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of a multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.

Paper Structure (27 sections, 2 equations, 22 figures, 7 tables)

This paper contains 27 sections, 2 equations, 22 figures, 7 tables.

Introduction
Related Work
Vision-Language Models
Visual Instruction Following Datasets
Data Generation
The Focus of Data Building
Image Selection Criteria
Proposed VIF Dataset
Proposed Multilingual Model
Recap of LLaVA1.5
Enriching the LLM Vocabulary
X-LLaVA
Quantitative Evaluation
Experiment Environments
Intrinsic Evaluation of X-LLaVA
...and 12 more sections

Figures (22)

Figure 1: An example of prompt and result using data construction.
Figure 2: (a) Architecture of LLaVA1.5 & (b,c) The proposed language model pretraining
Figure 3: Korean Preference evaluation results by GPT4-V
Figure 4: English Preference evaluation results by GPT4-V
Figure 5: Korean Preference evaluation results by GPT4-V when limited to 30 Words.
...and 17 more figures
