X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment
Dongjae Shin, Hyeonseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim
TL;DR
X-LLaVA addresses the high cost of multilingual vision-language data by coupling vocabulary expansion, cross-lingual pretraining, and a pipeline for generating multilingual visual instruction-following (VIF) data. The authors construct a 91K English–Korean–Chinese multimodal corpus and train a bilingual LMM for Korean and English, with notable gains in Korean and competitive results in English. The work demonstrates that multilingual VIF data, when paired with targeted vocabulary enrichment and staged pretraining, can yield robust cross-language vision-language alignment with modest compute and cost. This approach offers a practical path toward scalable multilingual LMMs for real-world use.
Abstract
The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of a multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.
