Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

TL;DR

This work addresses the gap in cross-lingual artwork explanation for large-scale vision-language models (LVLMs) by constructing a Wikipedia-derived multilingual dataset across 10 languages, built without machine translation, and evaluating three task settings (Alignment-10, Alignment-5, Full). It demonstrates that LVLMs achieve the best explanation quality when both instructions and outputs are in English, with pronounced degradation when operating in non-English languages, highlighting limited cross-language transfer of knowledge learned from English data. The study analyzes the alignment between visual and linguistic knowledge, tests English-only instruction-tuning, and shows that multilingual pretraining of the Vision Encoder is needed to close the performance gap. It contributes a practical, multilingual evaluation framework and a public dataset to advance research on cross-lingual explanations in LVLMs, with implications for improving multilingual pretraining and evaluation protocols.

Abstract

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and demand for explanations generated by LVLMs is expected to grow. However, pre-training of the Vision Encoder and the integrated training of LLMs with the Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can fully realize their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks built with machine translation carry cultural differences and biases, which remain issues for their use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset, which takes into account nuances and country-specific phrases, was then used to evaluate the explanation generation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English than in English. In addition, LVLMs were observed to struggle to make effective use of the knowledge learned from English data. Our dataset is available at https://huggingface.co/datasets/naist-nlp/MultiExpArt

Paper Structure

This paper contains 46 sections, 4 equations, 6 figures, 31 tables.

Figures (6)

  • Figure 1: An example of situations that require multilingual and explanation skills.
  • Figure 2: How the dataset is built from Wikipedia. As described in the dataset creation section, we extracted and filtered Wikipedia pages about artworks. We then manually identified pages with titles and images common across ten languages.
  • Figure 3: Some of the results in the Alignment-5 task. The purple bin indicates the setting in which both the instruction and the output are in English ({En}-{En}); the green bin, the instruction in a language other than English and the output in English ({Lang}-{En}); the brown bin, both the instruction and the output in a language other than English ({Lang}-{Lang}); and the blue bin, the instruction in English and the output in a language other than English ({En}-{Lang}). The figure shows that English instructions remain optimal even when the amount of data is expanded. Further detailed results, including Phi-3 and XComposer2, are given in the Alignment-5 results table (tab:result-score-5), and the remaining results appear in the corresponding appendix figure (fig:a5-figure-appendix).
  • Figure 4: Visualization of Alignment-10 results as a heat map. The visualization is based on the setting in which the LVLMs receive instructions and produce output in English.
  • Figure 5: Visualization of Alignment-10 results as a heat map. The visualization is based on the setting in which the LVLMs receive instructions and produce output in English.
  • ...and 1 more figure
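
The four instruction/output language pairings named in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Setting` class, the `settings_for` helper, and the two-letter language codes are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    """One evaluation setting: which language the instruction and the output use."""
    name: str              # label as written in the Figure 3 caption
    instruction_lang: str  # language code of the instruction (assumed convention)
    output_lang: str       # language code of the expected output (assumed convention)


def settings_for(lang: str) -> list[Setting]:
    """Return the four pairings for one non-English language code."""
    if lang == "en":
        raise ValueError("lang must be a non-English language code")
    return [
        Setting("En-En", "en", "en"),        # instruction and output in English
        Setting("Lang-En", lang, "en"),      # non-English instruction, English output
        Setting("Lang-Lang", lang, lang),    # instruction and output in the same non-English language
        Setting("En-Lang", "en", lang),      # English instruction, non-English output
    ]


if __name__ == "__main__":
    for s in settings_for("ja"):
        print(f"{s.name}: instruction={s.instruction_lang}, output={s.output_lang}")
```

Enumerating the pairings this way makes it explicit that each non-English language is compared against the same English-only baseline ({En}-{En}).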