Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe

TL;DR

VLMs can be fine-tuned on the authors' multilingual visual-text datasets. The work is the first to conduct such analyses in Swahili and Urdu, and it introduces rationales into VL analysis, which played a vital role in the evaluation.

Abstract

Large language models (LLMs) have increased interest in vision language models (VLMs), which process image-text pairs as input. Studies investigating the visual understanding ability of VLMs have been proposed, but such studies are still preliminary because existing datasets do not permit a comprehensive evaluation of the fine-grained visual linguistic abilities of VLMs across multiple languages. To further explore the strengths of VLMs, such as GPT-4V \cite{openai2023GPT4}, we developed new datasets for the systematic and qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced nine vision-and-language (VL) tasks (including object recognition, image-text matching, and more) and constructed multilingual visual-text datasets in four languages (English, Japanese, Swahili, and Urdu) by using templates containing \textit{questions} and prompting GPT-4V to generate the \textit{answers} and the \textit{rationales}; 2) we introduced a new VL task named \textit{unrelatedness}; 3) we introduced rationales to enable human understanding of the VLM reasoning process; and 4) we employed human evaluation to measure the suitability of the proposed datasets for VL tasks. We show that VLMs can be fine-tuned on our datasets. Our work is the first to conduct such analyses in Swahili and Urdu. It also introduces \textit{rationales} in VL analysis, which played a vital role in the evaluation.

Paper Structure

This paper contains 43 sections, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: To study the visual and multilingual abilities of VLMs (i.e., GPT-4V), we introduced nine tasks. The input to GPT-4V consists of an image, text, and questions, and it outputs an (answer, rationale) pair. We repeated this process in English (En), Japanese (Jp), Urdu (Ur), and Swahili (Sw), and constructed datasets in these four languages. (An expanded version appears in Figure \ref{fig:IntroFigureExpanded} in Appendix \ref{Appendix:IntroFigureExpanded}.)
  • Figure 2: During dataset construction, the VLM input is a prompt consisting of text, an image, and a question. The output from the VLM is an (answer, rationale) pair. Native speakers of En, Jp, Sw, and Ur evaluate the quality of each answer, taking into consideration the rationale generated by the VLM.
  • Figure 3: We gathered one image and all the text available in a Wikinews article. This is the definition of an image-text pair in our study.
  • Figure 4: The original image from our data, image i, is shown on the left. The augmented images (image ii, image iii, image iv, image v, and image vi) are shown on the right.
  • Figure 5: The performance of GPT-4V across all tasks as rated by native speakers of En, Jp, Sw, and Ur. The scores are normalized 5-point Likert scores.
  • ...and 5 more figures
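
The data flow described in the captions of Figures 2 and 5 can be pictured with a small Python sketch. It is illustrative only: the record type `VLExample`, its field names, and the helper functions are hypothetical, and the min-max mapping of 1-5 Likert ratings onto [0, 1] is an assumption, since the captions do not state how the scores were normalized.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable

# Language codes as used in the figure captions.
LANGUAGES = ("En", "Jp", "Sw", "Ur")


@dataclass(frozen=True)
class VLExample:
    """One dataset entry (hypothetical layout, not the authors' schema)."""
    task: str          # one of the nine VL tasks, e.g. "unrelatedness"
    language: str      # one of LANGUAGES
    image_path: str    # the single image taken from the Wikinews article
    article_text: str  # all text available in that article
    question: str      # template question given to the VLM
    answer: str        # answer generated by the VLM
    rationale: str     # rationale generated by the VLM


def normalize_likert(score: int, low: int = 1, high: int = 5) -> float:
    """Map a Likert rating onto [0, 1] (assumed min-max normalization)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


def mean_normalized_score(ratings: Iterable[int]) -> float:
    """Average the normalized ratings given by human evaluators."""
    return mean(normalize_likert(r) for r in ratings)
```

Under this assumption, a rating of 3 maps to 0.5, and averaging per task and per language would give one score for each cell of a plot like Figure 5.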