Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering

Triet Minh Thai, Son T. Luu

TL;DR

This work tackles multilingual visual question answering on UIT-EVJVQA by reframing VQA as a sequence-to-sequence task that leverages image features and hints from pre-trained vision-language models. The authors propose a two-phase approach: first, extract hints using ViLT and OFA in a zero-shot setting (with translation steps for Vietnamese and Japanese questions), then train a ConvS2S model on concatenated sequences of questions, hints, and image features (ViT patches) to generate free-form answers. They demonstrate that incorporating hints and image features improves performance, achieving a private-test F1 of 0.4210 and a public-test F1 of 0.3442, ranking 3rd in VLSP-EVJVQA. Qualitative analyses, including attention visualizations and error analysis, provide insights into how hints influence attention and where the system struggles across languages. The study suggests future work with additional vision-language models (e.g., BEiT, DeiT, CLIP) and potential real-world applications like multilingual image-based chatbots.

Abstract

Visual Question Answering (VQA) is a task that requires computers to give correct answers to input questions based on images. Humans solve this task with ease, but it remains a challenge for computers. The VLSP2022-EVJVQA shared task poses Visual Question Answering in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, in which we integrated hints from pre-trained state-of-the-art VQA models and image features with a Convolutional Sequence-to-Sequence (ConvS2S) network to generate the desired answers. Our approach achieved an F1 score of up to 0.3442 on the public test set and 0.4210 on the private test set, placing 3rd in the competition.
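To give a sense of what an F1 score over free-form answers measures, the following is a minimal sketch of a token-overlap F1 between a generated answer and a reference answer. It is an illustration under stated assumptions, not the shared task's official scorer: the function name `token_f1`, the lowercasing, and the whitespace tokenization are all hypothetical choices, and whitespace splitting in particular would not segment Japanese text properly.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answers (illustrative, not the official metric)."""
    # Assumption: lowercase and split on whitespace; a real scorer for
    # Vietnamese or Japanese would need language-specific tokenization.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("red car", "a red car"))
```

A score reported for a whole test set, such as the ones above, would typically be an average of per-sample scores of this kind, although the exact averaging used in the competition is not stated here.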

Paper Structure (21 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Multilingual samples from the UIT-EVJVQA dataset. From top to bottom, left to right: English (en), Vietnamese (vi) and Japanese (ja). The dataset contains a wide variety of questions; in some cases, the image contains noise that makes it difficult for a computer to distinguish the indicated object or action, for instance, "phones" in the English example or the action of "the man in green shirt" in the Vietnamese case. The Japanese example ("which hand is the girl putting onto the water?" in English) presents a tough scenario in which even humans find it challenging to give the proper response.
  • Figure 2: An overview of the proposed method for visual question answering on UIT-EVJVQA dataset
  • Figure 3: An example of a question and hints combination. The hint 'tree' occurs 11 times in the sequence because half of its probability is 11.19%.
  • Figure 4: Training loss and public testing loss comparison of ConvS2S model with different combinations of hint and image features.
  • Figure 5: Distributions of F1 and BLEU scores for each language from 100 generated samples
  • ...and 2 more figures
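
The question-and-hints combination described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `repeat_hints` and `build_source_sequence`, the whitespace tokenization, and the use of rounding down are assumptions. The only detail taken from the caption is that a hint is repeated according to half of its probability in percent, so 'tree' with a halved probability of 11.19% (that is, 22.38% before halving) appears 11 times.

```python
import math
from typing import List, Tuple


def repeat_hints(hints: List[Tuple[str, float]]) -> List[str]:
    """Repeat each hint label according to half of its probability (in percent)."""
    tokens: List[str] = []
    for label, probability_percent in hints:
        # Assumption: the repeat count is the halved probability rounded down.
        count = math.floor(probability_percent / 2)
        tokens.extend([label] * count)
    return tokens


def build_source_sequence(question: str, hints: List[Tuple[str, float]]) -> List[str]:
    """Concatenate question tokens with the repeated hint tokens."""
    # Assumption: whitespace tokenization, used here only for illustration.
    return question.split() + repeat_hints(hints)


if __name__ == "__main__":
    sequence = build_source_sequence("what is behind the house ?", [("tree", 22.38)])
    print(sequence.count("tree"))
```

Repeating a hint in proportion to its probability lets a plain token sequence carry the confidence of the pre-trained model without adding a separate numeric input; in the full method the image features would be joined to this sequence as well, which the sketch leaves out.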