Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering
Triet Minh Thai, Son T. Luu
TL;DR
This work tackles multilingual visual question answering on UIT-EVJVQA by reframing VQA as a sequence-to-sequence task that leverages image features and hints from pre-trained vision-language models. The authors propose a two-phase approach: first, extract hints using ViLT and OFA in a zero-shot setting (with translation steps for Vietnamese and Japanese questions), then train a ConvS2S model on concatenated sequences of questions, hints, and image features (ViT patches) to generate free-form answers. They demonstrate that incorporating hints and image features improves performance, achieving a private-test F1 of 0.4210 and a public-test F1 of 0.3442, ranking 3rd in VLSP-EVJVQA. Qualitative analyses, including attention visualizations and error analysis, provide insights into how hints influence attention and where the system struggles across languages. The study suggests future work with additional vision-language models (e.g., BEiT, DeiT, CLIP) and potential real-world applications like multilingual image-based chatbots.
Abstract
Visual Question Answering (VQA) is a task in which a computer must answer a question correctly based on an image. Humans solve this task with ease, but it remains challenging for computers. The VLSP2022-EVJVQA shared task poses VQA in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, integrating hints from pre-trained state-of-the-art VQA models and image features with a Convolutional Sequence-to-Sequence network to generate the desired answers. Our system achieved F1 scores of up to 0.3442 on the public test set and 0.4210 on the private test set, placing 3rd in the competition.
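To make the sequence-to-sequence framing concrete, the following is a minimal sketch of how a question, model hints, and image patch features might be combined into a single source sequence. It is an illustration under stated assumptions, not the authors' implementation: the separator token `<sep>`, the whitespace tokenization, the ordering of the parts, and the helper names `build_text_sequence` and `concat_with_patches` are all hypothetical, and the vectors are toy values standing in for learned token embeddings and ViT patch features.

```python
from typing import List, Sequence

SEP = "<sep>"  # hypothetical separator token, not taken from the paper


def build_text_sequence(question: str, hints: Sequence[str]) -> List[str]:
    """Join question tokens and hint tokens into one source token list."""
    tokens = question.lower().split()  # naive whitespace tokenization (assumption)
    for hint in hints:
        tokens.append(SEP)
        tokens.extend(hint.lower().split())
    return tokens


def concat_with_patches(
    text_vectors: Sequence[Sequence[float]],
    patch_vectors: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Append image patch vectors after the text vectors along the sequence axis."""
    combined = [list(v) for v in text_vectors] + [list(v) for v in patch_vectors]
    if not combined:
        return combined
    dim = len(combined[0])
    if any(len(v) != dim for v in combined):
        raise ValueError("all vectors must share one embedding dimension")
    return combined


# Toy usage: two hints, as if produced by two pre-trained VQA models.
tokens = build_text_sequence("What color is the bus?", ["red", "a red bus"])

# Toy 2-dimensional vectors: one per text token, plus three image patches.
text_vectors = [[0.0, 1.0] for _ in tokens]
patch_vectors = [[0.5, 0.5], [0.1, 0.9], [0.3, 0.7]]
source = concat_with_patches(text_vectors, patch_vectors)
```

In this sketch the encoder would receive `source`, whose length is the number of text tokens plus the number of image patches; a real system would use learned embeddings projected to a shared dimension rather than fixed toy values.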
