BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining

Minjun Kim, Seungwoo Song, Youhan Lee, Haneol Jang, Kyungtae Lim

TL;DR

This work tackles multilingual VQA by introducing BOK-VQA, a Korean-English dataset with 282,533 knowledge-base (KB) triples, and GEL-VQA, a language-independent knowledge graph (KG) embedding framework that injects external knowledge into VQA. It uses ConvKB embeddings of English KGs together with a triple-prediction module to link Korean questions to English knowledge, enabling cross-lingual knowledge grounding. Empirical results show substantial gains over baselines, robust performance in less-resourced languages, and evidence that KG embeddings can be effectively language-agnostic, especially when combined with multitask training. The work contributes a scalable, bilingual KB-VQA resource and a practical strategy for integrating KG information into multimodal reasoning systems.

Abstract

The current research direction in generative models, such as the recently developed GPT-4, aims to find relevant knowledge information for multimodal and multilingual inputs to provide answers. As a result, the demand for multilingual evaluation of visual question answering (VQA), a representative task for multimodal systems, has increased. Accordingly, we propose a bilingual outside-knowledge VQA (BOK-VQA) dataset that can be extended to additional languages. The proposed data include 17K images, 17K question-answer pairs in both Korean and English, and 280K instances of knowledge information related to the question-answer content. We also present a framework that can effectively inject knowledge information into a VQA system by pretraining the knowledge information of the BOK-VQA data in the form of graph embeddings. Finally, through in-depth analysis, we demonstrate the actual effect of the knowledge information contained in the constructed training data on VQA.
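
As a rough illustration of what "graph embeddings" of knowledge triples can look like, the sketch below scores a single (head, relation, tail) triple in the style of ConvKB, the KG embedding model named in the TL;DR. It is a minimal NumPy sketch and not the authors' implementation: the function name `convkb_score`, the argument shapes, and the omission of a convolution bias term are all assumptions made for illustration.

```python
import numpy as np

def convkb_score(h, r, t, filters, w):
    """ConvKB-style score for one (head, relation, tail) triple.

    Hypothetical helper for illustration only.
    h, r, t : (k,) embedding vectors
    filters : (n_filters, 3) array, one 1x3 convolution filter per row
    w       : (n_filters * k,) weight vector for the final dot product
    """
    # Stack the three embeddings as the columns of a k x 3 matrix.
    triple_matrix = np.stack([h, r, t], axis=1)

    # A 1x3 filter slides over the rows, so each filter yields one
    # length-k feature map; ReLU is applied elementwise.
    feature_maps = np.maximum(triple_matrix @ filters.T, 0.0)  # (k, n_filters)

    # Concatenate the feature maps filter by filter and project to a scalar.
    return float(feature_maps.T.reshape(-1) @ w)

# Toy usage with k = 2 and a single filter (values are arbitrary).
h = np.array([1.0, 2.0])
r = np.array([0.0, 1.0])
t = np.array([1.0, -1.0])
filters = np.array([[1.0, 1.0, -1.0]])
w = np.array([1.0, 0.5])
score = convkb_score(h, r, t, filters, w)
```

In a trained model the embeddings, filters, and `w` would be learned from the KB triples; here they are fixed toy values so the computation can be followed by hand.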

Paper Structure

This paper contains 23 sections, 29 equations, 15 figures, and 8 tables.

Figures (15)

  • Figure 1: Example of a BOK-VQA sample.
  • Figure 2: Architecture of VQA module.
  • Figure 3: Architecture of Triple Prediction Module.
  • Figure 4: GEL-VQA architecture: (a) VQA module, (b) triple prediction module, and (c) pretrained KGE.
  • Figure 5: Visualization of attention scores for the H-Given case.
  • ...and 10 more figures