Visual question answering: from early developments to recent advances -- a survey

Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Sunil Aryal, Imran Razzak, Hakim Hacid

TL;DR

This survey traces the evolution of Visual Question Answering (VQA) from early fusion-based architectures to modern large visual-language models (LVLMs), detailing vision and language encoders, fusion strategies, and open-vocabulary decoders. It surveys key datasets (e.g., VQA, Visual Genome, CLEVR, GQA) and metrics (accuracy variants, WUP, human-like evaluation), and discusses domain-specific applications in medicine, accessibility, and education. It places major emphasis on LVLMs as a unifying framework that enables zero-shot and few-shot VQA across diverse tasks, and it identifies open challenges in dataset scale, reasoning, and real-world deployment. The paper highlights practical implications for education and healthcare and points to future work in efficient cross-modal pretraining, adaptable adapters, and robust evaluation beyond traditional metrics.

Abstract

Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep learning-based methods, and explore the emerging field of Large Visual Language Models (LVLMs) that have demonstrated success in multimodal tasks like VQA. The paper further examines available datasets and evaluation metrics essential for measuring VQA system performance, followed by an exploration of real-world VQA applications. Finally, we highlight ongoing challenges and future directions in VQA research, presenting open questions and potential areas for further development. This survey serves as a comprehensive resource for researchers and practitioners interested in the latest advancements and future

Paper Structure (46 sections, 3 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview of a VQA system.
  • Figure 2: General VQA system.
  • Figure 3: VQA Taxonomy.
  • Figure 4: The accuracy of the model on different datasets over the years (from May 2015 to December 2022). Note: the results on the VQA v1.0 and VQA v2.0 sets are from the test-std sets. * represents the results from the test-dev set. The different colored lines represent the state-of-the-art models in the datasets over time.