Kvasir-VQA: A Text-Image Pair GI Tract Dataset

Sushant Gautam; Andrea Storås; Cise Midoglu; Steven A. Hicks; Vajira Thambawita; Pål Halvorsen; Michael A. Riegler

Abstract

We introduce Kvasir-VQA, an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets and augmented with question-and-answer annotations to facilitate advanced machine learning tasks in gastrointestinal (GI) diagnostics. The dataset comprises 6,500 annotated images spanning various GI tract conditions and surgical instruments, and it supports multiple question types, including yes/no, choice, location, and numerical count. It is intended for applications such as image captioning, visual question answering (VQA), text-based generation of synthetic medical images, object detection, and classification. Our experiments demonstrate the dataset's effectiveness in training models for three selected tasks (image captioning, VQA, and synthetic medical image generation), illustrating its applicability to medical image analysis and diagnostics. We also present evaluation metrics for each task, highlighting the usability and versatility of the dataset. The dataset and supporting artifacts are available at https://datasets.simula.no/kvasir-vqa.
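
To make the question-and-answer annotation structure concrete, the following is a minimal sketch of how one annotation might be represented and grouped per image. The record fields (`image_id`, `question`, `question_type`, `answer`), the helper names, and the sample rows are assumptions made for illustration; they are not the dataset's actual schema or contents. Only the four question types are taken from the abstract.

```python
from collections import Counter
from dataclasses import dataclass

# Question types named in the abstract.
QUESTION_TYPES = ("yes/no", "choice", "location", "numerical count")


@dataclass(frozen=True)
class QARecord:
    """One question-answer annotation for an image (hypothetical schema)."""
    image_id: str
    question: str
    question_type: str
    answer: str

    def __post_init__(self):
        # Reject question types outside the four listed in the abstract.
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {self.question_type!r}")


def group_by_image(records):
    """Map each image_id to the list of QA records attached to it."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.image_id, []).append(record)
    return grouped


def count_by_type(records):
    """Count how many records fall under each question type."""
    return Counter(record.question_type for record in records)


# Invented sample rows for illustration only; not taken from the dataset.
SAMPLE = [
    QARecord("img_001", "Is there a polyp in the image?", "yes/no", "yes"),
    QARecord("img_001", "How many polyps are visible?", "numerical count", "1"),
    QARecord("img_002", "Where is the instrument located?", "location", "center"),
]

if __name__ == "__main__":
    print(group_by_image(SAMPLE).keys())
    print(count_by_type(SAMPLE))
```

Grouping by image reflects that a single image can carry several questions of different types, which is the shape a VQA training loop would typically consume.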

Paper Structure (19 sections, 3 figures, 3 tables)

Introduction
Background and Related Work
Gastrointestinal Image Datasets
Image Captioning
Visual Question Answering
Synthetic Medical Image Generation
The Kvasir-VQA Dataset
Dataset Sources
Annotation Process
Final Dataset
Experiments
Image Captioning
Visual Question Answering
Synthetic Medical Image Generation
Discussion
...and 4 more sections

Figures (3)

Figure 1: [Task 1] An example from the fine-tuned captioning model, generating five captions for the given input image.
Figure 2: [Task 2] An example from the fine-tuned model answering questions about the input image.
Figure 3: [Task 3] An example from the fine-tuned synthetic medical image generation model, which generated five images for the given prompt.
