TM-PATHVQA: 90000+ Textless Multilingual Questions for Medical Visual Question Answering

Tonmoy Rajkhowa; Amartya Roy Chowdhury; Sankalp Nagaonkar; Achyut Mani Tripathi

TL;DR

This work addresses the lack of hands-free, multilingual VQA support for medical images by introducing TM-PathVQA, a textless dataset of spoken questions in English, German, and French tied to PathVQA images. The dataset is built by translating PathVQA questions with SeamlessM4T and pairing them with 70 hours of audio, yielding 98,397 question-answer pairs across 5,004 images. A Multi-Modal Learning (MML) framework fuses audio and image features through a Transformer encoder, and is used to evaluate multiple feature combinations on Yes/No and open-ended questions. Results show that speech-based VQA with HuBERT audio features and Faster R-CNN image features outperforms text-based baselines across languages, highlighting the practicality of spoken, multilingual medical VQA and providing a foundation for future attention-based MML approaches.

Abstract

In healthcare and medical diagnostics, Visual Question Answering (VQA) may emerge as a pivotal tool in scenarios where analysis of intricate medical images is critical for accurate diagnosis. Current text-based VQA systems are of limited use where hands-free interaction and accessibility are crucial while performing tasks. A speech-based VQA system may provide a better means of interaction, allowing information to be accessed while other tasks are carried out simultaneously. To this end, this work implements a speech-based VQA system by introducing the Textless Multilingual Pathological VQA (TM-PathVQA) dataset, an expansion of the PathVQA dataset containing spoken questions in English, German, and French. The dataset comprises 98,397 multilingual spoken questions and answers based on 5,004 pathological images, along with 70 hours of audio. Finally, this work benchmarks and compares TM-PathVQA systems implemented using various combinations of acoustic and visual features.
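
To give a feel for the scale of the figures quoted above, the following minimal sketch derives a few rough averages from them. It rests on two assumptions not stated in the abstract: that the 98,397 questions are split evenly across the three languages, and that the 70 hours of audio cover the spoken questions only. The function name `dataset_summary` and the dictionary keys are hypothetical, chosen for illustration.

```python
# Figures quoted in the abstract.
TOTAL_QA = 98_397      # multilingual spoken questions and answers
NUM_IMAGES = 5_004     # pathological images
AUDIO_HOURS = 70       # total audio duration
LANGUAGES = ("English", "German", "French")


def dataset_summary(total_qa=TOTAL_QA, num_images=NUM_IMAGES,
                    audio_hours=AUDIO_HOURS, num_languages=len(LANGUAGES)):
    """Derive rough averages from the headline dataset figures.

    Assumption: questions are split evenly across languages, and the
    audio duration covers the spoken questions only.
    """
    qa_per_language = total_qa / num_languages
    return {
        "qa_per_language": qa_per_language,
        "qa_per_image": total_qa / num_images,
        "qa_per_image_per_language": qa_per_language / num_images,
        "mean_seconds_per_question": audio_hours * 3600 / total_qa,
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value:.2f}")
```

Under the even-split assumption, 98,397 divides exactly by three, giving 32,799 questions per language. The per-question duration is only an order-of-magnitude estimate, since the abstract does not say how the 70 hours are distributed.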

Paper Structure (11 sections, 1 figure, 5 tables)

This paper contains 11 sections, 1 figure, 5 tables.

Introduction
TM-PathVQA Dataset
Experimental Methodology
Representations for Different Modalities
Text Representation
Audio Representation
Visual Representation
MML Framework for TM-PathVQA
Evaluation Metrics
Results & Discussions
Conclusion

Figures (1)

Figure 1: (a) Pathological image and spectrogram visualizations of the corresponding speech-based question and answer from the TM-PathVQA dataset in (b) English (Q: "Where are liver stem cells (oval cells) located?", Ans: "In the canals of Hering."), (c) German (Q: "Wo befinden sich Leberstammzellen (ovale Zellen)?", Ans: "In the canals of Hering."), and (d) French (Q: "Où se trouvent les cellules souches hépatiques (cellules ovales)?", Ans: "In the canals of Hering.").
