Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering

Rakesh Thakur, Yusra Tariq, Rakesh Chandra Joshi

TL;DR

This work tackles medical visual question answering by introducing Q-FSRU, which fuses frequency-domain representations with a quantum-inspired retrieval augmentation to ground reasoning in external medical knowledge. By transforming image and text features into frequency spectra via FFT, applying learnable spectral compression, and using a fidelity-based quantum retrieval (Quantum RAG), the model captures global patterns and nuanced clinical relationships. Evaluation on VQA-RAD shows state-of-the-art accuracy and robust cross-dataset generalization to PathVQA, with ablations confirming the critical roles of frequency processing, quantum retrieval, and dual contrastive learning. The approach enhances interpretability and offers a promising path for clinically deployable AI tools in radiology and pathology.
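As a rough illustration of what a fidelity-based retrieval step could look like, the sketch below amplitude-encodes real feature vectors as unit-norm states and ranks knowledge entries by the pure-state fidelity |⟨a|b⟩|². The encoding, the function names (`to_state`, `fidelity`, `retrieve`), and the toy vectors are assumptions made for illustration; the summary above does not specify how Q-FSRU constructs its states or its knowledge base.

```python
import numpy as np

def to_state(v):
    """Amplitude-encode a real feature vector as a unit-norm state (assumed encoding)."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return v / norm

def fidelity(a, b):
    """Pure-state fidelity |<a|b>|^2, which lies in [0, 1]."""
    return float(np.abs(np.vdot(to_state(a), to_state(b))) ** 2)

def retrieve(query, knowledge, k=2):
    """Return (index, fidelity) pairs for the k entries most similar to the query."""
    scores = [fidelity(query, doc) for doc in knowledge]
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), scores[i]) for i in order]

# Toy embeddings standing in for a query and three knowledge entries (hypothetical data).
query = [1.0, 0.0, 1.0]
knowledge = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 1.0],  # partial overlap
]
top = retrieve(query, knowledge, k=2)
print(top)
```

Unlike cosine similarity, fidelity squares the overlap, so it ignores the sign of the inner product and penalizes partial matches more strongly.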

Abstract

Answering difficult clinical questions that require understanding both images and text remains a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model extracts features from medical images and their associated text, then maps them into the frequency domain using the Fast Fourier Transform (FFT). This lets it emphasize the most informative components and filter out noise and less useful information. To improve accuracy and ground answers in established medical knowledge, we add a quantum-inspired retrieval system that fetches relevant medical facts from external sources using quantum-based similarity measures. The retrieved facts are then merged with the frequency-based features to support stronger reasoning. We evaluated the model on the VQA-RAD dataset, which contains real radiology images and questions. Q-FSRU outperforms earlier models, especially on complex cases that require joint image-text reasoning. Combining frequency and quantum-inspired information improves both performance and explainability. Overall, this approach offers a promising way to build accurate, transparent, and useful AI tools for clinicians.
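The frequency-domain step described in the abstract can be sketched as follows: features are moved into the frequency domain with a real FFT, each frequency bin is scaled by a weight, and the result is transformed back. This is a minimal sketch under stated assumptions, not the paper's implementation. The function name `spectral_filter` is hypothetical, and the fixed `weights` vector only stands in for parameters that would be learned during training.

```python
import numpy as np

def spectral_filter(features, weights):
    """Scale each frequency bin of the features by a per-bin weight.

    features: real array of shape (batch, dim)
    weights:  real array of shape (dim // 2 + 1,), a stand-in for learnable
              filter parameters (assumption: fixed here, not trained)
    """
    spectrum = np.fft.rfft(features, axis=-1)   # to the frequency domain
    filtered = spectrum * weights                # per-bin reweighting
    # Back to the feature domain, keeping the original length.
    return np.fft.irfft(filtered, n=features.shape[-1], axis=-1)

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 8))               # toy image or text features
n_bins = features.shape[-1] // 2 + 1             # 5 bins for dim = 8

identity = np.ones(n_bins)                       # passes every frequency
low_pass = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # keeps only the two lowest bins

unchanged = spectral_filter(features, identity)
smoothed = spectral_filter(features, low_pass)
print(smoothed.shape)
```

With all-ones weights the round trip reproduces the input, which is a convenient sanity check; zeroing the upper bins removes the high-frequency content, one simple way to suppress noise-like variation.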

Paper Structure

This paper contains 36 sections, 19 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The architecture of the proposed Q-FSRU model for Medical Visual Question Answering. It integrates four main components: multimodal feature extraction, frequency-domain enhancement via FFT, quantum-inspired knowledge retrieval, and multimodal fusion with contrastive learning. Together, these modules enable effective reasoning over medical images and clinical questions.
  • Figure 2: Frequency spectrograms of input medical image and text features. The spectra highlight the main frequency components that are later processed with learnable filter banks.