Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis

Lin Fan, Xun Gong, Cenyang Zheng, Yafei Ou

TL;DR

A novel Triangular Reasoning VQA (Tri-VQA) framework is proposed, which constructs reverse causal questions from the perspective of "Why this answer?" to elucidate the source of the answer and stimulate more reasonable forward reasoning processes.

Abstract

Medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their results stem from correct reasoning or from coincidentally correct answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of "Why this answer?" to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our code and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.

Paper Structure (16 sections, 5 equations, 4 figures, 6 tables)

This paper contains 16 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Joint Embedding VQA framework vs. Triangular Reasoning VQA framework. Tri-VQA utilizes mutual inference constraints among v (visual), q (question), and a (answer) to explain the rationality of generated answers.
  • Figure 2: The overall framework of Tri-VQA. Forward inference uses the two input information sources (orange arrows) to infer the answer. The predicted answer is then used for backward inference of the image features or the question features (green and blue arrows, respectively). The backward-inferred features are tied to the ground-truth features by a similarity constraint. Both sets of backward-inferred features are then fed into the forward inference function $F$ to infer the final answer, which is further constrained by the true label.
  • Figure 3: Tri-VQA Component Ablation Experiment.
  • Figure 4: The change in similarity measurement metrics during training.
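
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the linear maps `W_f`, `W_v`, `W_q`, the choice to condition each backward step on the other input modality, the cosine-based similarity loss, and the unweighted sum of loss terms are all assumptions made here for illustration. The caption itself specifies only the forward function $F$, the two backward inferences, a similarity constraint, and a final answer supervised by the true label.

```python
import math
import random

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(z):
    m = max(z)
    exps = [math.exp(t - m) for t in z]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def cross_entropy(probs, label):
    return -math.log(max(probs[label], 1e-12))

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def tri_vqa_losses(v, q, label, W_f, W_v, W_q):
    """Hypothetical triangular-reasoning losses for one (v, q, answer) sample."""
    # Forward inference F: (v, q) -> answer distribution.
    a_hat = softmax(linear(W_f, v + q))
    # Backward inference (assumed form): predicted answer plus the other
    # modality reconstructs the image or question features.
    v_hat = linear(W_v, a_hat + q)
    q_hat = linear(W_q, a_hat + v)
    # Similarity constraint against the ground-truth features
    # (cosine distance is an assumption).
    sim_v = 1.0 - cosine_similarity(v_hat, v)
    sim_q = 1.0 - cosine_similarity(q_hat, q)
    # Both backward-inferred features go back through F for the final answer.
    a_final = softmax(linear(W_f, v_hat + q_hat))
    forward = cross_entropy(a_hat, label)
    final = cross_entropy(a_final, label)
    return {
        "forward": forward,
        "sim_v": sim_v,
        "sim_q": sim_q,
        "final": final,
        "total": forward + sim_v + sim_q + final,  # unweighted sum (assumption)
    }

# Toy usage with random features and weights (dimensions are arbitrary).
rng = random.Random(0)
dim_v, dim_q, n_answers = 4, 3, 5
v = [rng.uniform(-1, 1) for _ in range(dim_v)]
q = [rng.uniform(-1, 1) for _ in range(dim_q)]
W_f = random_matrix(rng, n_answers, dim_v + dim_q)
W_v = random_matrix(rng, dim_v, n_answers + dim_q)
W_q = random_matrix(rng, dim_q, n_answers + dim_v)
losses = tri_vqa_losses(v, q, label=2, W_f=W_f, W_v=W_v, W_q=W_q)
print(losses)
```

In a trainable version, the three maps would be learned networks and the loss terms would likely be weighted. The sketch only shows how the forward, backward, and re-forward passes connect.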