Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA

Itbaan Safwan, Muhammad Annas Shaikh, Muhammad Haaris, Ramail Khan, Muhammad Atif Tahir

TL;DR

This work tackles explainable gastrointestinal (GI) image understanding with a multi-task framework that jointly trains visual question answering, textual explanation generation, and visual grounding. A LoRA-tuned Florence-2 backbone is trained on three data streams: QA pairs from Kvasir-VQA-x1 with medical reasoning metadata, generated textual explanations, and text-to-region grounding pairs built from pseudo-masks and real masks. The multi-task approach improves visual grounding and language quality over single-task baselines; the best model scores BLEU ≈ 0.454, ROUGE-L ≈ 0.653, and BERTScore F1 ≈ 0.948 on a private dataset. Limitations include limited diversity in the synthetic explanations and imbalance in the grounding data. Proposed future work includes refining pseudo-masks with SAM, augmenting the datasets, and incorporating expert-grounded explanations to improve reliability and clinical utility.

Abstract

We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
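
The idea of training one model on three task streams can be illustrated with a minimal sketch of how such streams might be merged into a single training sequence. This is not the authors' pipeline: the task prefixes (`<VQA>`, `<EXPLAIN>`, `<GROUND>`), the `Example` record, the `mix_streams` helper, and the uniform task-sampling rule are all assumptions introduced here for illustration, and the explanation and region targets are placeholders. Only the sample question and answer are taken from the paper's Figure 2 caption.

```python
import random
from dataclasses import dataclass

# Hypothetical task prefixes; the source does not list the actual prompts.
TASK_PREFIX = {
    "vqa": "<VQA>",
    "explain": "<EXPLAIN>",
    "ground": "<GROUND>",
}


@dataclass(frozen=True)
class Example:
    task: str      # which stream the example came from
    image_id: str  # hypothetical image identifier
    prompt: str    # task-prefixed input text
    target: str    # text the model should generate


def make_example(task, image_id, text, target):
    """Prefix the input with a task tag so one model can serve all tasks."""
    return Example(task, image_id, f"{TASK_PREFIX[task]} {text}", target)


def mix_streams(streams, num_samples, seed=0):
    """Build a training sequence: pick a task uniformly, then an example from it.

    Uniform task sampling is an assumption made here so that a small stream
    (for instance grounding) is not drowned out by a larger one.
    """
    rng = random.Random(seed)
    tasks = sorted(streams)  # fixed order keeps sampling reproducible
    mixed = []
    for _ in range(num_samples):
        task = rng.choice(tasks)
        mixed.append(rng.choice(streams[task]))
    return mixed


# Toy streams with one example each; targets for the explanation and
# grounding streams are placeholders, not real dataset entries.
streams = {
    "vqa": [make_example("vqa", "img_001",
                         "What is the size of the polyp?",
                         "polyp larger than 20 millimeters")],
    "explain": [make_example("explain", "img_001",
                             "Explain the answer.",
                             "placeholder explanation text")],
    "ground": [make_example("ground", "img_001",
                            "polyp",
                            "placeholder region encoding")],
}

batch = mix_streams(streams, num_samples=6, seed=0)
for ex in batch:
    print(ex.task, "|", ex.prompt, "->", ex.target)
```

Sampling the task first and the example second is one simple way to counter stream-size imbalance; sampling in proportion to stream size, or with tuned per-task weights, would be equally plausible choices.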

Paper Structure

This paper contains 12 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Our proposed multi-task training framework.
  • Figure 2: Comparison of model responses before and after multi-task training. Question: What is the size of the polyp? Actual Answer: polyp larger than 20 millimeters.
  • Figure 3: Sub-Task 2: Question-wise radar graph for each explainability metric on official results.
  • Figure 4: Our proposed multi-task training framework.
  • Figure 5: Comparison of model responses before and after multi-task training. Question: What is the size of the polyp? Actual Answer: polyp larger than 20 millimeters.
  • ...and 1 more figure