Location-Aware Pretraining for Medical Difference Visual Question Answering

Denis Musinguzi, Caren Han, Prasenjit Mitra

TL;DR

A pretraining framework is introduced that incorporates three location-aware tasks: automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that traditional pretraining methods often overlook.

Abstract

Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations needed to distinguish disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that traditional pretraining methods often overlook. We then integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.

Paper Structure (29 sections, 7 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Overview of the pretraining model architecture. The model consists of a SigLIP vision encoder and a transformer decoder. The vision encoder takes a chest X-ray image as input and produces visual tokens, which serve as cross-attention input to the transformer decoder. We adopt the following four tasks for pretraining: Cap, AREF, GCAP, and CAREF.
  • Figure 2: Overview of medical difference model architecture. The model consists of a frozen pretrained vision encoder, a vision adapter and a transformer decoder.
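
The data flow described in the Figure 2 caption (frozen pretrained vision encoder, vision adapter, transformer decoder) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `ToyPatchEncoder` and `DifferenceVQASketch`, the argument names `prior` and `current`, all dimensions, the MLP form of the adapter, and the choice to concatenate the two images' visual tokens before cross-attention are hypothetical. The toy patch encoder merely stands in for the pretrained encoder, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn


class ToyPatchEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained vision encoder."""

    def __init__(self, enc_dim: int = 32, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(1, enc_dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, 1, H, W) -> (B, num_patches, enc_dim)
        return self.proj(images).flatten(2).transpose(1, 2)


class DifferenceVQASketch(nn.Module):
    """Frozen vision encoder -> trainable adapter -> transformer decoder."""

    def __init__(self, vision_encoder: nn.Module, enc_dim: int,
                 dec_dim: int = 64, vocab_size: int = 100,
                 num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False  # keep the encoder frozen
        # Assumed adapter form: a small MLP mapping encoder width to decoder width.
        self.adapter = nn.Sequential(
            nn.Linear(enc_dim, dec_dim), nn.GELU(), nn.Linear(dec_dim, dec_dim)
        )
        self.embed = nn.Embedding(vocab_size, dec_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=dec_dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(dec_dim, vocab_size)

    def forward(self, prior: torch.Tensor, current: torch.Tensor,
                token_ids: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no gradients flow into the frozen encoder
            prior_tokens = self.vision_encoder(prior)
            current_tokens = self.vision_encoder(current)
        # Assumption: both images' tokens are concatenated along the sequence axis.
        memory = self.adapter(torch.cat([prior_tokens, current_tokens], dim=1))
        seq_len = token_ids.size(1)
        # Causal mask so each text position attends only to earlier positions.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=token_ids.device),
            diagonal=1,
        )
        hidden = self.decoder(self.embed(token_ids), memory, tgt_mask=causal)
        return self.lm_head(hidden)  # (B, seq_len, vocab_size) next-token logits


model = DifferenceVQASketch(ToyPatchEncoder(enc_dim=32), enc_dim=32)
prior = torch.randn(2, 1, 64, 64)      # two single-channel toy "X-rays"
current = torch.randn(2, 1, 64, 64)
token_ids = torch.randint(0, 100, (2, 5))
logits = model(prior, current, token_ids)
```

With 64x64 inputs and 16x16 patches, each image yields 16 visual tokens, so the decoder cross-attends over 32 adapted tokens per example. Only the adapter, embedding, decoder, and output head receive gradients in this sketch.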