CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training

Jiahe Qian, Yuhao Shen, Zhangtianyi Chen, Juexiao Zhou, Peisong Wang

TL;DR

CoTBox-TTT introduces an evidence-first test-time training framework for medical visual question answering that freezes all backbones and adapts only two lightweight soft prompts. By grounding answers in explicit visual evidence via bounding boxes and enforcing cross-view consistency between the original image and a localized crop with an EMA teacher, it reduces spurious attention and generation drift without requiring extra labels. The approach is model-agnostic and plug-and-play, delivering consistent gains in open-ended recall and closed-ended accuracy across the VQA-RAD, SLAKE, and PathVQA benchmarks and across various backbones, enabling more reliable deployment in clinical settings. This work demonstrates that lightweight, two-stage adaptation can enhance grounding, stability, and interpretability in medical VQA while maintaining efficiency suitable for real-world use.

Abstract

Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label-free and plug-and-play with diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on PathVQA.
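To make the "localized crop" idea concrete, the following is a minimal sketch of how a second view could be cut from a predicted bounding box. It is an illustration only, not the authors' implementation: the helper names `expand_and_clamp` and `crop`, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, and the `margin` parameter are all assumptions introduced here, and the image is a plain nested list rather than a real medical image.

```python
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def expand_and_clamp(box: Box, width: int, height: int, margin: float = 0.1) -> Box:
    """Grow the box by a relative margin on each side, then clip it to the image."""
    x0, y0, x1, y1 = box
    pad_x = int(round((x1 - x0) * margin))
    pad_y = int(round((y1 - y0) * margin))
    return (
        max(0, x0 - pad_x),
        max(0, y0 - pad_y),
        min(width, x1 + pad_x),
        min(height, y1 + pad_y),
    )


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Return the sub-image covered by the box (rows are y, columns are x)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 8x10 "image" whose pixel value encodes its position as row * 10 + col.
image = [[r * 10 + c for c in range(10)] for r in range(8)]

# A box touching the right edge: the margin is clipped at the image border.
box = expand_and_clamp((6, 2, 10, 6), width=10, height=8, margin=0.25)
view = crop(image, box)
```

The original image and `view` would then serve as the two inputs whose answers are compared; adding a small margin around the box is one plausible way to keep some surrounding context in the crop.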

Paper Structure

This paper contains 16 sections, 19 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of CoTBox-TTT. (a) Evidence localization: a grounding model conditioned on visual prompts predicts a bounding box on the original image, crops a localized view, and performs a second pass to validate the evidence. (b) Answer consistency: a vision–language model generates student answers on the original and cropped views while an EMA teacher provides targets on the same two views, and the student is aligned to the teacher across views. (c) Test-time adaptation updates only small soft prompts and keeps encoders and decoders frozen, yielding an interpretable evidence trail and consistent gains across backbones.
  • Figure 2: Qualitative examples of CoTBox-TTT. Each case shows the original image, the crop guided by the grounding model with the predicted bounding box, the responses of the baseline model with and without CoTBox-TTT, and the ground-truth label. In each case, the baseline model without test-time adaptation produces an answer that is either incomplete or inconsistent with the image content. CoTBox-TTT first localizes the clinically relevant region and then aligns the answers across the original and cropped views, converting ambiguous or incorrect predictions into medically specific, image-supported statements.
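
The student-teacher alignment described in Figure 1(b) can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's actual objective: the exact loss is not given in this summary, so the sketch assumes a KL divergence from teacher to student averaged over the two views, and it treats the adapted soft prompt as a flat list of floats. The function names and the `decay` value are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(
    student_orig: List[float],
    student_crop: List[float],
    teacher_orig: List[float],
    teacher_crop: List[float],
) -> float:
    """Assumed form: mean KL(teacher || student) over original and cropped views."""
    loss_orig = kl_divergence(softmax(teacher_orig), softmax(student_orig))
    loss_crop = kl_divergence(softmax(teacher_crop), softmax(student_crop))
    return 0.5 * (loss_orig + loss_crop)


def ema_update(teacher: List[float], student: List[float], decay: float = 0.99) -> List[float]:
    """Exponential moving average: the teacher slowly tracks the student prompt."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy answer logits over three candidate tokens for each view.
loss = consistency_loss(
    student_orig=[2.0, 0.5, 0.1],
    student_crop=[0.3, 1.8, 0.2],
    teacher_orig=[2.2, 0.4, 0.1],
    teacher_crop=[1.9, 0.6, 0.2],
)

# Only the (hypothetical) soft-prompt parameters are averaged; backbones stay frozen.
teacher_prompt = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

In a real setup the gradient of such a loss would flow only into the student's soft prompts, with the teacher refreshed by `ema_update` after each step; the slow-moving teacher is a common way to stabilize targets when no labels are available.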