LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

Myeongkyun Kang, Yanting Yang, Xiaoxiao Li

Abstract

Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
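
As a rough illustration of what jointly optimizing these three terms could look like, the sketch below implements a pairwise sigmoid image-text loss and combines it with two captioning losses. It is a minimal sketch under stated assumptions, not the authors' implementation: the pairwise form of the sigmoid loss, the function names, the `temperature` and `bias` defaults, and the single weight `lam` on the location-aware term are all hypothetical, since this excerpt does not state how the terms are defined or weighted.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def pairwise_sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image/text embeddings.

    Row i of img_emb is assumed to match row i of txt_emb; every other
    pairing in the batch is treated as a negative. The temperature and
    bias defaults are illustrative, not values from the paper.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = temperature * (img @ txt.T) + bias
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0  # +1 on the diagonal, -1 elsewhere
    return -np.sum(log_sigmoid(labels * logits)) / n


def joint_loss(l_sigmoid, l_caption, l_location, lam=1.0):
    # Hypothetical combination: the weighting scheme is an assumption.
    return l_sigmoid + l_caption + lam * l_location


# Toy batch: three orthogonal embeddings, matched versus shifted pairings.
embeddings = np.eye(3)
matched = pairwise_sigmoid_loss(embeddings, embeddings)
shuffled = pairwise_sigmoid_loss(embeddings, np.roll(embeddings, 1, axis=0))
total = joint_loss(matched, l_caption=2.0, l_location=3.0, lam=0.5)
```

In this toy batch the correctly paired embeddings give a lower sigmoid loss than the shifted pairing, which is the behavior a contrastive-style term is meant to reward.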

Paper Structure (15 sections, 3 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Illustration of (a) location-aware fine-grained representation learning frameworks and (b) a retrieval-based in-context learning pipeline.
  • Figure 2: Qualitative results with ground truth (dashed) and predictions (solid).
  • Figure 3: Phrase grounding performance on the PadChest-GR dataset for external validation.
  • Figure 4: Phrase grounding performance on the PadChest-GR dataset for internal validation.
  • Figure 5: (a)-(d) Ablation results of our method using all losses (filled), without $\mathcal{L}_{g,d}$ (hatched), and without both $\mathcal{L}_c$ and $\mathcal{L}_{g,d}$ (dotted). (e) Ablation results for different $\lambda$ values (line graph).