
ChEX: Interactive Localization and Region Description in Chest X-rays

Philip Müller, Georgios Kaissis, Daniel Rueckert

TL;DR

ChEX addresses the lack of interactivity and localized interpretability in chest X-ray report generation by introducing a multitask architecture that jointly handles textual prompts and bounding boxes. The model combines a ViT-based image encoder, a frozen CLIP-based prompt encoder, a DETR-style prompt detector, and a GPT-2 language model with P-tuning v2 to produce region-specific descriptions, and it also supports zero-shot inference. Trained on multi-source data (MIMIC-CXR, VinDr-CXR, CIG, NIH8, MS-CXR), ChEX is evaluated across nine tasks, where it is competitive with state-of-the-art baselines while offering interactive prompting and interpretable, region-grounded outputs. These capabilities advance clinical applicability by enabling radiologist-guided, transparent, and customizable chest X-ray interpretation pipelines.

Abstract

Report generation models offer fine-grained textual interpretations of medical images like chest X-rays, yet they often lack interactivity (i.e. the ability to steer the generation process through user queries) and localized interpretability (i.e. visually grounding their predictions), which we deem essential for future adoption in clinical practice. While there have been efforts to tackle these issues, they are either limited in their interactivity by not supporting textual queries or fail to also offer localized interpretability. Therefore, we propose a novel multitask architecture and training paradigm integrating textual prompts and bounding boxes for diverse aspects like anatomical regions and pathologies. We call this approach the Chest X-Ray Explainer (ChEX). Evaluations across a heterogeneous set of 9 chest X-ray tasks, including localized image interpretation and report generation, showcase its competitiveness with SOTA models while additional analysis demonstrates ChEX's interactive capabilities. Code: https://github.com/philip-mueller/chex

Paper Structure (92 sections, 11 equations, 16 figures, 14 tables)


Figures (16)

  • Figure 1: Overview of ChEX. Given a chest X-ray and a user query, either as a textual prompt (e.g., a pathology name, an anatomical region, or both) or as a bounding box, the model predicts a textual description of the queried region or aspect. For textual user prompts, it additionally predicts relevant bounding boxes. Thus, ChEX facilitates the interactive interpretation of chest X-rays while providing (localized) interpretability.
  • Figure 2: Architecture of ChEX. The DETR-style prompt detector predicts bounding boxes and features for ROIs based on prompt tokens (textual prompts encoded by the prompt encoder) and patch features (from the image encoder). The sentence generator is then used to predict textual descriptions for each ROI independently.
  • Figure 3: Comparison of ChEX with specialized SOTA and common multitask models on 9 chest X-ray tasks, including sentence grounding (SG), pathology detection (OD), region classification (RC), region explanation (RE), and full report generation (RG). ChEX shows excellent performance on this wide range of tasks while none of the baselines is capable of even performing all of them. To improve readability, values are scaled relative to the results of ChEX.
  • Figure 4: Effect of interactive prompting for multiregion disambiguation. In samples where the same pathology (e.g., pleural effusion) is present in both lungs, a textual query without regional hints ("pleural effusion") detects both pathology instances equally well, while adding a coarse regional hint ("pleural effusion in the right lung") or a fine regional hint ("pleural effusion in the right lower lung") steers the model towards selecting the queried pathology instance.
  • Figure 5: Effect of interactive prompting with regional hints to negative regions. In samples with a pathology in only one of the lungs (e.g., lung opacity in the right lung), using no regional hint ("lung opacity") detects the pathology mostly correctly. Adding the correct regional hint ("lung opacity in the right lung") improves the localization while the regional hint for the opposite lung ("lung opacity in the left lung") steers the model towards the queried anatomical region (away from the pathology), as expected.
  • ...and 11 more figures
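
The Figure 3 caption notes that values are scaled relative to the results of ChEX. A minimal Python sketch of that kind of per-task scaling is given here, assuming each task has a single positive score per model and that a baseline which cannot perform a task is simply absent from that task's entry. The function name, the task and baseline labels, and all numbers are hypothetical placeholders for illustration; they are not taken from the paper.

```python
from typing import Dict


def scale_relative_to_reference(
    scores: Dict[str, Dict[str, float]],
    reference: str = "ChEX",
) -> Dict[str, Dict[str, float]]:
    """Divide each model's score on a task by the reference model's score on that task."""
    scaled: Dict[str, Dict[str, float]] = {}
    for task, by_model in scores.items():
        ref = by_model[reference]
        if ref == 0:
            raise ValueError(f"reference score for task {task!r} is zero")
        # Models missing from a task (unable to perform it) stay missing.
        scaled[task] = {model: value / ref for model, value in by_model.items()}
    return scaled


# Placeholder numbers for illustration only; NOT results from the paper.
example_scores = {
    "task_a": {"ChEX": 0.50, "baseline_x": 0.25},
    "task_b": {"ChEX": 0.80, "baseline_y": 0.40},
}
scaled_scores = scale_relative_to_reference(example_scores)
```

After scaling, the reference model sits at 1.0 on every task, so tasks with different metrics and value ranges can share one axis, and a value below 1.0 means the baseline scored lower than the reference on that task.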