Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi

TL;DR

This work introduces instruction-guided lesion segmentation (ILS) for chest X-rays and presents MIMIC-ILS, the first large-scale, automatically constructed dataset linking lesion masks to natural-language instructions. Trained on MIMIC-ILS, ROSALIA, a vision-language model integrated with the Segment Anything Model (SAM), segments seven lesion types in response to instructions and provides textual explanations, outperforming a range of baselines. The approach hinges on an automated pipeline that grounds masks by aligning radiology reports with spatial information from pretrained vision models, which supports diverse instructions, including ones that target a lesion absent from the image. The contributions offer a scalable resource and a practical model for fine-grained CXR lesion grounding, with the potential to streamline radiology workflows and broaden access for non-experts.

Abstract

The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited by both a small number of target labels and a reliance on long, detailed, expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.

Paper Structure

This paper contains 39 sections, 3 equations, 13 figures, 13 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Examples of the instruction-guided CXR lesion segmentation task. Given text instructions for various lesion types and locations of interest, ROSALIA, a VLM trained on our MIMIC-ILS dataset, can: (A) segment lesions in a specified location, (B) segment lesions globally, and (C) detect empty-target cases. As can be seen in (A), ROSALIA correctly ignores the unrequested lesion in the left lung.
  • Figure 2: An overview of grounded lesion mask generation. (Top-left) Textual information is extracted from the radiology report during report structuring and location mapping. (Bottom-left and Center) Pretrained vision models are also employed to produce spatial information. (Right) Finally, a lesion mask is generated by integrating this information. The verification step then confirms the grounded location ($l_1$), identifies the empty location ($l_3$) for negative sample generation, and discards the reported-but-ungrounded location ($l_2$).
  • Figure 3: Instruction–answer pair generation process using the example report, “Bibasilar atelectasis. Cardiomegaly.” We utilize the elements extracted from the previous lesion mask generation process (see Fig. 2), indicated by the dashed box. Structured tuples (A and B in the top left) are converted to text instructions and mapped to their corresponding ground-truth masks and textual descriptions. Invalid instructions, for lesions that lack a corresponding mask, are excluded (shown in red), and only valid instructions are retained (shown in green). (ET: entity, PS: presence, CT: certainty, RL: reported location, GL: grounded location, EL: empty location)
  • Figure 4: Distribution of the MIMIC-ILS dataset. The y-axis indicates the number of samples, and the x-axis represents the lesion type. (CA: cardiomegaly, PN: pneumonia, AT: atelectasis, OP: opacity, CO: consolidation, ED: edema, EF: effusion)
  • Figure 5: Overview of ROSALIA. The architecture integrates a VLM with SAM. The VLM takes a CXR image and a segmentation instruction as input, generating both a textual description and a special [SEG] token. The hidden embedding of this [SEG] token is then passed to SAM's decoder to produce the final mask.
  • ...and 8 more figures
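
The verification and pair-generation steps described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `Pair` record, the instruction and answer templates, and the rule for handling a location that has mask evidence but no report mention are all hypothetical choices made for illustration. The sketch only assumes what the captions state, namely that a reported location with a mask is kept as grounded, a reported location without a mask is discarded, and an unreported location without a mask can serve as an empty-target (negative) sample.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    instruction: str
    has_mask: bool  # False marks an empty-target (negative) sample
    answer: str


def verify_locations(reported, grounded, candidates):
    """Label each candidate location as 'grounded', 'empty', or 'discard'.

    reported: locations mentioned in the report for this lesion.
    grounded: locations where a lesion mask was actually produced.
    """
    reported, grounded = set(reported), set(grounded)
    status = {}
    for loc in candidates:
        if loc in reported:
            # Reported and masked -> grounded; reported only -> discarded.
            status[loc] = "grounded" if loc in grounded else "discard"
        elif loc not in grounded:
            # Neither reported nor masked -> usable as a negative sample.
            status[loc] = "empty"
        else:
            # Assumption: mask evidence without a report mention is ambiguous.
            status[loc] = "discard"
    return status


def build_pairs(lesion, status):
    """Turn verified locations into instruction-answer pairs (toy templates)."""
    pairs = []
    for loc, label in status.items():
        if label == "discard":
            continue  # invalid instruction: no trustworthy mask or absence
        instruction = f"Segment the {lesion} in the {loc}."
        if label == "grounded":
            answer = f"The {lesion} is located in the {loc}."
        else:
            answer = f"There is no {lesion} in the {loc}."
        pairs.append(Pair(instruction, label == "grounded", answer))
    return pairs


status = verify_locations(
    reported=["left lower lung", "right lower lung"],
    grounded=["left lower lung"],
    candidates=["left lower lung", "right lower lung", "right upper lung"],
)
pairs = build_pairs("atelectasis", status)
```

In this toy run, the right lower lung is reported but has no mask, so no instruction is generated for it, while the right upper lung yields an empty-target instruction.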