SP-Det: Self-Prompted Dual-Text Fusion for Generalized Multi-Label Lesion Detection

Qing Xu, Yanqian Wang, Xiangjian He, Yue Li, Yixuan Zhang, Rong Qu, Wenting Duan, Zhen Chen

TL;DR

This work tackles the bottleneck of expert-annotated prompts in multimodal chest X-ray lesion detection by introducing SP-Det, a self-prompting framework that uses a medical vision-language model to generate two complementary textual prompts from each image. A bidirectional feature enhancer then fuses the semantic-context and disease-beacon prompts with visual features via cross-modal attention and dimension-preserving integration, improving localization and labeling across multiple thoracic diseases. The model is trained with a unified objective that combines contrastive region-text alignment with standard detection losses; it outperforms state-of-the-art methods on VinDr-CXR and generalizes favorably to ChestX-ray8 in zero-shot settings. These results suggest that fully automated prompt generation, grounded in medical VLMs, can support scalable multimodal lesion detection without expert-annotated prompts.

Abstract

Automated lesion detection in chest X-rays has demonstrated significant potential for improving clinical diagnosis by precisely localizing pathological abnormalities. While recent promptable detection frameworks have achieved remarkable accuracy in target localization, existing methods typically rely on manual annotations as prompts, which are labor-intensive and impractical for clinical applications. To address this limitation, we propose SP-Det, a novel self-prompted detection framework that automatically generates rich textual context to guide multi-label lesion detection without requiring expert annotations. Specifically, we introduce an expert-free dual-text prompt generator (DTPG) that leverages two complementary textual modalities: semantic context prompts that capture global pathological patterns and disease beacon prompts that focus on disease-specific manifestations. Moreover, we devise a bidirectional feature enhancer (BFE) that synergistically integrates comprehensive diagnostic context with disease-specific embeddings to improve feature representation and detection accuracy. Extensive experiments on two chest X-ray datasets with diverse thoracic disease categories demonstrate that SP-Det outperforms state-of-the-art detection methods while, unlike existing promptable architectures, requiring no expert-annotated prompts.
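The dual-text idea in the abstract can be illustrated with a minimal sketch. It assumes the report has already been produced by a medical vision-language model and that disease beacons are recovered by simple keyword matching against a fixed label vocabulary. The vocabulary, the function names (`extract_disease_beacons`, `build_dual_prompts`), and the sample report are all hypothetical; the paper's actual extraction procedure is not described in this excerpt.

```python
import re

# Hypothetical label vocabulary; the real category list is dataset-specific.
DISEASE_VOCAB = ("cardiomegaly", "pleural effusion", "pneumothorax", "nodule")


def extract_disease_beacons(report, vocab=DISEASE_VOCAB):
    """Return vocabulary entries that occur as whole phrases in the report.

    Assumption: plain keyword matching. It does not handle negation
    ("no pneumothorax" would still match), so it is only a sketch.
    """
    text = report.lower()
    found = []
    for name in vocab:
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append(name)
    return found


def build_dual_prompts(report, vocab=DISEASE_VOCAB):
    """Package the two complementary prompts derived from one report."""
    return {
        "semantic_context": report.strip(),  # full report: global context
        "disease_beacons": extract_disease_beacons(report, vocab),  # labels
    }


# Illustrative, made-up report text standing in for VLM output.
report = (
    "Enlarged cardiac silhouette consistent with cardiomegaly. "
    "Small left pleural effusion."
)
prompts = build_dual_prompts(report)
print(prompts["disease_beacons"])  # ['cardiomegaly', 'pleural effusion']
```

Both prompts come from the same generated report, so no expert input is needed at inference time; a practical system would likely need negation handling and synonym mapping on top of this.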

Paper Structure

This paper contains 27 sections, 14 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Comparison of our self-prompted detection with existing lesion detection paradigms in chest X-rays. (1) Non-prompted detection: Traditional models using only image-label pairs without expert knowledge. (2) Expert-prompted detection: Multimodal models requiring manually curated disease categories or clinical reports. (3) Self-prompted detection (SP-Det): Our approach automatically generates semantic context and disease beacons from CXR images using a medical vision-language model (VLM), eliminating manual expert annotation requirements.
  • Figure 2: Overview of the SP-Det framework for multi-label lesion detection. The framework comprises two components: (1) Expert-free dual-text prompt generator: A medical vision-language model automatically generates clinically relevant reports from CXR images to provide semantic context prompts, while disease categories extracted from these reports serve as disease beacon prompts. (2) Bidirectional feature enhancer: The process begins with self-attention enhancement of the highest-level image features $X_h$, followed by bidirectional cross-attention with text features, in which each modality alternately serves as queries (Q) and as keys/values (K, V). Finally, the refined high-level features are channel-wise concatenated with the original low-level image features $X_l$, preserving both semantic understanding and spatial detail.
  • Figure 3: Case study from the VinDr-CXR test set. Predicted bounding boxes are shown alongside the ground-truth annotations: each predicted disease is drawn in a distinct colour, and the ground truth is marked with a red box.
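The data flow of the bidirectional feature enhancer described in the Figure 2 caption can be sketched in plain Python. This is a simplified illustration under several assumptions not stated in the caption: single-head attention without learned projections, residual connections after each attention step, both cross-attention directions reading from the self-attended image features, and low-level features $X_l$ already resampled to the same number of token positions as $X_h$. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: q is (n, d); k and v are (m, d)."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def add(a, b):
    """Element-wise sum, used for the (assumed) residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bidirectional_enhance(x_h, text):
    # Step 1: self-attention over the highest-level image tokens.
    x_h = add(x_h, attention(x_h, x_h, x_h))
    # Step 2a: image tokens act as queries, text as keys/values.
    image_out = add(x_h, attention(x_h, text, text))
    # Step 2b: text tokens act as queries, image as keys/values.
    text_out = add(text, attention(text, x_h, x_h))
    return image_out, text_out


def concat_channels(x_l, x_h):
    """Channel-wise concatenation of position-aligned token features."""
    return [low + high for low, high in zip(x_l, x_h)]


x_h = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]              # 2 tokens, 3 channels
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]             # 4 text tokens
x_l = [[1.0, 2.0], [3.0, 4.0]]                        # 2 tokens, 2 channels

image_out, text_out = bidirectional_enhance(x_h, text)
fused = concat_channels(x_l, image_out)
print(len(fused), len(fused[0]))  # 2 5
```

Concatenating rather than summing leaves the low-level channels untouched, which matches the caption's stated aim of preserving spatial detail next to the text-refined semantics.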