SP-Det: Self-Prompted Dual-Text Fusion for Generalized Multi-Label Lesion Detection
Qing Xu, Yanqian Wang, Xiangjian He, Yue Li, Yixuan Zhang, Rong Qu, Wenting Duan, Zhen Chen
TL;DR
This work tackles the bottleneck of expert-annotated prompts in multimodal chest X-ray lesion detection by introducing SP-Det, a self-prompting framework that uses a medical vision-language model to generate two complementary textual prompts from each image. A bidirectional feature enhancer then fuses the semantic context and disease beacon prompts with visual features via cross-modal attention and dimension-preserving integration, improving localization and labeling across multiple thoracic diseases. The model is trained with a unified objective that combines contrastive region-text alignment with standard detection losses. It outperforms state-of-the-art methods on VinDr-CXR and generalizes favorably to ChestX-ray8 in zero-shot settings. These results suggest that fully automated prompt generation, grounded in medical VLMs, can enable scalable, clinically applicable multimodal lesion detection without expert-annotated prompts.
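The unified objective mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature of 0.07, the loss weights, and the use of a plain L1 box loss as the "standard detection loss" are all assumptions made for illustration, and the sketch presumes each region has already been matched to one ground-truth class.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def region_text_contrastive_loss(region_feats, text_feats, labels, temperature=0.07):
    """Cross-entropy over cosine similarities between region features
    (num_regions, dim) and per-class text embeddings (num_classes, dim).
    The temperature value is an assumption, not taken from the paper."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = r @ t.T / temperature                 # (num_regions, num_classes)
    logp = log_softmax(logits, axis=1)
    return float(-logp[np.arange(len(labels)), labels].mean())

def l1_box_loss(pred_boxes, gt_boxes):
    # Stand-in for a standard detection regression loss (assumption).
    return float(np.abs(pred_boxes - gt_boxes).mean())

def unified_loss(region_feats, text_feats, labels, pred_boxes, gt_boxes,
                 w_align=1.0, w_box=5.0):
    # Weighted sum of alignment and box terms; the weights are hypothetical.
    align = region_text_contrastive_loss(region_feats, text_feats, labels)
    box = l1_box_loss(pred_boxes, gt_boxes)
    return w_align * align + w_box * box

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(5, 16))             # 5 matched regions
    texts = rng.normal(size=(3, 16))               # 3 disease classes
    labels = np.array([0, 2, 1, 1, 0])
    pred = rng.uniform(size=(5, 4))
    gt = rng.uniform(size=(5, 4))
    print(unified_loss(regions, texts, labels, pred, gt))
```

The alignment term pulls each region feature toward the text embedding of its class and away from the others, which is the role the contrastive region-text alignment plays alongside the box losses.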
Abstract
Automated lesion detection in chest X-rays has demonstrated significant potential for improving clinical diagnosis by precisely localizing pathological abnormalities. While recent promptable detection frameworks have achieved remarkable accuracy in target localization, they typically rely on manual annotations as prompts, which are labor-intensive to produce and impractical in clinical applications. To address this limitation, we propose SP-Det, a novel self-prompted detection framework that automatically generates rich textual context to guide multi-label lesion detection without requiring expert annotations. Specifically, we introduce an expert-free dual-text prompt generator (DTPG) that leverages two complementary types of textual prompt: semantic context prompts that capture global pathological patterns and disease beacon prompts that focus on disease-specific manifestations. Moreover, we devise a bidirectional feature enhancer (BFE) that synergistically integrates comprehensive diagnostic context with disease-specific embeddings to significantly improve feature representation and detection accuracy. Extensive experiments on two chest X-ray datasets with diverse thoracic disease categories demonstrate that SP-Det outperforms state-of-the-art detection methods and, unlike existing promptable architectures, completely eliminates the dependency on expert-annotated prompts.
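To make the idea of bidirectional fusion concrete, the following NumPy sketch lets visual tokens attend to the two prompt types and lets the prompt tokens attend back to the visual tokens, with residual connections so that input shapes are preserved. It is a simplified illustration under stated assumptions, not the BFE itself: there are no learned projections, a single attention head is used, and the names `cross_attention` and `bidirectional_fuse` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections
    (an assumption to keep the sketch short)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values                   # same shape as queries

def bidirectional_fuse(visual, semantic_prompt, disease_prompt):
    """visual: (num_patches, dim); prompts: (num_tokens, dim).
    Returns enhanced visual and text tokens with unchanged shapes."""
    text = np.concatenate([semantic_prompt, disease_prompt], axis=0)
    visual_out = visual + cross_attention(visual, text)   # text -> image
    text_out = text + cross_attention(text, visual)       # image -> text
    return visual_out, text_out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(16, 8))     # 16 image patches
    semantic = rng.normal(size=(4, 8))    # semantic context prompt tokens
    disease = rng.normal(size=(3, 8))     # disease beacon prompt tokens
    v, t = bidirectional_fuse(visual, semantic, disease)
    print(v.shape, t.shape)
```

Because each direction adds an attention output to its own input, the enhanced features can be passed to a downstream detection head without any change in dimensionality.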
