VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection
Zhaohui Jin, Yi Shuai, Yongcheng Li, Lingcong Cai, Yun Li, Huifen Liu, Xiaomao Fan
TL;DR
This work addresses the challenge of early glottic carcinoma detection, which is hampered by the morphological similarity between glottic carcinoma and vocal cord dysplasia. It introduces MMGC-Net, a VisionLLM-based multimodal fusion network that combines a laryngoscopic image encoder (a BLIP-2-aligned pipeline with Q-Former) and a clinical report encoder (Llama3) through a laryngeal feature fusion block to produce a joint vision-language representation for classification. The authors present SYSU1H, a private dataset of 5,799 image-text pairs, and report state-of-the-art performance on this dataset, with clear gains over single-modality baselines and other multimodal models. The results suggest that combining imaging and textual clinical information improves detection robustness and accuracy, with potential for better clinical decision support in early glottic carcinoma screening.
Abstract
The early detection of glottic carcinoma is critical for improving patient outcomes, as it enables timely intervention, preserves vocal function, and significantly reduces the risk of tumor progression and metastasis. However, glottic carcinoma and vocal cord dysplasia are morphologically similar, which results in suboptimal detection accuracy. To address this issue, we propose MMGC-Net, a vision large language model-based (VisionLLM-based) multimodal fusion network for glottic carcinoma detection. By integrating image and text modalities, multimodal models can capture complementary information, leading to more accurate and robust predictions. We collect a private, real-world glottic carcinoma dataset named SYSU1H from the First Affiliated Hospital of Sun Yat-sen University, comprising 5,799 image-text pairs. We use an image encoder with an additional Q-Former to extract vision embeddings and Llama3 (Large Language Model Meta AI) to obtain text embeddings. The two sets of embeddings are then combined in a laryngeal feature fusion block, which improves glottic carcinoma identification performance. Extensive experiments on the SYSU1H dataset demonstrate that MMGC-Net achieves state-of-the-art performance, outperforming previous multimodal models.
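The data flow described in the abstract (vision embeddings and text embeddings merged by a fusion block, then classified) can be illustrated with a minimal sketch. The abstract does not specify how the laryngeal feature fusion block works internally, so the version below substitutes a deliberately simple stand-in: mean-pool each modality's token embeddings, concatenate them, and apply a linear classifier. The function names, toy dimensions, random embeddings, and the two-class output are all assumptions made for illustration; they are not the authors' implementation. Plain Python lists are used so the example has no dependencies.

```python
import math
import random


def mean_pool(tokens):
    """Average a list of equal-length token vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(vision_tokens, text_tokens, weights, bias):
    """Hypothetical stand-in for a fusion block plus classifier head.

    Pools each modality, concatenates the pooled vectors, and applies a
    linear layer followed by softmax. The real fusion block may differ.
    """
    fused = mean_pool(vision_tokens) + mean_pool(text_tokens)  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy inputs standing in for Q-Former outputs and Llama3 text embeddings.
rng = random.Random(0)
vision_dim, text_dim, num_classes = 4, 6, 2  # assumed toy sizes
vision_tokens = [[rng.gauss(0, 1) for _ in range(vision_dim)] for _ in range(8)]
text_tokens = [[rng.gauss(0, 1) for _ in range(text_dim)] for _ in range(5)]

# Randomly initialised classifier parameters (untrained, illustration only).
weights = [
    [rng.gauss(0, 0.1) for _ in range(vision_dim + text_dim)]
    for _ in range(num_classes)
]
bias = [0.0] * num_classes

probs = fuse_and_classify(vision_tokens, text_tokens, weights, bias)
print(probs)  # one probability per assumed class
```

Concatenation after pooling is the simplest way to form a joint vision-language representation; a learned fusion block would typically replace the pooling and concatenation steps with trainable layers.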
