MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation

Huangwei Chen, Yifei Chen, Zhenyu Yan, Mingyang Ding, Chenlei Li, Zhu Zhu, Feiwei Qin

TL;DR

Neuroblastoma (NB) pathology remains challenging due to tumor heterogeneity and observer variability. The authors propose MMLNB, a two-stage multimodal framework that first fine-tunes a Vision-Language Model (VLM) for pathology-aware text generation, then fuses VGG16 visual features with BERT-encoded text via the PRMF mechanism for NB subtype classification. On the private NBPath-7.5K and NBITP-1.5K datasets, MMLNB achieves state-of-the-art accuracy and AUROC, with ablations confirming the value of multi-modal fusion, LoRA-based fine-tuning, and noise-robust fusion. The approach improves interpretability and scalability in digital pathology for NB subtyping and offers a path toward more reliable, AI-assisted clinical workflows.

Abstract

Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. The approach follows a two-stage process. First, we fine-tune a Vision-Language Model (VLM) to enhance pathology-aware text generation. Second, the fine-tuned VLM generates textual descriptions, and a dual-branch architecture independently extracts visual and textual features. These features are fused via a Progressive Robust Multi-Modal Fusion (PRMF) Block for stable training. Experimental results show that the MMLNB model is more accurate than single-modal models. Ablation studies demonstrate the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in NB subtyping classification. Our source code is available at https://github.com/HovChen/MMLNB.
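The first stage fine-tunes the VLM with LoRA (per the TL;DR and Figure 1). As a rough illustration of the general LoRA idea only, not the authors' training code, the sketch below keeps a pretrained weight matrix frozen and adds a trainable low-rank update scaled by `alpha / rank`. The class name `LoRALinear`, the helper `matvec`, and all hyperparameter values are assumptions chosen for illustration.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration of LoRA; names and defaults are assumptions.
    """

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, never updated
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values and B starts at zero,
        # so the low-rank update is exactly zero before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Count only the entries of A and B; W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, rank=1)
y = layer.forward([1.0, 2.0, 3.0])  # equals W @ x at initialization
```

Because `B` is initialized to zero, the adapted layer reproduces the pretrained output until training begins, and only the entries of `A` and `B` would need gradients.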

Paper Structure

This paper contains 23 sections, 17 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The overall architecture of our proposed MMLNB model. The model consists of two stages: (a) fine-tuning a VLM via LoRA; (b) applying the fine-tuned VLM to generate pathology-aware textual descriptions to assist image classification.
  • Figure 2: The prompt for VLM. The structured prompt is designed to elicit precise pathological insights from Qwen2.5-VL.
  • Figure 3: Illustration of the PRMF Block. The PRMF Block balances image and text features with a confidence-weighted fusion mechanism, reducing the impact of noisy text. A curriculum learning strategy ensures stable training.
  • Figure 4: Visual representation of the NBPath-7.5K and NBITP-1.5K datasets. The datasets contain three subtypes of NB, namely UD, PD, and D.
  • Figure 5: The data processing workflow of the private NBPath-7.5K dataset. The process consists of three key stages: (1) Data Collection, involving the acquisition of WSIs and selection based on specific criteria; (2) Image Preprocessing, including structured sampling and patch extraction; (3) Data Organization, where images are labeled, categorized, and stored for further model training and evaluation.
  • ...and 2 more figures
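
The Figure 3 caption describes confidence-weighted fusion combined with a curriculum learning strategy. This section does not give the PRMF equations, so the following is only a minimal sketch of that general idea under stated assumptions: a sigmoid gate stands in for the text confidence, and a linear warm-up stands in for the curriculum. The function names, the `conf_logit` input, and `warmup_epochs` are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def curriculum_weight(epoch, warmup_epochs):
    """Linearly ramp the text contribution from 0 to 1 over warmup_epochs (assumed schedule)."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, epoch / warmup_epochs))


def fuse(image_feat, text_feat, conf_logit, epoch, warmup_epochs=10):
    """Add text features to image features, scaled by confidence and curriculum.

    conf_logit is a hypothetical scalar score for how trustworthy the
    generated text is; a low score pushes the output toward image-only.
    """
    if len(image_feat) != len(text_feat):
        raise ValueError("image and text features must have the same length")
    gate = sigmoid(conf_logit) * curriculum_weight(epoch, warmup_epochs)
    return [v + gate * t for v, t in zip(image_feat, text_feat)]


image = [1.0, 2.0]
text = [10.0, 10.0]
early = fuse(image, text, conf_logit=0.0, epoch=0)   # text is ignored at the start
late = fuse(image, text, conf_logit=0.0, epoch=10)   # text enters with gate 0.5
```

In this sketch the image branch always passes through unchanged, so a noisy or low-confidence description can at worst be ignored rather than overwrite the visual evidence.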