
Exploring visual language models as a powerful tool in the diagnosis of Ewing Sarcoma

Alvaro Pastor-Naranjo, Pablo Meseguer, Rocío del Amor, Jose Antonio Lopez-Guerrero, Samuel Navarro, Katia Scotlandi, Antonio Llombart-Bosch, Isidro Machado, Valery Naranjo

TL;DR

This work addresses the multiclass diagnosis of Ewing's sarcoma (ES) from histopathology images, framed as a multiple instance learning (MIL) problem on tissue microarrays. It compares vision-language pretraining (PLIP) with CNN features from fully supervised ImageNet pretraining, using a transformer-based MIL aggregator (TransMIL) to produce core-level predictions from patch embeddings. In-domain PLIP features paired with TransMIL achieve higher diagnostic accuracy with substantially fewer trainable parameters than the fully supervised CNN baselines, and perform strongly at separating ES from other sarcomas. The approach offers a data-efficient, cost-effective path to accurate histopathology-based sarcoma classification and may generalize to broader diagnostic tasks in digital pathology.
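
The MIL framing above can be illustrated with a minimal sketch: each core is a bag of patch embeddings, an aggregator turns the bag into one core embedding, and a classifier maps that embedding to class probabilities. This sketch assumes plain mean pooling and a linear softmax head as simple stand-ins; it is not TransMIL and not the authors' implementation. The function names, embedding size, weights, and class count are hypothetical toy values.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(patch_embeddings: Sequence[Vector]) -> Vector:
    """Aggregate a bag of patch embeddings into one core embedding by averaging."""
    if not patch_embeddings:
        raise ValueError("a bag must contain at least one patch embedding")
    n = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [sum(p[i] for p in patch_embeddings) / n for i in range(dim)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_core(patch_embeddings: Sequence[Vector],
                  weights: Sequence[Vector],
                  bias: Vector) -> Vector:
    """Return core-level class probabilities (weights has one row per class)."""
    core = mean_pool(patch_embeddings)
    logits = [sum(w * x for w, x in zip(row, core)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy bag: three patches with 2-dimensional embeddings (hypothetical values).
bag = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Toy linear head for three hypothetical classes.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

probs = classify_core(bag, weights, bias)
predicted_class = probs.index(max(probs))
```

In this setup only the aggregator and the classification head carry trainable parameters when the feature extractor is kept frozen, which is the intuition behind the reduced parameter count mentioned above.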

Abstract

Ewing's sarcoma (ES), characterized by a high density of small round blue cells without structural organization, is a significant health concern, particularly among adolescents aged 10 to 19. Artificial intelligence-based systems for the automated analysis of histopathological images show promise for supporting an accurate diagnosis of ES. In this context, this study explores, for the first time as far as we know, the feature extraction ability of different pre-training strategies for distinguishing ES from other soft tissue or bone sarcomas with similar morphology in digitized tissue microarrays. Vision-language supervision (VLS) is compared to fully supervised ImageNet pre-training within a multiple instance learning paradigm. Our findings indicate a substantial improvement in diagnostic accuracy with the adaptation of VLS using an in-domain dataset. Notably, these models not only improve the accuracy of the predicted classes but also drastically reduce the number of trainable parameters and the computational cost.

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: (a) Extraction process of cores from digitized tissue microarrays (TMAs); (b) chondrosarcoma (COND), (c) Ewing's sarcoma (ES), (d) gastrointestinal stromal tumor (GIST), and (e) rhabdomyosarcoma (RHABDO).
  • Figure 2: Method overview. Multi-class classification of different sarcomas under a multiple instance learning paradigm. In (A), features are extracted with a CNN-based model; in (B), with a vision-language model.
  • Figure 3: t-SNE visualization of feature embeddings. Core embeddings computed from patch features extracted with VGG16 (a) and PLIP (b) for the five neoplasms under study. In both cases, the core embedding was calculated by averaging all the patch embeddings.
  • Figure 4: Comparison of different embedding aggregators at the core level on the validation set.
  • Figure 5: Confusion matrices on the test set. (a) Trained VGG16 with BGAP as the embedding aggregator. (b) Frozen VGG16 with BGAP as the embedding aggregator. (c) Frozen PLIP with TransMIL as the embedding aggregator.
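
For readers less familiar with the confusion matrices reported in Figure 5, the following is a minimal sketch of how such a matrix is built from core-level predictions. The labels below are made-up toy values, not results from the paper, and the function names and the three-class setup are assumptions for illustration only.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int],
                     y_pred: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Count predictions: rows are true classes, columns are predicted classes."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical core-level labels for three classes (toy data only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = accuracy(cm)
```

Off-diagonal entries show which classes are confused with each other, which is the information a single accuracy figure hides.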