Abnormality-Driven Representation Learning for Radiology Imaging

Marta Ligero, Tim Lenz, Georg Wölflein, Omar S. M. El Nahhas, Daniel Truhn, Jakob Nikolas Kather

TL;DR

The paper proposes CLEAR, a framework for radiology images that combines embeddings extracted from 2D slices with attention-based aggregation to predict clinical endpoints efficiently. It also introduces lesion-enhanced contrastive learning (LeCL), a novel approach for learning visual representations driven by abnormalities in 2D axial slices across different locations of CT scans.

Abstract

To date, the most common approach for radiology deep learning pipelines is the use of end-to-end 3D networks based on models pre-trained on other tasks, followed by fine-tuning on the task at hand. In contrast, adjacent medical fields such as pathology, which focus on 2D images, have effectively adopted task-agnostic foundation models based on self-supervised learning (SSL), combined with weakly-supervised deep learning (DL). However, the field of radiology still lacks task-agnostic representation models due to the computational and data demands of 3D imaging and the anatomical complexity inherent to radiology scans. To address this gap, we propose CLEAR, a framework for radiology images that uses embeddings extracted from 2D slices along with attention-based aggregation to efficiently predict clinical endpoints. As part of this framework, we introduce lesion-enhanced contrastive learning (LeCL), a novel approach to obtain visual representations driven by abnormalities in 2D axial slices across different locations of the CT scans. Specifically, we trained single-domain contrastive learning approaches using three different architectures: Vision Transformers, Vision State Space Models, and Gated Convolutional Neural Networks. We evaluate our approach across three clinical tasks: tumor lesion location, lung disease detection, and patient staging, benchmarking against four state-of-the-art foundation models, including BiomedCLIP. Our findings demonstrate that CLEAR, using representations learned through LeCL, outperforms existing foundation models while being substantially more compute- and data-efficient.
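The attention-based aggregation step can be illustrated with a minimal sketch. The abstract only states that frozen 2D slice embeddings are aggregated with attention; the tanh scoring form used here (familiar from attention-based multiple-instance learning), the function name `attention_pool`, the weights `V` and `w`, and the toy dimensions are all assumptions for illustration, not the authors' implementation.

```python
import math
import random


def attention_pool(embeddings, V, w):
    """Aggregate per-slice embeddings into one scan-level vector.

    embeddings: list of K slice embeddings, each a list of D floats
    V: hypothetical learned projection, H rows of length D
    w: hypothetical learned scoring vector of length H
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for h in embeddings:
        # Hidden representation of one slice: tanh(V h)
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        # Scalar attention score for this slice: w^T tanh(V h)
        scores.append(sum(a * b for a, b in zip(w, hidden)))

    # Numerically stable softmax over the K slices
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Attention-weighted sum of the slice embeddings
    dim = len(embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(attn, embeddings)) for d in range(dim)]
    return pooled, attn


# Toy data: K slices, D-dimensional embeddings, H hidden units (all assumed)
rng = random.Random(0)
K, D, H = 5, 8, 4
slices = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
V = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
w = [rng.gauss(0, 0.5) for _ in range(H)]

pooled, attn = attention_pool(slices, V, w)
```

In a setup like this, only `V`, `w`, and a small classification head on top of `pooled` would be trained while the slice encoder stays frozen, and the weights in `attn` can be inspected per slice for explainability.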

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of the proposed framework CLEAR. Currently, end-to-end deep learning approaches in radiology mostly fine-tune the encoder separately for each specific task (A). We propose a weakly supervised pipeline that deploys a pretrained encoder to extract frozen embeddings, which are used for supervised training of an attention-based pooling model (B). For pretraining the feature extractor, we propose LeCL, a semi-supervised algorithm that guarantees that the abnormalities lie within the image crops (C).
  • Figure 1: Attention distribution across different slices: We evaluated the attention distribution across slices in a patient with lung and mediastinum lesions for BiomedCLIP (A), a MambaOut architecture trained using MoCo (B), and a MambaOut architecture trained using the LeCL approach with $\lambda = 0$ (C). Blue represents the attention for the slices processed in the abdominal window (D) and red represents the attention for the slices processed in the lung window (E).
  • Figure 2: Downstream task evaluation for CLEAR. (A) We evaluated our approach on three downstream tasks: lesion detection (Task 1), chest abnormality classification (Task 2), and patient staging (Task 3). (B) We compare 2D and 3D encoders as feature extractors to evaluate the CLEAR framework for multi-task multi-label classification (Tasks 1 and 2) and binary classification (Task 3). (C) For explainability purposes, we explore the distribution of attention scores across CT slices.
  • Figure 3: Model architecture characteristics: number of parameters and model categorization as multi-modal vs. unimodal foundation models and domain-specific vs. general-purpose (see \ref{fig:params}). We report characteristics for the existing foundation models (BiomedCLIP, Merlin, SAM, CT-CLIP) and for the architectures used for contrastive learning (VMamba, MambaOut, ViT).
  • Figure 4: Attention distribution across different slices: We evaluated the attention distribution across slices for a patient with liver and soft tissue lesions for BiomedCLIP (A), a MambaOut architecture trained using MoCo (B), and a MambaOut architecture trained using the LeCL approach with $\lambda = 0$ (C). Blue represents the attention for the slices processed in the abdominal window (D) and red represents the attention for the slices processed in the lung window (E). All models show higher attention to the abdominal window, where the lesion is better depicted.
  • ...and 1 more figures
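
The Figure 1 caption states that LeCL guarantees that abnormalities lie within the image crops. One way such a constraint could be enforced is sketched below. The bounding-box representation of a lesion, the function name `sample_lesion_crop`, and the uniform sampling strategy are assumptions made for illustration; the source does not specify how the crops are drawn.

```python
import random


def sample_lesion_crop(img_h, img_w, box, crop_h, crop_w, rng=random):
    """Sample the top-left corner of a crop that fully contains a lesion box.

    box: (y0, x0, y1, x1) with y1 and x1 exclusive (hypothetical annotation format)
    Returns (top, left) so that the crop [top, top + crop_h) x [left, left + crop_w)
    stays inside the image and covers the whole box.
    """
    y0, x0, y1, x1 = box
    if crop_h > img_h or crop_w > img_w:
        raise ValueError("crop is larger than the image")
    if y1 - y0 > crop_h or x1 - x0 > crop_w:
        raise ValueError("lesion box does not fit inside the crop")

    # The crop must start at or before the box and end at or after it,
    # while remaining within the image bounds.
    top_lo, top_hi = max(0, y1 - crop_h), min(y0, img_h - crop_h)
    left_lo, left_hi = max(0, x1 - crop_w), min(x0, img_w - crop_w)

    # randint is inclusive on both ends
    return rng.randint(top_lo, top_hi), rng.randint(left_lo, left_hi)


# Two random views of the same slice, both containing the lesion (toy values)
rng = random.Random(0)
lesion_box = (100, 220, 140, 270)
view_a = sample_lesion_crop(512, 512, lesion_box, 224, 224, rng)
view_b = sample_lesion_crop(512, 512, lesion_box, 224, 224, rng)
```

Under this assumption, the two views passed to the contrastive objective differ in position but always share the abnormal region, which is one plausible reading of abnormality-driven positive pairs.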