Semantic Alignment of Unimodal Medical Text and Vision Representations

Maxime Di Folco, Emily Chan, Marta Hasny, Cosmin I. Bercea, Julia A. Schnabel

TL;DR

This work addresses the gap between general-purpose AI models and specialized medical imaging tasks by introducing semantic alignment, which learns an at-most affine transformation $\mathcal{T}$ from semantically matched anchors to stitch models across architectures and modalities. It advances two contributions: (1) demonstrating cross-architecture stitching from general to medical models on chest X-ray data to transfer domain-specific knowledge without retraining, and (2) proposing a unimodal zero-shot classification method that maps image embeddings into the text space via $\mathcal{T}$, enabling comparison with text embeddings for decision making. The results show that semantic alignment can outperform naive stitching and approach domain-specific multimodal performance on text-driven tasks, while the unimodal zero-shot approach provides a scalable alternative to heavy medical multimodal models. Overall, the approach offers a lightweight, flexible pathway to incorporate medical knowledge into general AI systems, with potential for broad applicability in medical imaging tasks.

Abstract

General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation - at most affine - estimated from a subset of semantically corresponding samples, known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment - estimating transformations between anchors - can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.
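
To make the idea of estimating an at-most affine transformation from anchors more concrete, the following is a minimal sketch using ordinary least squares in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_affine` and `apply_affine`, the embedding dimensions, and the synthetic "anchor" embeddings are all hypothetical, and the paper may estimate the transformation differently.

```python
import numpy as np


def fit_affine(src_anchors, tgt_anchors):
    """Least-squares estimate of T(x) = x @ W + b from paired anchors.

    src_anchors: (n, d_src) embeddings of the anchors in the source space.
    tgt_anchors: (n, d_tgt) embeddings of the same anchors in the target space.
    """
    n = src_anchors.shape[0]
    # Append a column of ones so the bias b is estimated jointly with W.
    src_h = np.hstack([src_anchors, np.ones((n, 1))])
    sol, *_ = np.linalg.lstsq(src_h, tgt_anchors, rcond=None)
    return sol[:-1], sol[-1]


def apply_affine(x, W, b):
    """Map source-space embeddings into the target space."""
    return x @ W + b


# Synthetic stand-ins for two encoders' embeddings of the same anchors
# (hypothetical sizes: 200 anchors, 16-dim source, 8-dim target).
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 16))
true_W = rng.normal(size=(16, 8))
true_b = rng.normal(size=8)
tgt = src @ true_W + true_b

W, b = fit_affine(src, tgt)
max_err = float(np.abs(apply_affine(src, W, b) - tgt).max())
```

In this toy setting the target space is an exact affine image of the source space, so the fit recovers the mapping up to numerical precision. With real encoders the relation is only approximate, and the number of anchors then affects how well the estimated transformation generalises.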

Paper Structure

This paper contains 12 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Illustration of the semantic alignment process, which enables model stitching between encoders of different architectures or modalities by estimating, on the anchors, a transformation $\mathcal{T}$ that is at most affine.
  • Figure 2: Illustration of the proposed unimodal zero-shot classification method, which transforms image embeddings into the text space and computes similarity scores entirely between text embeddings.
  • Figure 3: Performance comparison of the affine, linear, l-ortho, and ortho transformations with a varying number of anchors for (a) Text (MIMIC-CXR) and (b) Vision (RSNA) classification tasks. "Naive" denotes model stitching without semantic alignment, and "upper bound" denotes the average performance in a normal setting. Mean and standard deviation are computed across all possible encoder-decoder combinations and runs.
  • Figure 4: Comparison of the proposed unimodal zero-shot classification against multimodal methods for the RSNA pneumonia dataset and the classes of the VinDr-CXR dataset.
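
The unimodal zero-shot classification idea described in the Figure 2 caption (map image embeddings into the text space, then score them against text embeddings) can be sketched roughly as follows. This is a hedged illustration only: `zero_shot_predict`, the use of cosine similarity, the orthogonal stand-in for the transformation, and the toy class-prompt embeddings are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_predict(image_emb, W, b, class_text_emb):
    """Map image embeddings into the text space and pick the closest class prompt.

    W and b stand for a transformation already estimated from anchors.
    Cosine similarity is an assumed choice of score.
    """
    mapped = l2_normalize(image_emb @ W + b)
    prompts = l2_normalize(class_text_emb)
    sims = mapped @ prompts.T  # shape: (n_images, n_classes)
    return sims.argmax(axis=1), sims


# Toy setup: two class prompts in a 4-dim text space (hypothetical values).
rng = np.random.default_rng(0)
class_text = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0]])
labels = np.array([0, 1, 1, 0, 1, 0])

# Stand-in transformation: a random orthogonal matrix and zero offset.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
W, b = Q, np.zeros(4)

# Build image embeddings that land near their class prompt once mapped by W.
targets = class_text[labels] + 0.01 * rng.normal(size=(6, 4))
image_emb = targets @ Q.T

pred, sims = zero_shot_predict(image_emb, W, b, class_text)
```

In this constructed example every prediction matches its label because the image embeddings were generated to map close to the matching prompt. With real encoders, the quality of the result depends on how well the estimated transformation aligns the two spaces.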