Random Token Fusion for Multi-View Medical Diagnosis

Jingyu Guo, Christos Matsoukas, Fredrik Strand, Kevin Smith

TL;DR

This work introduces Random Token Fusion (RTF), a novel technique for multi-view medical image analysis with vision transformers. RTF consistently improves the performance of existing fusion methods, paving the way for a new generation of multi-view medical foundation models.

Abstract

In multi-view medical diagnosis, deep learning-based models often fuse information from different imaging perspectives to improve diagnostic performance. However, existing approaches are prone to overfitting and rely heavily on view-specific features, which can lead to trivial solutions. In this work, we introduce Random Token Fusion (RTF), a novel technique designed to enhance multi-view medical image analysis using vision transformers. By integrating randomness into the feature fusion process during training, RTF addresses the issue of overfitting and enhances the robustness and accuracy of diagnostic models without incurring any additional cost at inference. We validate our approach on standard mammography and chest X-ray benchmark datasets. Through extensive experiments, we demonstrate that RTF consistently improves the performance of existing fusion methods, paving the way for a new generation of multi-view medical foundation models.

Paper Structure

This paper contains 16 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of the overfitting problem in multi-view medical diagnosis. The model's attention becomes overly focused on one of the two available views, resulting in an incomplete interpretation of the case. In this example (top), model attention in the MLO view dominates over the CC view in CBIS-DDSM (top left), and the frontal view over the lateral view in CheXpert (top right). Random Token Fusion (RTF) encourages the model to better utilize information from both views, resulting in balanced attention between both views and increased performance (bottom).
  • Figure 2: Multi-view ViTs with Random Token Fusion (RTF). RTF utilizes a local encoder to generate representations of the different views, followed by a token fusion module. This module divides the feature fusion into two distinct branches. One branch merges all tokens from both views using an existing fusion strategy, while the other randomly drops spatial tokens from each view before mixing them. The fused tokens are processed by a global encoder, which produces two types of predictions: one for the global tokens and one for the RTF tokens. During training, the losses of both branches are minimized towards the same task. After training, RTF tokens are no longer generated; instead, all tokens are merged using the model's fusion method and passed to the global encoder for inference.
  • Figure 3: Illustration of different fusion strategies. (Left) Common fusion strategies to fuse the features (tokens) of different views in ViTs. (Right) The proposed random token fusion (RTF) strategy. In RTF, we randomly drop spatial tokens from both images and combine the remaining ones, augmenting the representations during training.
  • Figure 4: Extended results on CBIS-DDSM (top) and CheXpert (bottom), showing the model's attention maps within the last block of the global encoder. RTF seems to address the issue of attention being allocated to uninformative areas, a common phenomenon observed in ViTs. It also encourages the model to focus on both views in many cases.
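
The token mixing described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that both views yield the same number of spatial tokens and that, at each spatial position, exactly one view's token is kept and the other is dropped. The names `random_token_fusion`, `concat_fusion`, and `keep_prob` are hypothetical, and concatenation stands in for whichever existing fusion strategy the model uses.

```python
import numpy as np


def random_token_fusion(tokens_a, tokens_b, keep_prob=0.5, rng=None):
    """Mix spatial tokens from two views by random dropping (training only).

    tokens_a, tokens_b: arrays of shape (num_tokens, dim), one per view.
    keep_prob: probability of keeping the view-A token at a position
               (hypothetical parameter; the sampling scheme is an assumption).
    Returns the fused tokens and the boolean mask of positions taken from A.
    """
    if tokens_a.shape != tokens_b.shape:
        raise ValueError("this sketch assumes both views have equal token shapes")
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = tokens_a.shape[0]
    # True: keep the token from view A and drop the one from view B.
    take_a = rng.random(num_tokens) < keep_prob
    fused = np.where(take_a[:, None], tokens_a, tokens_b)
    return fused, take_a


def concat_fusion(tokens_a, tokens_b):
    """Stand-in for the regular fusion branch: keep all tokens from both views."""
    return np.concatenate([tokens_a, tokens_b], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_cc = rng.normal(size=(16, 8))   # e.g. tokens from one view
    view_mlo = rng.normal(size=(16, 8))  # e.g. tokens from the other view

    rtf_tokens, mask = random_token_fusion(view_cc, view_mlo, rng=rng)
    all_tokens = concat_fusion(view_cc, view_mlo)
    print(rtf_tokens.shape, all_tokens.shape, int(mask.sum()))
```

During training, both `rtf_tokens` and `all_tokens` would be passed through the global encoder and their losses minimized towards the same task. At inference only the regular fusion branch would be used, which is why RTF adds no inference cost.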