Residual-based Language Models are Free Boosters for Biomedical Imaging

Zhixin Lai, Jing Wu, Suiyao Chen, Yucheng Zhou, Naira Hovakimyan

TL;DR

The paper proposes a residual-based booster (R-LLM) that inserts a frozen LLM transformer block into a ViT-based visual encoder to enhance biomedical imaging without language inputs. By employing trainable adaptation layers around the frozen block and residual connections both before and after the LLM, the method consistently improves 2D and 3D classification performance and even achieves state-of-the-art results on MedMNIST datasets. The approach avoids language prompts, cross-modal alignment, and pre-trained vision-language encoders, and it typically benefits from keeping the LLM block frozen during training. Ablation studies and Grad-CAM visualizations corroborate the importance of the residual design and the LLM weights in boosting performance, suggesting broad potential for LLMs as general-purpose visual boosters in biomedical imaging.

Abstract

In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. The approach diverges from established methodologies by utilizing a frozen transformer block, extracted from pre-trained LLMs, as an innovative encoder layer for the direct processing of visual tokens. This strategy represents a significant departure from the standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. More interestingly, as a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and enriching the understanding of their potential in this specialized domain.
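
To make the described data flow concrete, here is a minimal sketch of a residual booster layer: visual tokens pass through a trainable input adapter, a frozen block, and a trainable output adapter, and the result is added back to the original tokens. This is an illustration under stated assumptions, not the authors' implementation. The names `ResidualBooster`, `frozen_block`, `init_matrix`, and `matmul` are hypothetical, the frozen block is a single fixed `tanh` layer standing in for a real pre-trained LLM transformer block, and a single skip connection around the whole path simplifies the paper's residual design. Plain Python lists are used so the sketch runs without dependencies.

```python
import math
import random

random.seed(0)

def init_matrix(rows, cols, scale=0.02):
    """Random matrix stored as a list of rows (hypothetical helper)."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain-Python matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frozen_block(hidden, weight):
    """Stand-in for a frozen LLM transformer block (assumption).

    A real block would apply self-attention and an MLP with pre-trained
    weights; here a fixed linear map followed by tanh plays that role.
    """
    return [[math.tanh(v) for v in row] for row in matmul(hidden, weight)]

class ResidualBooster:
    def __init__(self, d_vit, d_llm):
        self.w_in = init_matrix(d_vit, d_llm)      # trainable adapter: ViT width -> LLM width
        self.w_frozen = init_matrix(d_llm, d_llm)  # frozen: would never receive updates
        self.w_out = init_matrix(d_llm, d_vit)     # trainable adapter: LLM width -> ViT width

    def __call__(self, tokens):
        h = matmul(tokens, self.w_in)
        h = frozen_block(h, self.w_frozen)
        h = matmul(h, self.w_out)
        # Residual connection: the boosted path is added to the input tokens.
        return [[t + d for t, d in zip(t_row, d_row)]
                for t_row, d_row in zip(tokens, h)]

# Four visual tokens of width 8, boosted through a width-16 frozen block.
tokens = init_matrix(4, 8, scale=1.0)
booster = ResidualBooster(d_vit=8, d_llm=16)
out = booster(tokens)
```

Because of the skip connection, the layer reduces to the identity whenever the output adapter contributes nothing, so inserting it cannot remove information already present in the visual tokens. In a real training setup only `w_in` and `w_out` would be passed to the optimizer.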

Paper Structure

This paper contains 24 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: R-LLM benefits baseline models on a broad range of datasets in biomedical imaging tasks under the AUC metric.
  • Figure 2: The proposed framework for applying language models as boosters for biomedical imaging classification tasks. We use the Vision Transformer (ViT) of Dosovitskiy et al. (2020) for demonstration.
  • Figure 3: Visual inspection of ViT-S and ViT-S with R-LLM using Grad-CAM on the original OCTMNIST dataset.