LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation

Mohammad Robaitul Islam Bhuiyan, Sheethal Bhat, Melika Qahqaie, Tri-Thien Nguyen, Paula Andrea Pérez Toro, Tomas Arias Vergara, Andreas Maier

Abstract

Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art Dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular speech-to-segmentation pipeline by leveraging pretrained foundation models with minimal parameter updates.
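To make the parameter-efficiency idea concrete, the following minimal sketch shows the standard LoRA formulation, in which a frozen weight matrix W is augmented by a trainable low-rank product BA scaled by alpha/r. The layer sizes, rank, and scaling value are illustrative assumptions and not the configuration used for GDINO in this work; the helper names are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight (d_out x d_in).
    A (r x d_in) and B (d_out x r) are the only trainable matrices.
    """
    rank = len(A)
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank trainable path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_out, d_in, rank):
    """Fraction of parameters that are trainable for one adapted layer."""
    frozen = d_out * d_in
    added = rank * (d_in + d_out)
    return added / (frozen + added)


# Illustrative values only (assumed, not taken from the paper).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # rank r = 1
B_zero = [[0.0], [0.0]]   # B starts at zero, so the output equals W x
B_trained = [[1.0], [2.0]]
x = [2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=16.0)
y_adapted = lora_forward(W, A, B_trained, x, alpha=16.0)
fraction = trainable_fraction(d_out=1024, d_in=1024, rank=8)
```

Because B is initialized to zero in the usual LoRA setup, the adapted layer reproduces the pretrained output before training, which is one way to read the claim that pretrained cross-modal knowledge is preserved.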

Paper Structure (15 sections, 5 equations, 2 figures, 4 tables)


Figures (2)

  • Figure 1: Overview of the proposed speech-guided detection-to-segmentation pipeline for brain tumor analysis. The speech signal $s(t)$ is transcribed and translated using Whisper ASR [radford2023robust] to obtain an English transcript $y = f_{\mathrm{ASR}}(s(t))$. The transcript is processed by spaCy [honnibal2020spacy] and negspaCy [chapman2001negex] to extract a tumor class cue $c$ and negation information $n$, which are converted into a text prompt $p = g(c,n)$. In parallel, the T1-weighted brain MRI $I \in \mathbb{R}^{H \times W}$ is provided to a LoRA-augmented GDINO localizer, which combines $I$ and $p$ to predict bounding boxes $B = f_{\mathrm{GDINO},\theta}(I,p)$. Each predicted box $b_i \in B$ is then used as a prompt for MedSAM to generate a pixel-wise mask, yielding the final tumor segmentation output.
  • Figure 2: Qualitative results at different stages of the proposed detection–segmentation pipeline for two samples of each of the three tumor types. Each row shows two representative cases: glioma (top), meningioma (middle), and pituitary tumor (bottom).
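
The prompt-construction step $p = g(c,n)$ described in the Figure 1 caption can be illustrated with a small sketch. This is not the spaCy + negspaCy implementation used in the pipeline: it swaps in a crude keyword match with a NegEx-style check for negation cues that precede the tumor term in the same sentence. The cue list, the function names, and the rule that a negated finding yields no prompt are all assumptions made for illustration.

```python
import re

# Tumor classes named in the Figure 2 caption.
TUMOR_CLASSES = ("glioma", "meningioma", "pituitary tumor")
# Assumed, deliberately small list of negation cues.
NEGATION_CUES = ("no", "not", "without", "no evidence of")


def extract_cue(transcript):
    """Return (class_cue, negated) from an English transcript.

    Prefers the first affirmed tumor mention; falls back to the first
    negated mention; returns (None, False) if no class is mentioned.
    """
    first_negated = None
    for sentence in re.split(r"[.;]", transcript.lower()):
        for cls in TUMOR_CLASSES:
            idx = sentence.find(cls)
            if idx == -1:
                continue
            preceding = sentence[:idx]
            negated = any(
                re.search(rf"\b{re.escape(cue)}\b", preceding)
                for cue in NEGATION_CUES
            )
            if not negated:
                return cls, False
            if first_negated is None:
                first_negated = cls
    if first_negated is not None:
        return first_negated, True
    return None, False


def build_prompt(c, n):
    """Hypothetical g(c, n): emit a detector prompt only for affirmed findings."""
    if c is None or n:
        return None
    return c


cue_a = extract_cue("There is a meningioma in the left frontal lobe.")
cue_b = extract_cue("No evidence of glioma.")
cue_c = extract_cue("No glioma. A pituitary tumor is visible.")
prompts = [build_prompt(*cue) for cue in (cue_a, cue_b, cue_c)]
```

A rule this simple would mislabel sentences such as "no glioma but a meningioma is seen", since any cue earlier in the sentence counts as negating; that limitation is one reason a dedicated negation-aware NLP component is used in the actual pipeline.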