BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning

Ching-Huei Tsou, Michal Ozery-Flato, Ella Barkan, Diwakar Mahajan, Ben Shapira

TL;DR

BioVERSE addresses the problem of siloed biomedical embeddings with a modular encoder–projector–LLM framework: modality-specific BioFM embeddings are projected into the LLM’s token space and treated as special tokens for joint reasoning. Training has two stages, alignment (autoregressive or contrastive) followed by light instruction tuning with LoRA, which enables zero-shot cross-modal tasks across scRNA-seq, proteins, and small molecules. On cell-type annotation, molecular description, and protein-oriented text generation, BioVERSE with compact backbones matches or surpasses larger text-only baselines while producing richer, explainable outputs and remaining practical to deploy. The framework is modular and extensible: it supports on-prem deployment and future expansion to additional modalities and backbones, and open-sourcing is intended to foster community benchmarking and progress in embedding-aware biomedical reasoning.

Abstract

Recent advances in large language models (LLMs) and biomedical foundation models (BioFMs) have achieved strong results in biological text reasoning, molecular modeling, and single-cell analysis, yet they remain siloed in disjoint embedding spaces, limiting cross-modal reasoning. We present BioVERSE (Biomedical Vector Embedding Realignment for Semantic Engagement), a two-stage approach that adapts pretrained BioFMs as modality encoders and aligns them with LLMs through lightweight, modality-specific projection layers. The approach first aligns each modality to a shared LLM space through independently trained projections, allowing them to interoperate naturally, and then applies standard instruction tuning with multi-modal data to bring them together for downstream reasoning. By unifying raw biomedical data with knowledge embedded in LLMs, the approach enables zero-shot annotation, cross-modal question answering, and interactive, explainable dialogue. Across tasks spanning cell-type annotation, molecular description, and protein function reasoning, compact BioVERSE configurations surpass larger LLM baselines while enabling richer, generative outputs than existing BioFMs, establishing a foundation for principled multi-modal biomedical reasoning.
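
The encoder–projector–LLM flow summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the single linear projection, the embedding widths `D_BIO` and `D_LLM`, the placeholder id `BIO_ID`, and the helper names `make_projection`, `project`, and `splice_bio_tokens` are all assumptions chosen for illustration. The sketch only shows the general idea of mapping a BioFM embedding into the LLM embedding width and substituting it at a special-token position.

```python
import random

# Hypothetical sizes and ids, chosen only for illustration.
D_BIO, D_LLM = 4, 3   # BioFM embedding width, LLM embedding width
BIO_ID = 4            # id of the [BIO] placeholder token in a toy vocabulary


def make_projection(d_in, d_out, seed=0):
    """Create a small random linear layer (assumed form of the projector)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(embedding, weight, bias):
    """Map one BioFM embedding (length d_in) to the LLM width (length d_out)."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]


def splice_bio_tokens(token_ids, token_table, bio_vectors):
    """Look up ordinary tokens; replace each [BIO] id with a projected vector."""
    vectors = iter(bio_vectors)
    return [next(vectors) if tid == BIO_ID else token_table[tid]
            for tid in token_ids]


# Toy LLM embedding table with 5 entries (ids 0..4).
rng = random.Random(1)
token_table = [[rng.gauss(0.0, 1.0) for _ in range(D_LLM)] for _ in range(5)]

# Stand-in for the output of a frozen BioFM encoder.
cell_embedding = [0.5, -1.0, 0.25, 2.0]

weight, bias = make_projection(D_BIO, D_LLM)
bio_vector = project(cell_embedding, weight, bias)

# Prompt of the form: <text token> [BIO] <text token>
prompt_ids = [0, BIO_ID, 2]
llm_inputs = splice_bio_tokens(prompt_ids, token_table, [bio_vector])
```

In this sketch, `llm_inputs` is the sequence of vectors an LLM would consume in place of its usual token-embedding lookup; only `weight` and `bias` would be updated during an alignment stage of this kind.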

Paper Structure

This paper contains 37 sections, 5 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: BioVERSE base architecture. A modality-specific BioFM encodes a biological entity, and a projection layer maps its output embeddings into the LLM's embedding space, where they are inserted via special tokens (e.g., [BIO]). In the alignment stage, only the projection layer $P_\theta$ is trainable, while the encoder $f_b$ and the LLM $g$ remain frozen. In the subsequent instruction-tuning stage, both $P_\theta$ and the low-rank adapter (LoRA) within the LLM are trainable. Stage 1 (S1) can be trained with an autoregressive (AR) or contrastive (CT) loss, while stage 2 (S2) always uses the AR loss.
  • Figure 2: UMAP visualization of scRNA-seq and text embeddings. Left: before alignment, cell embeddings (green) form isolated clusters within the LLM embedding space. Right: after alignment, cell embeddings are pulled closer to biologically relevant text and separated from unrelated general-domain text. BioVERSE successfully realigns the modalities into a shared representation space.
  • Figure 3: Example generative annotation on PBMC10K: BioVERSE produces the label and reasoning grounded in gene evidence.
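
The Figure 1 caption notes that stage 1 may use a contrastive (CT) loss, but this summary does not give its exact form. As a hedged illustration, the sketch below uses a symmetric InfoNCE-style objective over paired projected biological vectors and text vectors. The function names, the cosine-similarity scoring, and the temperature of 0.07 are assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mean_diagonal_cross_entropy(logits):
    """Average cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(bio_vecs, text_vecs, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i of bio_vecs matches pair i of text_vecs."""
    bio = [l2_normalize(v) for v in bio_vecs]
    txt = [l2_normalize(v) for v in text_vecs]
    # Cosine similarity between every bio vector and every text vector.
    logits = [[sum(a * b for a, b in zip(bv, tv)) / temperature for tv in txt]
              for bv in bio]
    transposed = [list(col) for col in zip(*logits)]
    bio_to_text = mean_diagonal_cross_entropy(logits)
    text_to_bio = mean_diagonal_cross_entropy(transposed)
    return 0.5 * (bio_to_text + text_to_bio)


# Toy batch: two projected cell vectors and two text vectors.
bio_batch = [[1.0, 0.0], [0.0, 1.0]]
text_batch = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = contrastive_alignment_loss(bio_batch, text_batch)
mismatched_loss = contrastive_alignment_loss(bio_batch, text_batch[::-1])
```

With correctly paired vectors the loss is close to zero, while swapping the text vectors makes it much larger. Minimizing a loss of this kind with respect to the projection parameters alone is one way to obtain the "pulled closer to biologically relevant text" behavior described for Figure 2.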