Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Jianhua Yao

TL;DR

This work tackles the challenge of aligning LLMs with sparse biomedical knowledge by introducing Balanced Fine-Tuning (BFT), a lightweight post-training method that avoids costly reinforcement learning. BFT adds two adaptive weighting layers: token-level confidence to stabilize gradients and sample-level weighting based on the minimum group confidence to emphasize difficult spans, formalized as L_BFT(θ) = (1/B) ∑_{b=1}^B s_b ( ∑_t m_{b,t} w_{b,t} l_{b,t} ) / ( ∑_t m_{b,t} + ε ), with c_{b,t} and p_b^{conf} guiding the weights. Empirically, BFT improves medical reasoning, reduces forgetting on general-domain benchmarks, and yields biologically meaningful representations, enabling downstream tasks such as gene interaction prediction and single-cell perturbation response forecasting, while outperforming baselines like GeneAgent in biology without external APIs. The results suggest that BFT generalizes beyond domain-specific tasks by embedding domain knowledge into the LLM’s representations through adaptive learning from biomedical data, offering a practical RL-free pathway to integrated biomedical reasoning. Overall, BFT provides a general, scalable framework to augment LLMs with structured biomedical knowledge, with broad implications for biomedical research and AI-assisted life sciences.
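The loss above can be illustrated with a small, framework-free sketch. The summary does not spell out how w_{b,t} and s_b are derived from c_{b,t} and p_b^{conf}, so the choices below are assumptions made only for illustration: the token weight is taken to be the token confidence itself (w_{b,t} = c_{b,t}, the probability assigned to the target token), the minimum group confidence is taken as the lowest mean confidence over any sliding window of supervised tokens, and the sample weight is taken as s_b = 1 - p_b^{conf}. The function names and the `window` parameter are hypothetical.

```python
import math


def min_group_confidence(conf, window):
    """Lowest mean confidence over any sliding window (assumed definition)."""
    if not conf:
        return 1.0  # no supervised tokens: treat the sample as fully confident
    if len(conf) <= window:
        return sum(conf) / len(conf)
    means = [
        sum(conf[i:i + window]) / window
        for i in range(len(conf) - window + 1)
    ]
    return min(means)


def bft_loss(token_probs, mask, window=2, eps=1e-8):
    """Sketch of L_BFT = (1/B) * sum_b s_b * (sum_t m*w*l) / (sum_t m + eps).

    token_probs[b][t]: probability the model assigns to the target token.
    mask[b][t]: 1 for supervised (response) tokens, 0 otherwise.
    """
    total = 0.0
    for probs, m in zip(token_probs, mask):
        # l_{b,t}: per-token negative log-likelihood
        nll = [-math.log(max(p, eps)) for p in probs]
        # w_{b,t}: ASSUMED equal to the token confidence c_{b,t} = p.
        # In an autograd framework this weight would be detached.
        w = list(probs)
        # p_b^{conf}: minimum group confidence over supervised tokens
        conf = [p for p, mi in zip(probs, m) if mi]
        p_conf = min_group_confidence(conf, window)
        # s_b: ASSUMED mapping so that low-confidence samples weigh more
        s = 1.0 - p_conf
        numerator = sum(mi * wi * li for mi, wi, li in zip(m, w, nll))
        denominator = sum(m) + eps
        total += s * numerator / denominator
    return total / len(token_probs)


if __name__ == "__main__":
    easy = [[0.9, 0.9, 0.9, 0.9]]
    hard = [[0.9, 0.1, 0.1, 0.9]]
    full_mask = [[1, 1, 1, 1]]
    print("easy sample loss:", bft_loss(easy, full_mask))
    print("hard sample loss:", bft_loss(hard, full_mask))
```

Under these assumptions, a sample containing a low-confidence span receives a larger sample weight than a uniformly confident one, while the token weight shrinks the contribution of individual low-probability tokens. The actual weighting functions used by BFT may differ.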

Abstract

Effective post-training is essential to align Large Language Models (LLMs) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by sparse textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback infeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from sparse data without external reward signals. BFT operates through a two-layer weighting mechanism: (1) at the token level, it scales the loss by prediction probabilities to stabilize gradients and prevent overfitting; (2) at the sample level, it uses "minimum group confidence" to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT-based LLMs can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of LLMs in biomedical research.

Paper Structure

This paper contains 21 sections, 14 equations, 11 figures, 3 tables, 2 algorithms.

Figures (11)

  • Figure 1: BFT enhances the outputs of the DeepSeek-R1-Distill series (14B, 32B and 70B). a: Theme-wise evaluation in the medical domain. b: Axis-wise evaluation in the medical domain. c: Forgetting evaluation in the general domain: LLMs previously fine-tuned on the OpenAI HealthBench Consensus subset are evaluated on the MMLU benchmark. d: The same forgetting evaluation on the CMMLU benchmark. e: In the biology domain, BFT (blue) and SFT (orange) are used to fine-tune the DeepSeek-R1-Distill series (14B, 32B and 70B), and the resulting LLMs are evaluated on three biological process reasoning benchmarks. ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation) are computed between the generated final pathway names and the ground truths, specifically ROUGE-L (longest common subsequence), ROUGE-1 (1-gram) and ROUGE-2 (2-gram). f: Comparison of the BFT-based DeepSeek-R1-Distill (70B) against the latest baselines for biological process reasoning.
  • Figure 1: Ablation study. a: Test results on different mathematical reasoning datasets. We set two baselines: the red dashed line represents SFT, and the blue dashed line represents reinforcement learning (represented by GRPO). BFT includes three window length settings (BFT-128, BFT-256, and BFT-512). BFT w/o sample denotes removing the sample-level weighting mechanism from BFT (this setting does not require a sliding window). BFT w/o token denotes removing the token-level weighting mechanism from BFT (this setting requires a sliding window). b: Tracking the reasoning performance of BFT (with different window length settings) within 1 training epoch.
  • Figure 2: BFT learns representations with biological meaning. a: UMAP visualization of gene embeddings. From left to right are the gene embeddings of scGPT, the text embeddings of gene descriptions output by OpenAI ChatGPT, and the text embeddings of gene descriptions output by BFT-based DeepSeek-R1-Distill 70B. b: Representation evaluation at the gene level, with the task type being single-gene input. The classifier takes a single gene embedding as input and predicts its biological attributes, such as long-range and short-range transcription factors, dosage-sensitive and dosage-insensitive transcription factors, bivalent and Lys4-only methylated genes, and bivalent and non-methylated genes. c: Representation evaluation at the gene level, with the task type being multi-gene input. The embeddings of two genes or two proteins are concatenated, and the classifier predicts their interaction type. d: Representation evaluation at the cell level. On single-cell data, cell embeddings are obtained by aggregating gene embeddings, and the evaluation includes phenotypes and cell types. e: Comparison of multimodal integration at the cell level, with the goal of integrating the two modalities of RNA and ADT. The three main columns (Bio conservation, Batch correction, and Aggregate score) respectively represent biological heterogeneity, modality mixing degree, and the overall metric. Each main column contains specific sub-metrics. For the first two columns, the color gradient from purple to green indicates scores from low to high. f: Comparison of single-cell perturbation response prediction results, with zero-shot prediction conducted on four perturbation datasets respectively.
  • Figure 2: The training runtime (unit: seconds) and evaluation scores of different methods. BFT includes three window length settings (128, 256, and 512), and the comparison methods include SFT and Focal loss. The training runtime of BFT is close to that of SFT, while its evaluation score is far higher than that of SFT.
  • Figure 3: This case demonstrates how to generate biological training data from an NCBI gene summary. The black text represents the prompt template, the blue text corresponds to the input text following the template (e.g., the gene summary of TP53), and the orange text shows the three GPT-generated training samples in SFT format.
  • ...and 6 more figures