InstructPLM-mu: 1-Hour Fine-Tuning of ESM2 Beats ESM3 in Protein Mutation Predictions

Junde Xu, Yapin Shi, Lijun Lang, Taoyong Cui, Zhiming Zhang, Guangyong Chen, Jiezhong Qiu, Pheng-Ann Heng

TL;DR

This work tackles the challenge of predicting protein mutation effects without prohibitive compute by fine-tuning pretrained sequence PLMs with structural context. It introduces InstructPLM-mu and systematically evaluates three fusion designs, showing that token-wise fusion with efficient adapters enables bidirectional structure integration and yields strong zero-shot mutation predictions. The results demonstrate that, after roughly 1 hour of fine-tuning on an ESM2 backbone, the multimodal approach can match or exceed the performance of end-to-end trained multimodal models such as ESM3, with the best gains achieved through token-wise fusion and parameter-efficient tuning. The findings provide practical guidance on fusion mechanisms and tuning protocols, highlighting a scalable path to incorporate structural information into large-scale protein language models for mutation-effect prediction.

Abstract

Multimodal protein language models deliver strong performance on mutation-effect prediction, but training such models from scratch demands substantial computational resources. In this paper, we propose a fine-tuning framework called InstructPLM-mu and try to answer a question: "Can multimodal fine-tuning of a pretrained, sequence-only protein language model match the performance of models trained end-to-end?" Surprisingly, our experiments show that fine-tuning ESM2 with structural inputs can reach performance comparable to ESM3. To understand how this is achieved, we systematically compare three different feature-fusion designs and fine-tuning recipes. Our results reveal that both the fusion method and the tuning strategy strongly affect final accuracy, indicating that the fine-tuning process is not trivial. We hope this work offers practical guidance for injecting structure into pretrained protein language models and motivates further research on better fusion mechanisms and fine-tuning protocols.

Paper Structure

This paper contains 19 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Protein mutation prediction performance of InstructPLM-mu and ESM3. After 1 hour of fine-tuning, InstructPLM-mu on the 150M ESM2 backbone overtakes ESM3.
  • Figure 2: Comparison of the three multimodal fusion strategies of InstructPLM-mu: Cross Attention (Left), Channel-wise Concat (Middle), and Token-wise Concat (Right).
  • Figure 3: Schematic of the three fine-tuning strategies. Left, Adapter-only: the backbone is frozen and only the adapters that project structural features into the PLM are learned. Middle, LoRA + Adapter: adapters inject structural embeddings while low-rank (LoRA) updates are applied to selected transformer weights. Right, Full Fine-tune: all transformer blocks and adapter modules are updated.
  • Figure 4: Spearman correlations on individual DMS datasets, sorted by ESM2 (650M) performance. Baselines use cross markers; InstructPLM-mu models are shown as colored circles.
  • Figure 5: Spearman correlations on individual DMS datasets, sorted by ESM2 (650M) performance. Baselines use cross markers; InstructPLM-mu models are shown as colored circles.
  • ...and 1 more figure
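
The three fusion strategies named in the Figure 2 caption differ mainly in the axis along which structure is combined with sequence. The NumPy sketch below illustrates this at the level of tensor shapes only. It is not the paper's implementation: the sizes, the single linear adapter, the unparameterized single-head attention, the projection `W_mix`, and the choice to place structure tokens before sequence tokens are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: L residues, PLM hidden width d, structure-feature width d_s.
L, d, d_s = 8, 16, 12

seq = rng.normal(size=(L, d))       # stand-in for PLM token embeddings
struct = rng.normal(size=(L, d_s))  # stand-in for per-residue structure features

# Adapter (assumed here to be one linear map): structure width -> PLM width.
W_adapter = rng.normal(size=(d_s, d)) * 0.1
struct_tok = struct @ W_adapter     # shape (L, d)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(seq, struct_tok):
    """Sequence tokens query the structure tokens (single head, no learned Q/K/V)."""
    scores = seq @ struct_tok.T / np.sqrt(seq.shape[-1])  # (L, L)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    return seq + weights @ struct_tok                     # residual add, (L, d)


def channel_wise_concat(seq, struct, W_mix):
    """Concatenate along the feature axis, then project back to width d."""
    fused = np.concatenate([seq, struct], axis=-1)        # (L, d + d_s)
    return fused @ W_mix                                  # (L, d)


def token_wise_concat(seq, struct_tok):
    """Concatenate along the token axis: the input gets longer, width stays d."""
    return np.concatenate([struct_tok, seq], axis=0)      # (2L, d)


# Assumed projection for the channel-wise variant.
W_mix = rng.normal(size=(d + d_s, d)) * 0.1

out_cross = cross_attention(seq, struct_tok)
out_channel = channel_wise_concat(seq, struct, W_mix)
out_token = token_wise_concat(seq, struct_tok)

print(out_cross.shape, out_channel.shape, out_token.shape)
```

In this sketch the cross-attention and channel-wise variants keep the input at `L` tokens, while the token-wise variant doubles it to `2L`. Because the token-wise output is fed to the backbone as one longer sequence, its self-attention layers can let structure and sequence tokens attend to each other, which is one way to read the "bidirectional structure integration" mentioned in the TL;DR.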