LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library

Tianhao Yu; Cai Yao; Zhuorui Sun; Feng Shi; Lin Zhang; Kangjie Lyu; Xuan Bai; Andong Liu; Xicheng Zhang; Jiali Zou; Wenshou Wang; Chris Lai; Kai Wang

TL;DR

This work addresses the limited availability of ionizable lipid structures for learning lipid representations and predicting LNP properties. It introduces LipidBERT, a BERT-like language model pre-trained on a virtual library of 10 million lipids generated by METiS, using a Masked Language Model (MLM) objective and several secondary tasks, and then fine-tuned on wet-lab LNP data. The model is described as bilingual: it covers both lipid structures (pre-training) and LNP data (fine-tuning). The approach demonstrates that large virtual lipid libraries combined with self-supervised pre-training yield strong downstream performance (a Pearson correlation coefficient, PCC, of about 0.80 or higher on organ fluorescence predictions) and outperform descriptor-based baselines and a GPT-like generator, with additional validation on public datasets such as AGILE. The work showcases a practical, computation-guided pathway for lipid discovery, enabling rapid screening of new lipid candidates for organ-targeted LNPs through an integrated dry-wet lab framework and the AiLNP platform.

Abstract

In this study, we generate and maintain a database of 10 million virtual lipids through METiS's in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance. We propose LipidBERT, a BERT-like model pre-trained with the Masked Language Model (MLM) and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and PhatGPT, our GPT-like lipid generation model, on downstream tasks. The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, using in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of METiS de novo lipid libraries and, more importantly, candidates for in vivo testing for organ-targeting LNPs. To the best of our knowledge, this is the first successful demonstration of the capability of a pre-trained language model on virtual lipids and its effectiveness in downstream tasks using wet-lab data. This work showcases the effective use of METiS's in-house de novo lipid library as well as the power of dry-wet lab integration.
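
As an illustration of the MLM objective mentioned above, the following is a minimal sketch of how masked inputs and labels could be prepared from a lipid SMILES string. It is not the LipidBERT implementation: the regex tokenizer, the 15% masking rate, and the 80/10/10 replacement split (taken from the original BERT recipe) are assumptions, as is the example SMILES, which is a hypothetical amino-ester fragment rather than a lipid from the METiS library.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption; the tokenizer used by LipidBERT
# is not described in this section). Bracket atoms, Br/Cl, and two-digit ring
# closures are kept as single tokens; characters not listed here are dropped.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\(\)/\\@\.]")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a BERT-style masked language model.

    Each position is selected with probability mask_prob. A selected token is
    replaced by "[MASK]" 80% of the time, by a random token 10% of the time,
    and left unchanged 10% of the time. labels holds the original token at
    selected positions and None elsewhere (positions the loss would ignore).
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))  # toy vocabulary drawn from this sequence only
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)
        # otherwise keep the original token in place
    return inputs, labels


# Hypothetical example molecule, for illustration only.
example_tokens = tokenize("CCN(CCO)CC(=O)OCCCC")
example_inputs, example_labels = mask_tokens(example_tokens, mask_prob=0.15, seed=0)
```

In a real pipeline the vocabulary would be built from the whole corpus and the tokens mapped to integer IDs before being passed to the model; the secondary tasks listed in the paper would add their own prediction heads on top of the same encoder.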

Paper Structure (24 sections, 13 figures)

Introduction
Methods & Results
METiS de novo Lipid Library
Lipid Generation Models, METiS de novo Lipid Library and Lipid/LNP Prediction Models
METiS de novo Lipid Library Facilitator
Real-World Lipid Filter
Pre-Training
Masked Language Model (MLM)
Number of Tails Prediction
Connecting Atom Prediction - Sequence Classification
Connecting Atom Prediction - Token Classification
Head/Tail Classification
Rearranged/Decoy SMILES Classification
Fine-Tuning
Fine-Tuning Using Our Wet-Lab Experimental Dataset
...and 9 more sections

Figures (13)

Figure 1: Schematic representation of (a) the METiS de novo lipid library facilitator, and (b) the real-world lipid filter. "Unpublished" refers to models and experimental methods/results that may be published in the future and have not been discussed in detail in this study.
Figure 2: Schematic representation of the Masked Language Model (MLM) and various secondary tasks, including Number of Tails Prediction, Connecting Atom Prediction - Sequence/Token Classification, and Head/Tail Classification.
Figure 3: 2D projections of the 768-dimensional [CLS] embeddings generated via the pre-trained Masked Language Model. Dimensionality reduction was performed using (a) UMAP [mcinnes2018umap], and (b) t-SNE [van2008tsne].
Figure 4: 2D projections of the 768-dimensional [CLS] embeddings generated via the Masked Language Model + Number of Tails model. Dimensionality reduction was performed using UMAP [mcinnes2018umap].
Figure 5: Visualization of the predicted connecting atom between head and tail via sequence classification, using the connecting points prediction head in the pre-trained model. The predicted atoms are explicitly marked. The visualization was created using RDKit [landrum2006rdkit]. We intentionally selected six lipids with low AI-predicted values but high diversity from the sampled set.
...and 8 more figures
