Ensemble Model with BERT, RoBERTa and XLNet for Molecular Property Prediction

Junling Hu

TL;DR

The paper tackles molecular property prediction under limited computational resources. It proposes an Atom-in-SMILES (AIS) representation and an ensemble of Transformer models (BERT, RoBERTa, XLNet) trained with supervision from random initialization rather than from large pretrained checkpoints. AIS tokenization provides a richer molecular encoding, which is fed to a BiLSTM base predictor and a BaggingRegressor meta-learner. On ZINC250k and ZINC310k, the approach reports state-of-the-art MAE and R² for properties such as QED, logP, and MolWt, often outperforming strong baselines such as ASVAE, GROVER, CHEM-BERT, and D-MPNN. Ablation studies show that AIS inputs generally improve performance and that BERT-based variants are particularly effective. The results indicate that high-accuracy molecular property prediction is feasible without large-scale pretraining, which makes the method practical for resource-limited groups.
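The two metrics named above, MAE and R², can be written out in a few lines. The sketch below uses their standard textbook definitions; the function names and the sample values are illustrative assumptions and are not taken from the paper or its code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical property values (e.g. logP) and model predictions.
example_true = [1.0, 2.0, 3.0, 4.0]
example_pred = [1.5, 2.0, 2.5, 4.0]
example_mae = mean_absolute_error(example_true, example_pred)
example_r2 = r2_score(example_true, example_pred)
```

Lower MAE and higher R² both indicate a better fit. R² compares the model against a predictor that always outputs the mean target value.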

Abstract

This paper presents a novel approach for predicting molecular properties with high accuracy without the need for extensive pre-training. Employing ensemble learning and supervised fine-tuning of BERT, RoBERTa, and XLNet, our method demonstrates significant effectiveness compared to existing advanced models. Crucially, it addresses the issue of limited computational resources faced by experimental groups, enabling them to accurately predict molecular properties. This innovation provides a cost-effective and resource-efficient solution, potentially advancing further research in the molecular domain.

Paper Structure

This paper contains 12 sections, 3 equations, 7 figures, and 8 tables.

Introduction
Relevant Work
Methodology
Data Set
Data Preprocessing
Ensemble Model
Baseline Methods
Evaluation Metrics
Experiment
Ablation Study
Conclusion
Data Availability

Figures (7)

Figure 1: Distribution histograms of the QED and logP properties in the ZINC250k dataset
Figure 2: Distribution histograms of the QED, logP, and MolWt properties in the ZINC310k dataset
Figure 3: An example illustrating the SMILES representation of the styrene molecule (C(=C)C1=CC=CC=C1) and its step-by-step transformation into AIS
Figure 4: The vocabularies for AIS (top) and SMILES (bottom), built from the ZINC250k and ZINC310k datasets
Figure 5: The structure of the ensemble model
...and 2 more figures
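
The SMILES string in the Figure 3 caption can be split into atom-level tokens, which is the usual starting point before any richer encoding is applied. The sketch below is only that first step: it does not implement the AIS transformation itself, and the regular expression is a simplified atom-level pattern assumed here for illustration, not the paper's tokenizer.

```python
import re

# Simplified atom-level SMILES pattern (an assumption for illustration):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# two-digit ring closures, single-digit ring closures, bonds and branches.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens and check that nothing was dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


# Styrene, as written in the Figure 3 caption.
styrene_tokens = tokenize_smiles("C(=C)C1=CC=CC=C1")
```

Because the function verifies that the tokens rejoin to the input, a string containing symbols outside the pattern raises an error instead of being silently truncated.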
