BERTology of Molecular Property Prediction

Mohammad Mostafanejad; Paul Saxe; T. Daniel Crawford

Abstract

Chemical language models (CLMs) have emerged as promising competitors to popular classical machine learning models for molecular property prediction (MPP) tasks. However, an increasing number of studies have reported inconsistent and contradictory results for the performance of CLMs across various MPP benchmark tasks. In this study, we conduct and analyze hundreds of meticulously controlled experiments to systematically investigate the effects of various factors, such as dataset size, model size, and standardization, on the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms affecting the performance of CLMs for MPP tasks, some of which appear to be entirely overlooked in the literature.

Paper Structure (17 sections, 12 equations, 5 figures, 1 table)

Results
Discussion
Methods
Data Availability
Code Availability

Figures (5)

Figure 1: The effect of standardization noise on the pre-training validation weighted-F1 score (a,c,e) and pseudo-perplexity (b,d,f) of BERT for masked language modeling. Both metrics are averaged over three independent runs with different data sampling and model initialization seeds. $\tau$ and $\nu$ control the standardization noise in the pre-training and validation splits, respectively. The error bars show 95% confidence intervals, which are undefined (NaN) when only $N=1$ run remains because the other two diverged and were removed before calculating the statistical summaries.
Figure 2: Variations of BERT's pre-training performance metrics: (a) validation loss (V-Loss), (b) accuracy (V-Acc), (c) weighted-F1 score (V-wF1), and (d) pseudo-perplexity (V-PPPL) versus the dataset size bin index, $k$ (the corresponding data percentages are shown on the top axes). Results pertinent to the Tiny-, Small- and Base-BERT are shown in blue circles, red squares and green triangles, respectively. Shaded bands are used to indicate the 95% confidence intervals across three independent runs with different model initialization and data sampling random seeds.
Figure 3: Variations of the fine-tuned BERT's testing performance metrics, (a) Pearson $R$, (b) $R^2$, (c) RMSE, (d) MAE, versus the pre-training dataset size bin index, $k$, for the HLM endpoint (the corresponding data percentages are shown on the top axes).
Figure 4: Variations of the fine-tuned BERT's testing performance metrics, (a) Pearson $R$, (b) $R^2$, (c) RMSE, (d) MAE, versus the pre-training dataset size bin index, $k$, for the hPPB endpoint (the corresponding data percentages are shown on the top axes).
Figure 5: Variations of the fine-tuned BERT's testing performance metrics, (a) Pearson $R$, (b) $R^2$, (c) RMSE, (d) MAE, versus the pre-training dataset size bin index, $k$, for the solubility endpoint (the corresponding data percentages are shown on the top axes).
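
The captions report metrics averaged over three independent runs with 95% confidence intervals, which become undefined when only one run survives. A minimal sketch of such a summary follows. It assumes a Student's t interval on the mean, which the captions do not specify. The function name `mean_with_ci95` and the convention of marking diverged runs as NaN are hypothetical choices made for illustration.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, indexed by degrees of freedom.
# Only df = 1 and df = 2 are needed for at most three runs (assumption).
T_CRIT_95 = {1: 12.706, 2: 4.303}


def mean_with_ci95(values):
    """Return (mean, half_width) over the runs that did not diverge.

    Diverged runs are assumed to be marked as NaN or None and are dropped.
    The half-width is NaN when fewer than two runs remain, because the
    sample standard deviation is undefined for a single value.
    """
    kept = [v for v in values if v is not None and math.isfinite(v)]
    if not kept:
        return math.nan, math.nan
    n = len(kept)
    if n > 3:
        raise ValueError("this sketch only tabulates t values for up to three runs")
    mean = statistics.fmean(kept)
    if n < 2:
        return mean, math.nan
    sem = statistics.stdev(kept) / math.sqrt(n)  # standard error of the mean
    return mean, T_CRIT_95[n - 1] * sem


# Three surviving runs: a finite interval.
print(mean_with_ci95([0.91, 0.93, 0.92]))
# Two runs diverged: the mean is the single value, the interval is NaN.
print(mean_with_ci95([0.91, math.nan, math.nan]))
```

With only two degrees of freedom the t critical value is large, so intervals from three runs are wide. That is worth keeping in mind when comparing shaded bands or error bars across panels.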
