Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular Data

Nicole Hayes, Ekaterina Merkurjev, Guo-Wei Wei

TL;DR

This paper proposes the BTDT-MBO algorithm, which combines Merriman-Bence-Osher (MBO) techniques, a bidirectional transformer, distance correlation, and decision threshold adjustments to classify highly imbalanced molecular data sets, where the sizes of the classes vary greatly.

Abstract

Data sets with imbalanced class sizes, where one class size is much smaller than that of others, occur exceedingly often in many applications, including those with biological foundations, such as disease diagnosis and drug discovery. Therefore, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to do so can result in heavy costs. Nonetheless, many data classification procedures do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this work, we propose the BTDT-MBO algorithm, incorporating Merriman-Bence-Osher (MBO) approaches and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification tasks on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed technique not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer procedure based on an attention mechanism for self-supervised learning. In addition, the model implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed method is validated using six molecular data sets and compared to other related techniques. The computational experiments show that the proposed technique is superior to competing approaches even in the case of a high class imbalance ratio.
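
As a rough illustration of the distance correlation weight function mentioned in the abstract, the following Python sketch computes the sample distance correlation between two feature vectors and assembles a symmetric weight matrix for a similarity graph. It is not the authors' implementation. The function names are hypothetical, and two choices are assumptions made here rather than details taken from the paper: the entries of each feature vector are treated as paired scalar observations, and the diagonal of the weight matrix is set to zero (no self-loops).

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise |v_k - v_l| matrix with row, column, and grand means removed."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    d = np.abs(v - v.T)
    row_mean = d.mean(axis=1, keepdims=True)
    col_mean = d.mean(axis=0, keepdims=True)
    return d - row_mean - col_mean + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D vectors, in [0, 1]."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2_xy = (a * b).mean()  # squared sample distance covariance
    dvar2_x = (a * a).mean()   # squared sample distance variance of x
    dvar2_y = (b * b).mean()   # squared sample distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        # A constant vector has zero distance variance; define the result as 0.
        return 0.0
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


def distance_correlation_weights(features):
    """Symmetric graph weight matrix; zero diagonal is an assumption of this sketch."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = w[j, i] = distance_correlation(features[i], features[j])
    return w
```

Distance correlation equals 1 for vectors related by a nonconstant affine map and falls toward 0 as the dependence weakens, so larger weights connect more strongly related feature vectors.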

Paper Structure

This paper contains 14 sections, 7 equations, 7 figures, 2 tables, 2 algorithms.

Figures (7)

  • Figure 1: Flowchart illustrating the BTDT-MBO method.
  • Figure 2: Example ROC curves for two of the data sets (CHEMBL1909134 and CHEMBL1909150) used for benchmarking the proposed method. Curves were constructed in MATLAB for a random labeled/unlabeled split of each data set using a set of increasing thresholds from 0 to 1 in steps of 0.05. The average ROC-AUC scores for both data sets over all 50 random partitions are shown in Figure 3.
  • Figure 3: Comparison to other techniques on DrugMatrix data sets. The results of our proposed method are in red, while those of other algorithms are in blue. The imbalance ratios for the pictured data sets vary from 16.5 to 20.0. Detailed information about the overall size and composition of the comparison data sets is given in the section describing the data sets. The performance metric is the ROC-AUC score averaged over 50 random training-testing (or labeled/unlabeled) splits of the given data, with 80% of the data being labeled in each case. The BTDT-MBO result for each data set is the higher of the two scores obtained by the BTDT-MBO model with a Gaussian weight function and the BTDT-MBO model with a distance correlation weight function. Some comparison results were generated using the GHOST algorithm [esposito2021] via random forests (RF), extreme gradient boosting (XGB), logistic regression (LR), and gradient boosting (GB). Additional comparison results were generated using the BT-MBO algorithm [hayes2023] as well as BT-GB, BT-RF, and BT-SVM models (consisting of the BT-FPs passed to gradient boosting, random forest, and support vector machine algorithms, respectively).
  • Figure 4: Comparison to other techniques on two DS1 data sets. The results of our proposed method are in red, while those of other algorithms are in blue. The imbalance ratios for the pictured data sets are both 20.0. Detailed information about the overall size and composition of the comparison data sets is given in the section describing the data sets. The metric is the ROC-AUC score averaged over 50 splits of the data, with 80% of the data being labeled in each case. The BTDT-MBO result for each data set is the higher of the two scores obtained by the BTDT-MBO model with a Gaussian weight function and the BTDT-MBO model with a distance correlation weight function. Some comparison results were generated using the GHOST algorithm [esposito2021] via random forests (RF), extreme gradient boosting (XGB), logistic regression (LR), and gradient boosting (GB). Additional comparison results were generated using the BT-MBO algorithm [hayes2023] as well as BT-GB, BT-RF, and BT-SVM models (consisting of the BT-FPs passed to gradient boosting, random forest, and support vector machine algorithms, respectively).
  • Figure 5: R-S score plots for the four DrugMatrix data sets. The plots display R-S scores of unlabeled points from a random labeled/unlabeled partition of each data set. The $x$ and $y$ axes of the plots represent residue and similarity scores, respectively. From left to right, the panels plot points in class 1 (i.e., inactive compounds) and class 2 (i.e., active compounds). Each point is colored based on its class predicted by the proposed BTDT-MBO model.
  • ...and 2 more figures
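
The threshold sweep described in the Figure 2 caption (thresholds from 0 to 1 in steps of 0.05) can be sketched in Python as follows. The original curves were built in MATLAB, so this is only an approximate analogue and not the authors' code. It assumes each unlabeled point has a score in [0, 1] for the positive class and that a point is predicted positive when its score is at least the threshold; the function names and the trapezoidal area computation are choices made for this sketch.

```python
import numpy as np


def roc_points(scores, labels, step=0.05):
    """Sweep decision thresholds from 0 to 1 and return (thresholds, fpr, tpr).

    scores: values in [0, 1] for the positive class (assumed input format).
    labels: 1/True for positive points, 0/False for negative points.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    thresholds = np.linspace(0.0, 1.0, int(round(1.0 / step)) + 1)
    fpr, tpr = [], []
    for t in thresholds:
        predicted = scores >= t  # predict positive at or above the threshold
        tpr.append(np.sum(predicted & labels) / n_pos)
        fpr.append(np.sum(predicted & ~labels) / n_neg)
    return thresholds, np.array(fpr), np.array(tpr)


def roc_auc(fpr, tpr):
    """Trapezoidal area under the swept ROC points, with (0,0) and (1,1) added."""
    # Rates are non-increasing in the threshold, so reverse them to ascend.
    f = np.concatenate(([0.0], fpr[::-1], [1.0]))
    t = np.concatenate(([0.0], tpr[::-1], [1.0]))
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))
```

Because the sweep uses only 21 thresholds, the resulting area is a coarse approximation of the ROC-AUC that a rank-based computation over all distinct scores would give.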