Oxytrees: Model Trees for Bipartite Learning

Pedro Ilídio, Felipe Kenji Nakano, Alireza Gharahighehi, Robbe D'hondt, Ricardo Cerri, Celine Vens

TL;DR

This work addresses inductive bipartite learning, the task of predicting interactions between two distinct object types, with a focus on scalability and generalization. It introduces Oxytrees, proxy-based biclustering model trees that compress the interaction matrix into row and column proxies and use a fast split criterion, a batch inference scheme, and leaf models based on Regularized Least Squares with a Kronecker kernel. Empirical results across 15 datasets show performance competitive with state-of-the-art methods, with up to around 30× faster training, nearly 10× faster inference, and fewer trees needed to reach high accuracy. The approach combines interpretability, efficiency, and strong inductive capability, and is supported by an accessible Python API for reproducible research.

Abstract

Bipartite learning is a machine learning task that aims to predict interactions between pairs of instances. It has been applied to various domains, including drug-target interactions, RNA-disease associations, and regulatory network inference. Despite being widely investigated, current methods still present drawbacks: they are often designed for a specific application and thus do not generalize to other problems, or they present scalability issues. To address these challenges, we propose Oxytrees: proxy-based biclustering model trees. Oxytrees compress the interaction matrix into row- and column-wise proxy matrices, significantly reducing training time without compromising predictive performance. We also propose a new leaf-assignment algorithm that significantly reduces the time taken for prediction. In addition, Oxytrees employ linear models using the Kronecker product kernel in their leaves, resulting in shallower trees and thus even faster training. Using 15 datasets, we compared the predictive performance of ensembles of Oxytrees with that of the current state-of-the-art. We achieved up to a 30-fold improvement in training times compared to state-of-the-art biclustering forests, while demonstrating competitive or superior performance in most evaluation settings, particularly in the inductive setting. Finally, we provide an intuitive Python API to access all datasets, methods and evaluation measures used in this work, thus enabling reproducible research in this field.
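
As an illustration of the kind of leaf model the abstract mentions, the sketch below fits a generic Regularized Least Squares model with a Kronecker product kernel on a small interaction matrix. It is a minimal NumPy sketch under stated assumptions, not the Oxytrees implementation: the function names `fit_kron_rls` and `predict_kron_rls`, the linear kernels, the random data, and the regularization value `alpha` are all hypothetical choices made for the example. It uses the standard eigendecomposition shortcut so that the full Kronecker matrix is never built.

```python
import numpy as np


def fit_kron_rls(K_rows, K_cols, Y, alpha=1.0):
    """Solve (K_cols ⊗ K_rows + alpha * I) vec(A) = vec(Y) for the dual weights A.

    K_rows: (n, n) symmetric kernel matrix between row instances.
    K_cols: (m, m) symmetric kernel matrix between column instances.
    Y:      (n, m) interaction matrix.
    """
    eig_r, U = np.linalg.eigh(K_rows)
    eig_c, V = np.linalg.eigh(K_cols)
    # Rotate the labels into the joint eigenbasis of both kernels.
    Y_rot = U.T @ Y @ V
    # In that basis the system is diagonal: divide element-wise.
    A_rot = Y_rot / (np.outer(eig_r, eig_c) + alpha)
    # Rotate the dual weights back to the original basis.
    return U @ A_rot @ V.T


def predict_kron_rls(A, K_new_rows, K_new_cols):
    """Score new (row, column) pairs from kernels against the training instances."""
    return K_new_rows @ A @ K_new_cols.T


# Hypothetical toy data: 6 row instances, 5 column instances, binary interactions.
rng = np.random.default_rng(0)
X_rows = rng.normal(size=(6, 3))
X_cols = rng.normal(size=(5, 4))
Y = rng.integers(0, 2, size=(6, 5)).astype(float)

# Linear kernels, chosen only to keep the example short.
K_rows = X_rows @ X_rows.T
K_cols = X_cols @ X_cols.T
A = fit_kron_rls(K_rows, K_cols, Y, alpha=1.0)

# Inductive use: both the row and the column instances are unseen.
X_new_rows = rng.normal(size=(2, 3))
X_new_cols = rng.normal(size=(3, 4))
scores = predict_kron_rls(A, X_new_rows @ X_rows.T, X_new_cols @ X_cols.T)
```

Working in the eigenbasis costs two eigendecompositions of sizes n and m, instead of solving one dense system of size n·m. This is the usual reason a Kronecker kernel model stays tractable on an interaction matrix.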

Paper Structure

This paper contains 19 sections, 10 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Diagram of the test (T) and training (i.e. learning, L) sets in the bipartite context (see \ref{sec:bipartite validation}).
  • Figure 2: Comparisons of AUPRC and AUROC of the proposed Oxytrees against previous methods for bipartite learning. The inductive setting is analysed, with a positives masking percent (PMP) of 0% (see \ref{sec:bipartite validation}). Further results in fig. F1.
  • Figure 3: Performance comparison of ensembles of Oxytrees using different components (\ref{tab:ablation}). To compare across datasets, the scores are divided by the score of the main model (Oxytrees). Each point represents a cross-validation fold, and the mean value is presented in the white boxes. Asterisks indicate significance in comparison to Oxytrees ($p<0.05$, Wilcoxon signed-rank), averaging folds for each dataset. Remaining results are presented in fig. F6.
  • Figure 4: Empirical complexity analysis. Single trees were applied to artificial datasets of different dimensions (\ref{sec:empirical complexity}). We used the last 10% of the points of each curve to approximate the asymptotic complexity as $\Theta(n^\alpha)$. $\alpha$ ($\pm$ standard deviation) is estimated as the slope of the linear regression in the log-log space. Further results in fig. F5.
  • Figure 5: Performance of biclustering forests as a function of minimum leaf dimensions, relative to the initial performance with 2 by 2 leaves. Values in the $x$ axis represent the minimum number of instances in each dimension. Markers indicate mean and standard deviation over 15 datasets.
  • ...and 1 more figure
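
The exponent estimate described in the Figure 4 caption can be sketched as follows: take the last 10% of the points of a runtime curve, move to log-log space, and read $\alpha$ off the slope of a straight-line fit. This is a minimal sketch, not the authors' analysis code. The function name `estimate_exponent`, the synthetic timing curve, and the use of `np.polyfit` are assumptions made for the example, and the caption does not say how the reported standard deviation is obtained, so it is left out here.

```python
import numpy as np


def estimate_exponent(sizes, times, tail_fraction=0.1):
    """Estimate alpha in Theta(n^alpha) from the tail of a runtime curve.

    sizes and times must be positive and ordered by increasing size.
    """
    sizes = np.asarray(sizes, dtype=float)
    times = np.asarray(times, dtype=float)
    # Keep the last `tail_fraction` of the points (at least two for a line fit).
    n_tail = max(2, int(np.ceil(tail_fraction * sizes.size)))
    log_n = np.log(sizes[-n_tail:])
    log_t = np.log(times[-n_tail:])
    # Slope of the least-squares line in log-log space approximates alpha.
    slope, _intercept = np.polyfit(log_n, log_t, deg=1)
    return slope


# Hypothetical runtime curve: quadratic growth plus a lower-order term.
sizes = np.logspace(1, 4, num=200)
times = 3.0 * sizes**2 + 50.0 * sizes
alpha = estimate_exponent(sizes, times)
```

Fitting only the tail reduces the influence of lower-order terms, which dominate at small sizes and would otherwise bias the slope.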