Oxytrees: Model Trees for Bipartite Learning
Pedro Ilídio, Felipe Kenji Nakano, Alireza Gharahighehi, Robbe D'hondt, Ricardo Cerri, Celine Vens
TL;DR
This work addresses inductive bipartite learning, the prediction of interactions between two distinct types of objects, with a focus on scalability and generalization. It introduces Oxytrees, proxy-based biclustering model trees that compress the interaction matrix into row-wise and column-wise proxy matrices and combine a fast split criterion, a batch inference scheme, and leaf models based on Regularized Least Squares with a Kronecker product kernel. Across 15 datasets, ensembles of Oxytrees perform competitively with state-of-the-art methods while training up to about 30× faster and predicting nearly 10× faster, and they need fewer trees to reach high accuracy. The approach combines interpretability, efficiency, and strong inductive capability, and is accompanied by an accessible Python API for reproducible research.
Abstract
Bipartite learning is a machine learning task that aims to predict interactions between pairs of instances. It has been applied to various domains, including drug-target interactions, RNA-disease associations, and regulatory network inference. Despite being widely investigated, current methods still present drawbacks: they are often designed for a specific application and thus either do not generalize to other problems or present scalability issues. To address these challenges, we propose Oxytrees: proxy-based biclustering model trees. Oxytrees compress the interaction matrix into row- and column-wise proxy matrices, significantly reducing training time without compromising predictive performance. We also propose a new leaf-assignment algorithm that significantly reduces prediction time. Moreover, Oxytrees employ linear models using the Kronecker product kernel in their leaves, resulting in shallower trees and thus even faster training. Using 15 datasets, we compared the predictive performance of ensembles of Oxytrees with that of the current state-of-the-art. We achieved up to a 30-fold improvement in training times compared to state-of-the-art biclustering forests, while demonstrating competitive or superior performance in most evaluation settings, particularly in the inductive setting. Finally, we provide an intuitive Python API to access all datasets, methods and evaluation measures used in this work, thus enabling reproducible research in this field.
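To make the leaf-model idea concrete, the following is a minimal NumPy sketch of generic Regularized Least Squares with a Kronecker product kernel. It is not the Oxytrees implementation and does not use the paper's Python API: the function names `kron_rls_fit` and `kron_rls_predict`, the linear kernels, the toy data, and the regularization value are all assumptions made for illustration. The sketch relies on the standard identity that the dual coefficients can be obtained from the eigendecompositions of the row and column kernels, so the full pairwise kernel matrix never has to be built.

```python
import numpy as np

def kron_rls_fit(K_rows, K_cols, Y, reg=1.0):
    """Dual coefficients A (n_rows x n_cols) solving K_rows @ A @ K_cols + reg * A = Y."""
    lam_r, U = np.linalg.eigh(K_rows)   # K_rows = U diag(lam_r) U^T
    lam_c, V = np.linalg.eigh(K_cols)   # K_cols = V diag(lam_c) V^T
    # In the eigenbasis the linear system becomes an element-wise division.
    denom = np.outer(lam_r, lam_c) + reg
    return U @ ((U.T @ Y @ V) / denom) @ V.T

def kron_rls_predict(A, K_rows_new, K_cols_new):
    """Scores for every (new row, new column) pair; inputs are test-vs-train kernels."""
    return K_rows_new @ A @ K_cols_new.T

# Toy data (hypothetical): 6 row objects, 5 column objects, real-valued interactions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # row-object features
Z = rng.normal(size=(5, 4))   # column-object features
Y = rng.normal(size=(6, 5))   # interaction matrix
K_rows, K_cols = X @ X.T, Z @ Z.T   # linear kernels, chosen only for simplicity

A = kron_rls_fit(K_rows, K_cols, Y, reg=1.0)

# Reference solution that builds the full pairwise (Kronecker) kernel explicitly.
K_pairs = np.kron(K_cols, K_rows)   # matches column-major vec(Y)
a = np.linalg.solve(K_pairs + 1.0 * np.eye(Y.size), Y.flatten(order="F"))
A_naive = a.reshape(Y.shape, order="F")

# Inductive prediction: both the row and the column objects are unseen.
X_new = rng.normal(size=(2, 3))
Z_new = rng.normal(size=(3, 4))
Y_pred = kron_rls_predict(A, X_new @ X.T, Z_new @ Z.T)
```

The explicit Kronecker solve is included only as a correctness check. Its system has one row per training pair, which is why the factorized form is the practical choice as the interaction matrix grows.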
