The Relevance Feature and Vector Machine for health applications

Albert Belenguer-Llorens, Carlos Sevilla-Salcedo, Emilio Parrado-Hernández, Vanessa Gómez-Verdejo

TL;DR

RFVM introduces a Bayesian framework that simultaneously selects informative features and relevant observations in fat-data health datasets by enforcing two-way sparsity through ARD-style folded-normal priors and variational inference. The model is optimized jointly over the primal space (features) and the dual space (observations) and prunes irrelevant features and samples during training, enabling adaptive data acquisition for prospective clinical studies. It delivers competitive classification performance while yielding markedly smaller sets of features and relevance vectors, with interpretability validated in cancer gene-expression data and ALLAML biomarker analysis. Computationally, RFVM achieves sublinear scaling in feature dimensionality, making it suitable for large-scale high-dimensional medical data. Overall, RFVM offers a principled, scalable approach to compactly characterizing diseases and guiding efficient cohort recruitment in fat-data health applications.

Abstract

This paper presents the Relevance Feature and Vector Machine (RFVM), a novel model that addresses the challenges of the fat-data problem in prospective clinical studies. The fat-data problem refers to the limitations of Machine Learning (ML) algorithms when working with databases in which the number of features is much larger than the number of samples (a common scenario in certain medical fields). To overcome these limitations, the RFVM combines three characteristics: (1) a Bayesian formulation, which enables the model to infer its parameters without overfitting thanks to Bayesian model averaging; (2) a joint optimization that overcomes the limitations arising from the fat-data characteristic by simultaneously including the variables that define the primal space (features) and those that define the dual space (observations); and (3) an integrated pruning that removes irrelevant features and samples during the iterative training optimization. This last point is crucial in prospective medical studies: it enables researchers to exclude unnecessary medical tests, reducing costs and inconvenience for patients, and to identify the critical patients/subjects that characterize the disorder, which in turn optimizes the patient recruitment process and leads to a balanced cohort. The model's capabilities are tested against state-of-the-art models on several medical datasets with fat-data problems. These experiments show that RFVM achieves competitive classification accuracies while providing the most compact subset of data (in terms of both features and samples). Moreover, the selected features (medical tests) appear to be aligned with the existing medical literature.
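The two-way pruning idea in point (3) can be illustrated with a minimal sketch. It is not the paper's model or inference procedure: the decision function, the weight vectors `a` (per sample) and `v` (per feature), the threshold `tol`, and the helper names are all hypothetical, and the weights are set by hand rather than learned. The sketch only shows that once sample and feature weights are driven to zero, the corresponding rows and columns can be dropped without changing the predictions.

```python
import numpy as np

def decision_values(X_test, X_train, a, v):
    """Hypothetical score f(x) = sum_n a_n * sum_d v_d * x_d * x_{n,d}.

    a: per-sample (dual) weights, shape (N,)
    v: per-feature (primal) weights, shape (D,)
    """
    return (X_test * v) @ X_train.T @ a

def prune_two_way(X_train, a, v, tol=1e-6):
    """Drop samples and features whose weights are (numerically) zero."""
    keep_samp = np.abs(a) > tol
    keep_feat = np.abs(v) > tol
    X_small = X_train[np.ix_(keep_samp, keep_feat)]
    return X_small, a[keep_samp], v[keep_feat], keep_samp, keep_feat

# Synthetic fat data: far more features than samples (illustrative only).
rng = np.random.default_rng(0)
N, D = 20, 500
X_train = rng.normal(size=(N, D))
X_test = rng.normal(size=(5, D))

# Hand-set sparse weights standing in for what a sparse model might learn.
a = np.zeros(N)
a[[1, 7, 12]] = [0.8, 1.1, 0.5]
v = np.zeros(D)
v[[3, 42, 100, 250]] = [1.0, 0.6, 0.9, 0.4]

X_small, a_small, v_small, keep_samp, keep_feat = prune_two_way(X_train, a, v)

full = decision_values(X_test, X_train, a, v)
compact = decision_values(X_test[:, keep_feat], X_small, a_small, v_small)

print(X_train.shape, "->", X_small.shape)
print("predictions unchanged:", np.allclose(full, compact))
```

In this toy setting only the retained columns of `X_test` are needed at prediction time, which mirrors the abstract's point that pruned features correspond to medical tests that no longer have to be acquired.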

Paper Structure (18 sections, 133 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Diagram of the graphical model of RFVM for classification tasks. Grey circles denote observed variables, and white circles denote unobserved random variables. Nodes without a circle correspond to hyperparameters. The top-left plate, which factorizes over $D$, represents the feature selection (FS) capability, while the top-right plate, which factorizes over $\tilde{N}$, represents the relevance vector selection (RVS). The central plate factorizes over $N$, which allows the model to treat samples as independent.
  • Figure 2: Computational cost of RFVM as a function of the number of input features. The dotted lines represent reference complexities, and the solid blue line represents the average computational time of the proposed model. The shaded region around the curve represents its standard deviation.
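
A comparison like the one in Figure 2 is often summarized by an empirical scaling exponent: if the runtime behaves like $t \approx c\,D^{k}$, then $k$ is the slope of $\log t$ against $\log D$. The sketch below estimates that slope with a least-squares fit. The timings are synthetic placeholders generated from a known power law, not measurements from the paper, and `scaling_exponent` is a hypothetical helper name.

```python
import numpy as np

def scaling_exponent(n_features, seconds):
    """Least-squares slope of log(time) vs. log(number of features)."""
    slope, _intercept = np.polyfit(np.log(n_features), np.log(seconds), deg=1)
    return slope

# Synthetic placeholder timings following t = 0.05 * D**0.5 (NOT real data).
n_features = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])
seconds = 0.05 * n_features ** 0.5

k = scaling_exponent(n_features, seconds)
print(f"estimated exponent k = {k:.3f}")
```

An exponent below 1 corresponds to sublinear growth in the number of features, 1 to linear growth, and 2 to quadratic growth, which are the kinds of reference curves the dotted lines in such a plot typically show.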