X2Graph for Cancer Subtyping Prediction on Biological Tabular Data

Tu Bui, Mohamed Suliman, Aparajita Haldar, Mohammed Amer, Serban Georgescu

TL;DR

X2Graph introduces a knowledge base (KB)-guided approach to cancer subtyping on small biological tabular datasets. Each table row is converted into a graph whose edges reflect prior knowledge and whose node features encode feature indices and values, so that graph neural networks can be applied while mitigating overfitting in data-scarce settings. A late fusion of multiple KB-based models yields robust predictions across CNV, RNA, and Clinical data, and interpretability analyses link the top-ranked features to known cancer biology. The approach outperforms existing tree-based and deep learning methods on three cancer subtyping datasets and offers a principled way to integrate external biological knowledge into tabular oncology data analyses.

Abstract

Despite the transformative impact of deep learning on text, audio, and image datasets, its dominance in tabular data, especially in the medical domain where data are often scarce, remains less clear. In this paper, we propose X2Graph, a novel deep learning method that achieves strong performance on small biological tabular datasets. X2Graph leverages external knowledge about the relationships between table columns, such as gene interactions, to convert each sample into a graph structure. This transformation enables the application of standard message passing algorithms for graph modeling. Our X2Graph method demonstrates superior performance compared to existing tree-based and deep learning methods across three cancer subtyping datasets.
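The row-to-graph conversion described above can be illustrated with a minimal sketch. It assumes the KB is given as a list of undirected column-name pairs and follows the rules stated in the Figure 1 caption: columns absent from the KB are dropped, zero-valued cells are optionally dropped, and each node carries its feature index and cell value. The function name `row_to_graph`, the `drop_zero` flag, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def row_to_graph(
    row: Dict[str, float],
    kb_edges: List[Edge],
    drop_zero: bool = True,
) -> Tuple[List[str], List[Tuple[int, float]], List[Tuple[int, int]]]:
    """Turn one table row into (node names, node features, edge list).

    Assumption: kb_edges lists undirected relations between column names.
    """
    # Columns that the KB knows about; all others cannot be placed on the graph.
    kb_features = {name for edge in kb_edges for name in edge}
    columns = list(row)  # column order defines the feature index

    # Keep a cell only if its column is in the KB and (optionally) it is non-zero.
    kept = [
        name
        for name in columns
        if name in kb_features and not (drop_zero and row[name] == 0)
    ]
    node_id = {name: i for i, name in enumerate(kept)}

    # Node feature = (feature index in the table, cell value).
    features = [(columns.index(name), row[name]) for name in kept]

    # Keep only KB edges whose endpoints both survived; deduplicate as undirected.
    edges = sorted(
        {
            tuple(sorted((node_id[a], node_id[b])))
            for a, b in kb_edges
            if a in node_id and b in node_id
        }
    )
    return kept, features, edges


# Toy row and toy KB (illustrative values only).
row = {"f1": 0.0, "f2": 1.5, "f3": 0.7, "f4": -0.3, "f5": 2.0}
kb = [("f1", "f2"), ("f2", "f4"), ("f4", "f5"), ("f2", "f5")]

nodes, features, edges = row_to_graph(row, kb)
print(nodes)     # ['f2', 'f4', 'f5']
print(features)  # [(1, 1.5), (3, -0.3), (4, 2.0)]
print(edges)     # [(0, 1), (0, 2), (1, 2)]
```

In this toy case `f3` is dropped because it has no KB entry and `f1` is dropped because its value is zero. The resulting node features and edge list are in a form that a standard message passing library could consume after conversion to tensors.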

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: X2Graph converts each table row into a graph. The cell values become node features, while the edge connections come from the KB. The (.) notation refers to other information that may be incorporated alongside the cell value, for example the feature name or feature index. Note: not all cells in row $x_i$ necessarily appear in the graph. Here $f_3$ and $f_8$ are omitted because they are not present in the KB above, and $f_1$ is dropped under a modeling assumption, e.g. that a value of 0 is not meaningful for the task.
  • Figure 2: Subgraph visualization for three gene KBs. Each subgraph shows 1-hop neighbor connections centered at gene BRCA1.
  • Figure 3: (Top) PR curves and Average Precision (AP) of X2Graph and baselines on the three benchmarks across the 10-fold cross-validation test sets. Note: The curves for RIDGE differ from the rest because RIDGE is a feature selection method that outputs only the predicted class rather than a probability for each class.
  • Figure 4: (a-c) Top k most important features identified by X2Graph models for CNV, RNA and Clinical data. (d) Fractions of these features supported by evidence in the literature.
  • Figure 5: Contributions of each class and KB in multigraph fusion for (a) CNV and (b) RNA, averaged across 10-fold cross-validation.
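
The late fusion of multiple KB-based models mentioned in the TL;DR and in Figure 5 can be sketched as follows. This excerpt does not specify the fusion rule, so the sketch assumes a simple one: each KB-specific model outputs class probabilities, and these are combined with one weight per (KB, class) pair and then renormalized. The function `late_fusion`, the weight layout, and the numbers are hypothetical.

```python
from typing import List, Optional


def late_fusion(
    probs_per_kb: List[List[float]],
    weights: Optional[List[List[float]]] = None,
) -> List[float]:
    """Fuse per-KB class probabilities into a single distribution.

    probs_per_kb[k][c] is the probability of class c from the model built on KB k.
    weights[k][c] is the (assumed) contribution of KB k to class c.
    """
    n_models = len(probs_per_kb)
    n_classes = len(probs_per_kb[0])
    if weights is None:
        # Default assumption: every KB contributes equally to every class.
        weights = [[1.0] * n_classes for _ in range(n_models)]

    # Weighted sum over KBs for each class.
    fused = [
        sum(weights[k][c] * probs_per_kb[k][c] for k in range(n_models))
        for c in range(n_classes)
    ]
    # Renormalize so the fused scores sum to 1.
    total = sum(fused)
    return [score / total for score in fused]


# Three hypothetical KB-specific models, three classes (illustrative values only).
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]
fused = late_fusion(probs)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

With learned rather than uniform weights, the per-class, per-KB values would play the role of the contributions that Figure 5 visualizes, although the paper's actual parameterization may differ.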