R-GAT: Cancer Document Classification Leveraging Graph-Based Residual Network for Scenarios with Limited Data

Elias Hossain, Tasfia Nuzhat, Shamsul Masum, Shahram Rahimi, Noorbakhsh Amiri Golilarz

TL;DR

This work tackles cancer-abstract classification under data and compute constraints. It introduces R-GAT, a residual graph attention network that leverages multi-head graph attention to model semantic and relational dependencies without relying on large-scale pretraining. Through systematic benchmarking against traditional ML, deep learning, and transformer baselines, the study demonstrates that R-GAT achieves competitive accuracy with reduced variance and computational cost, while also providing a publicly released dataset of 1,875 PubMed abstracts to support reproducibility. The results suggest that lightweight graph-based architectures can be robust and practical alternatives to transformers in data-limited biomedical NLP settings, broadening the methodological toolkit for cancer informatics.

Abstract

Accurate classification of cancer-related biomedical abstracts is critical for advancing cancer informatics and supporting decision-making in healthcare research. Yet progress in this domain is often constrained by limited availability of labeled corpora and the high computational demands of transformer-based approaches. To address these challenges, we propose a Residual Graph Attention Network (R-GAT) that integrates multi-head attention with residual connections to capture semantic and relational dependencies in biomedical texts. Evaluated on a curated dataset of 1,875 PubMed abstracts spanning thyroid, colon, lung, and generic cancer topics, R-GAT achieves stable and competitive performance, comparable to transformer-based models such as BioBERT and BioClinicalBERT and strong classical baselines like Logistic Regression, while requiring significantly fewer computational resources. Ablation studies confirm the importance of attention and residual connections in ensuring robustness under limited-data conditions. To support reproducibility and facilitate future research, we also release the curated dataset. Together, these contributions demonstrate the value of lightweight graph-based architectures as reliable and resource-efficient alternatives to computationally intensive transformers in biomedical NLP.

Paper Structure

This paper contains 32 sections, 9 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: End-to-end methodology for cancer abstract classification using the proposed R-GAT. The workflow is divided into four major phases: (1) Data Collection: Abstracts are retrieved from PubMed and curated into a medical document corpus. (2) Text Preprocessing: Cleaning operations include spelling correction, tokenization, and lemmatization, producing a high-quality dataset suitable for model training. (3) Graph Construction and R-GAT Model Architecture: Abstracts are represented as graphs, where nodes correspond to document features and edges capture relational dependencies. The adjacency matrix and feature vectors form the graph representation. This representation is processed through stacked Graph Attention (GAT) layers with non-linear activations, followed by a Residual Block consisting of three GAT layers and skip connections. The residual design mitigates information loss and stabilizes training. (4) Classification: Features are aggregated via a Global Average Pooling layer and passed through a fully connected layer with a Softmax decoder to predict four target categories: thyroid cancer, colon cancer, lung cancer, and generic biomedical abstracts.
  • Figure 2: Performance visualization of the proposed R-GAT for multi-cancer abstract classification. (a) Confusion matrix showing the distribution of predictions across the four cancer classes: Colon Cancer, Lung Cancer, Thyroid Cancer, and Generic. Values on the diagonal represent correct classifications, with R-GAT achieving high accuracy across all categories ($\geq$0.94), indicating balanced performance and minimal class-specific bias. Off-diagonal values reflect misclassifications, which remain relatively rare. (b) Training and validation loss curves plotted over 50 epochs for each fold of stratified 5-fold cross-validation. The consistently smooth convergence across all folds demonstrates stable learning behavior and low variance, reinforcing the robustness of the R-GAT model under limited-data conditions.
  • Figure 3: Cross-validation robustness analysis using F1-scores with 95% confidence intervals (error bars) for R-GAT, its ablated variants (GAT without residuals and GCN without attention and residuals), and baseline models (Logistic Regression and BioBERT). R-GAT achieves a macro-F1 of approximately 0.96 with the narrowest confidence intervals, indicating strong stability and consistent generalization across folds. In contrast, GCN shows wider intervals and lower mean performance, reflecting higher sensitivity to data partitioning. Logistic Regression and BioBERT achieve slightly higher absolute scores, but with greater computational demands (BioBERT) or dependence on specific feature representations (LogReg). These results emphasize that R-GAT balances robustness and efficiency, making it particularly suitable for limited-data biomedical classification scenarios.
  • Figure 4: Example cancer abstracts classified by the R-GAT model: (a) Thyroid Cancer: an abstract on the telomere-telomerase complex in sporadic and familial thyroid cancer cases, emphasizing telomere shortening and telomerase activation; (b) Lung Cancer: an abstract detailing the effectiveness of nitrosoureas and other agents in treating various types of lung cancer, including oat cell carcinoma and adenocarcinoma. Both abstracts were correctly classified by the R-GAT model.
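
The pipeline described in the Figure 1 caption (GAT layer, residual block of three GAT layers with a skip connection, global average pooling, fully connected layer with softmax) can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, head count, ELU activation, head averaging, a single skip connection around the whole three-layer block, and all function and parameter names (`gat_layer`, `rgat_forward`, `init_gat`) are hypothetical choices made only for the example. The toy graph and weights are random, so the output probabilities carry no meaning.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gat_layer(h, adj, W, a_src, a_dst):
    """Multi-head graph attention layer with heads averaged (assumed design).

    h: (N, F_in) node features; adj: (N, N) 0/1 adjacency including self-loops;
    W: (K, F_in, F_out) per-head weights; a_src, a_dst: (K, F_out) attention vectors.
    """
    z = np.einsum("nf,kfo->kno", h, W)            # projected features, (K, N, F_out)
    s_src = np.einsum("kno,ko->kn", z, a_src)     # score contribution of node i
    s_dst = np.einsum("kno,ko->kn", z, a_dst)     # score contribution of neighbor j
    e = leaky_relu(s_src[:, :, None] + s_dst[:, None, :])  # e[k, i, j]
    e = np.where(adj[None] > 0, e, -1e9)          # mask out non-edges
    alpha = softmax(e, axis=-1)                   # normalize over neighbors j
    out = np.einsum("kij,kjo->kio", alpha, z)     # attention-weighted aggregation
    return out.mean(axis=0)                       # average heads -> (N, F_out)

def init_gat(rng, f_in, f_out, heads):
    # Hypothetical random initialization, for illustration only.
    return (rng.normal(0.0, 1.0 / np.sqrt(f_in), (heads, f_in, f_out)),
            rng.normal(0.0, 0.1, (heads, f_out)),
            rng.normal(0.0, 0.1, (heads, f_out)))

def rgat_forward(x, adj, params):
    h = elu(gat_layer(x, adj, *params["input"]))  # input GAT layer + non-linearity
    skip = h                                      # saved for the skip connection
    for p in params["residual"]:                  # residual block: three GAT layers
        h = elu(gat_layer(h, adj, *p))
    h = h + skip                                  # assumed: one skip around the block
    pooled = h.mean(axis=0)                       # global average pooling over nodes
    logits = pooled @ params["W_out"] + params["b_out"]
    return softmax(logits)                        # probabilities over the 4 classes

# Toy usage: 6 nodes, 8 input features, hidden size 16, 4 heads, 4 classes.
rng = np.random.default_rng(0)
N, F, H, K, C = 6, 8, 16, 4, 4
adj = (rng.random((N, N)) < 0.4).astype(float)
adj = np.maximum(adj, adj.T)                      # make the graph undirected
np.fill_diagonal(adj, 1.0)                        # add self-loops
x = rng.normal(size=(N, F))
params = {
    "input": init_gat(rng, F, H, K),
    "residual": [init_gat(rng, H, H, K) for _ in range(3)],
    "W_out": rng.normal(0.0, 0.1, (H, C)),
    "b_out": np.zeros(C),
}
probs = rgat_forward(x, adj, params)
print(probs.shape, probs.sum())
```

Keeping the hidden width constant inside the residual block lets the skip connection be a plain element-wise sum; a block that changes width would need an extra projection on the skip path.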