Prediction of Lung Metastasis from Hepatocellular Carcinoma using the SEER Database

Jeff J. H. Kim, George R. Nahass, Yang Dai, Theja Tulabandhula

TL;DR

This study predicts lung metastasis in hepatocellular carcinoma (HCC) from SEER data using an end-to-end ML pipeline that evaluates four classifiers: XGBoost, logistic regression, random forest, and a multilayer perceptron (MLP). It also applies SMOTE oversampling and a recall-focused loss to boost sensitivity. The best-performing models achieve AUROCs of about 0.82, and an ensemble approach improves recall at the expense of precision, a consequence of class imbalance; feature importance analysis highlights surgery status, tumor staging, and follow-up duration as key predictors. The work supports risk-stratified surveillance for high-risk HCC patients while acknowledging its limitations: imbalanced data, incomplete annotations, and the need for better data imputation and higher-quality labeled data. Future directions include expanding to larger, more diverse SEER cohorts and integrating pre-trained transformer-based architectures to improve predictive accuracy and clinical utility.
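To make the idea of a recall-focused loss concrete, here is a minimal sketch under stated assumptions. The summary does not give the paper's actual loss, so the form used here (binary cross-entropy plus a penalty on a differentiable "soft" recall), the weight `lam`, and all function names are hypothetical illustrations rather than the authors' method.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy over labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def soft_recall(y_true, p_pred, eps=1e-7):
    """Recall with hard predictions replaced by probabilities:
    expected true positives divided by the number of actual positives."""
    expected_tp = sum(y * p for y, p in zip(y_true, p_pred))
    return expected_tp / (sum(y_true) + eps)

def recall_weighted_loss(y_true, p_pred, lam=1.0):
    """Hypothetical loss: BCE plus lam * (1 - soft recall).
    A larger lam penalizes missed positives more heavily."""
    return binary_cross_entropy(y_true, p_pred) + lam * (1.0 - soft_recall(y_true, p_pred))

# Toy data (invented): one positive case among four patients.
y = [1, 0, 0, 0]
p_miss = [0.1, 0.1, 0.1, 0.1]  # the positive case gets a low score
p_hit = [0.9, 0.1, 0.1, 0.1]   # the positive case gets a high score

loss_miss = recall_weighted_loss(y, p_miss)
loss_hit = recall_weighted_loss(y, p_hit)
```

In this toy setup, missing the single positive case costs more than detecting it, and setting `lam=0` recovers plain cross-entropy. In a real training loop the same expression would be written in an autodiff framework so that gradients flow through the soft recall term.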

Abstract

Hepatocellular carcinoma (HCC) is a leading cause of cancer-related mortality, and the lung is the most common site of distant spread, with lung metastases significantly worsening prognosis. Despite the growing availability of clinical and demographic data, predictive models for lung metastasis in HCC remain limited in scope and clinical applicability. In this study, we develop and validate an end-to-end machine learning pipeline using data from the Surveillance, Epidemiology, and End Results (SEER) database. We evaluated three machine learning models (Random Forest, XGBoost, and Logistic Regression) alongside a multilayer perceptron (MLP) neural network. Our models achieved high AUROC values and recall, with the Random Forest and MLP models demonstrating the best overall performance (AUROC = 0.82). However, the low precision across models highlights the challenges of accurately predicting positive cases. To address these limitations, we developed a custom loss function incorporating recall optimization, enabling the MLP model to achieve the highest sensitivity. An ensemble approach further improved overall recall by leveraging the strengths of individual models. Feature importance analysis revealed key predictors such as surgery status, tumor staging, and follow-up duration, emphasizing the relevance of clinical interventions and disease progression in metastasis prediction. While this study demonstrates the potential of machine learning for identifying high-risk patients, limitations include reliance on imbalanced datasets, incomplete feature annotations, and the low precision of predictions. Future work should leverage the expanding SEER dataset, improve data imputation techniques, and explore advanced pre-trained models to enhance predictive accuracy and clinical utility.
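The recall/precision trade-off of an ensemble can be illustrated with a simple "flag the patient if any model flags them" rule. This is a sketch only: the abstract does not say how the ensemble combines its models, so the union rule, the function names, and the toy labels and predictions below are all assumptions made for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def union_ensemble(*model_preds):
    """Predict positive whenever at least one model predicts positive."""
    return [int(any(votes)) for votes in zip(*model_preds)]

# Toy data (invented): 3 positive and 7 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
model_b = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]
model_c = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

ensemble = union_ensemble(model_a, model_b, model_c)
single_precision, single_recall = precision_recall(y_true, model_a)
ens_precision, ens_recall = precision_recall(y_true, ensemble)
```

Each toy model misses a different positive case, so the union catches all three, but it also accumulates every model's false positives. A union rule can never have lower recall than any of its members, which is why it suits a screening setting where missed metastases are the costlier error.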

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Flowchart illustrating the inclusion and exclusion criteria for the patient population studied, as extracted from the Surveillance, Epidemiology, and End Results (SEER) database.
  • Figure 2: Correlation heatmap displaying the relationships between clinical and demographic variables in the SEER dataset. The color scale represents the strength and direction of the correlation, with red indicating positive correlations and blue indicating negative correlations. The intensity of the color corresponds to the magnitude of the correlation coefficient, with values shown within the cells.
  • Figure 3: Receiver Operating Characteristic (ROC) curves for machine learning models predicting metastasis status. The curves illustrate the trade-off between true positive rate (sensitivity) and false positive rate for XGBoost, random forest, and logistic regression models.
  • Figure 4: Confusion matrix for logistic regression, XGBoost, random forest, and MLP models on metastasis classification.
  • Figure 5: Top 20 feature importance scores from the XGBoost model for predicting metastasis status. The most significant features include "Surgery Performed," "Follow-up Duration," and "Pretreatment AFP Normal," which have the highest contributions to model predictions.
  • ...and 2 more figures
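
The area under the ROC curves described in Figure 3 (AUROC) has a direct probabilistic reading: it is the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by brute-force pair counting; the function name and the toy scores are invented for illustration and are not taken from the paper.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.
    Tied scores count as half a correct pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (invented): two positive and three negative cases.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auroc = auroc(y_true, scores)
```

Because AUROC depends only on how cases are ranked, it is insensitive to class imbalance in a way that precision is not, which is consistent with a model reporting a good AUROC alongside low precision on a rare outcome.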