A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma
Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Fengyu Ye, Lida Chen, Wei Wang, Qinghua Huang
TL;DR
This work addresses the limited sensitivity and reader-dependent variability of ultrasound screening for hepatocellular carcinoma (HCC) by introducing the Hierarchical Sparse Query Transformer (HSQformer), a CNN-ViT hybrid that uses latent-space representations and a sparse Mixture-of-Experts to fuse local and global features without structural redundancy. The model extracts multi-scale features with ConvNeXt and Swin Transformer, projects them into a latent space, and passes them through a four-stage HSQformer backbone with Cross-Self-attention Mixed Experts to achieve hierarchical, sparse feature integration. Across single-center, multi-center, and high-risk patient cohorts, HSQformer outperforms state-of-the-art models (e.g., 95.38% AUC in multi-center testing), matching senior radiologists and surpassing junior radiologists. The approach offers robust generalization and the potential to standardize HCC screening, and its open-source codebase is intended to foster broader adoption and further research in AI-assisted ultrasound diagnostics.
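To make the idea of fusing CNN and ViT features through latent queries concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a small set of latent query vectors attends over the concatenated tokens of both branches via scaled dot-product cross-attention. The learned projections, multi-head structure, and four-stage hierarchy are omitted, and all names (`cross_attention`, `cnn_tokens`, `vit_tokens`, `latent_queries`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, tokens):
    """Each latent query attends over all tokens.

    Simplification (assumption): keys and values are the raw tokens,
    with no learned projection matrices.
    """
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        fused.append([
            sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(q))
        ])
    return fused

# Toy 2-dimensional tokens standing in for the two branches (illustrative values).
cnn_tokens = [[1.0, 0.0], [0.9, 0.1]]   # local features, e.g. from a CNN
vit_tokens = [[0.0, 1.0]]               # global features, e.g. from a ViT
latent_queries = [[1.0, 0.0], [0.0, 1.0]]

# The number of outputs equals the number of latent queries,
# regardless of how many input tokens the two branches produce.
fused = cross_attention(latent_queries, cnn_tokens + vit_tokens)
```

Because the output size is fixed by the number of latent queries rather than by the number of input tokens, a design of this kind can combine branches of different resolutions without duplicating either backbone.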
Abstract
Hepatocellular carcinoma (HCC), the third leading cause of cancer-related mortality worldwide, demands urgent improvements in early detection to enhance patient survival. While ultrasound remains the preferred screening modality due to its cost-effectiveness and real-time capabilities, its sensitivity (59%-78%) relies heavily on radiologists' expertise, leading to inconsistent diagnostic outcomes and operational inefficiencies. Recent advances in AI offer promising solutions to bridge this gap. This study introduces the Hierarchical Sparse Query Transformer (HSQformer), a novel hybrid architecture that combines the local feature extraction of CNNs with the global contextual awareness of Vision Transformers through latent space representation and sparse learning. By dynamically activating task-specific experts via a Mixture-of-Experts (MoE) framework, HSQformer achieves hierarchical feature integration without structural redundancy. Evaluated across three clinical scenarios (single-center, multi-center, and high-risk patient cohorts), HSQformer outperforms state-of-the-art models (e.g., 95.38% AUC in multi-center testing) and matches senior radiologists' diagnostic accuracy while significantly surpassing that of junior radiologists. These results highlight the potential of AI-assisted tools to standardize HCC screening, reduce dependency on human expertise, and improve early diagnosis rates. The full code is available at https://github.com/Asunatan/HSQformer.
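The dynamic expert activation described above can be illustrated with a generic top-k sparse MoE layer. This is a hedged sketch of the general technique, not the HSQformer code: the linear gate, the choice of k = 2, and the toy experts are all assumptions made for illustration, and the names (`top_k_moe`, `gate_weights`, `experts`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_moe(x, gate_weights, experts, k=2):
    """Route x to the k highest-scoring experts and mix their outputs.

    Only the selected experts are evaluated, which is what makes
    the layer sparse.
    """
    logits = [dot(w, x) for w in gate_weights]          # one gate score per expert
    top = sorted(range(len(logits)),
                 key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])           # renormalize over the top-k
    out = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)                               # skipped experts never run
        out = [o + p * v for o, v in zip(out, y)]
    return out, top

# Four toy experts (assumption: simple element-wise functions).
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
    lambda v: [0.0 for _ in v],
]
# Hypothetical gate: one weight vector per expert.
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, active = top_k_moe([3.0, 1.0], gate_weights, experts, k=2)
```

Here only experts 0 and 1 are evaluated for the given input; the other two contribute no computation, so capacity can grow with the number of experts while per-input cost stays tied to k.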
