Horizon-wise Learning Paradigm Promotes Gene Splicing Identification
Qi-Jie Li, Qian Sun, Shao-Qun Zhang
TL;DR
This paper tackles gene splicing identification by introducing a horizon-wise paradigm that predicts all positions in a sequence with a single forward pass. It presents H-GSI, a four-component framework consisting of a six-mer tokenizer for pre-processing, a sliding window to manage long sequences, SeqLab-based sequence labeling models, and a predictor that aggregates overlapping outputs via averaging and thresholds. Empirical results on a real Human dataset show H-GSI variants, especially with LSTM/GRU/Transformer backbones, outperforming SpliceAI-10k across multiple metrics, with dynamic thresholds further boosting performance. The work demonstrates improved accuracy and efficiency for long-range sequence modeling in splicing identification and discusses future directions in explainability and knowledge-based integration.
Abstract
Identifying gene splicing is a core task at the intersection of artificial intelligence and bioinformatics. Past decades have seen considerable effort on this problem, from the bio-plausible splicing pattern AT-CG to the well-known SpliceAI. In this paper, we propose a novel framework for gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI). H-GSI follows the horizon-wise identification paradigm and comprises four components: a pre-processing procedure that transforms string data into tensors, a sliding window technique that handles long sequences, the SeqLab model, and the predictor. In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI predicts all positions in a sequence with a single forward computation, improving accuracy and efficiency. Experiments on the real-world Human dataset show that H-GSI outperforms SpliceAI and achieves the best accuracy of 97.20%. The source code is available from this link.
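
The data flow around the sequence labeling model (tokenize, split into windows, average overlapping outputs, threshold) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window size and stride, the single score per position, and the fixed threshold of 0.5 are all assumptions made for illustration, and `scores_fn` stands in for whatever SeqLab model produces per-position scores. How k-mer tokens are aligned back to nucleotide positions is not addressed here.

```python
from typing import Callable, List, Sequence, Tuple


def kmer_tokens(seq: str, k: int = 6) -> List[str]:
    """Split a nucleotide string into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def windows(n: int, size: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs that together cover positions 0..n-1.

    The last window is shifted back so that it ends exactly at n,
    which means it may overlap its predecessor more than the others.
    """
    if n <= size:
        return [(0, n)]
    starts = list(range(0, n - size + 1, stride))
    if starts[-1] + size < n:
        starts.append(n - size)
    return [(s, s + size) for s in starts]


def predict_positions(
    scores_fn: Callable[[int, int], Sequence[float]],  # hypothetical model wrapper
    n: int,
    size: int,
    stride: int,
    threshold: float = 0.5,  # assumed fixed threshold
) -> Tuple[List[float], List[int]]:
    """Average per-position scores over overlapping windows, then threshold."""
    total = [0.0] * n
    count = [0] * n
    for start, end in windows(n, size, stride):
        # One forward pass yields a score for every position in the window.
        for offset, score in enumerate(scores_fn(start, end)):
            total[start + offset] += score
            count[start + offset] += 1
    mean = [t / c for t, c in zip(total, count)]
    labels = [int(m >= threshold) for m in mean]
    return mean, labels
```

Each window costs one call to `scores_fn` regardless of how many positions it contains, which is the efficiency argument behind predicting all positions at once rather than one position per truncated input.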
