Horizon-wise Learning Paradigm Promotes Gene Splicing Identification

Qi-Jie Li, Qian Sun, Shao-Qun Zhang

TL;DR

This paper tackles gene splicing identification by introducing a horizon-wise paradigm that predicts all positions in a sequence with a single forward pass. It presents H-GSI, a four-component framework consisting of a six-mer tokenizer for pre-processing, a sliding window to manage long sequences, SeqLab-based sequence labeling models, and a predictor that aggregates overlapping outputs via averaging and thresholds. Empirical results on a real Human dataset show H-GSI variants, especially with LSTM/GRU/Transformer backbones, outperforming SpliceAI-10k across multiple metrics, with dynamic thresholds further boosting performance. The work demonstrates improved accuracy and efficiency for long-range sequence modeling in splicing identification and discusses future directions in explainability and knowledge-based integration.

Abstract

Identifying gene splicing is a core task at the intersection of artificial intelligence and bioinformatics. Past decades have seen substantial effort on this problem, including the bio-plausible splicing pattern AT-CG and the well-known SpliceAI. In this paper, we propose a novel framework for gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI). H-GSI follows the horizon-wise identification paradigm and comprises four components: a pre-processing procedure that transforms string data into tensors, a sliding window technique for handling long sequences, the SeqLab model, and the predictor. In contrast to existing studies that process gene information as a truncated fixed-length sequence, H-GSI predicts all positions in a sequence with a single forward computation, improving both accuracy and efficiency. Experiments on the real-world Human dataset show that H-GSI outperforms SpliceAI and achieves the best accuracy of 97.20%. The source code is available from this link.
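
The first two components named in the abstract (pre-processing and the sliding window) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary layout, the choice of non-overlapping six-mers, the dropping of a trailing partial six-mer, and the `stride` parameter are all assumptions made for illustration, and the names `VOCAB`, `encode`, and `sliding_windows` are hypothetical.

```python
from itertools import product

K = 6  # six-mer size, as named in the TL;DR

# Hypothetical vocabulary: each six-mer over ACGT gets an integer id (4**6 ids).
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def encode(seq):
    """Map a nucleotide string to integer ids, one per six-mer.

    Assumption: six-mers do not overlap, and a trailing partial
    six-mer is dropped to keep the sketch short.
    """
    usable = len(seq) - len(seq) % K
    return [VOCAB[seq[i:i + K]] for i in range(0, usable, K)]


def sliding_windows(tokens, w, stride):
    """Return (start, window) pairs of size w over the token list.

    Assumption: if the stride does not land exactly on the tail,
    one extra window is shifted back so the last tokens are covered.
    """
    if len(tokens) <= w:
        return [(0, tokens)]
    starts = list(range(0, len(tokens) - w + 1, stride))
    if starts[-1] != len(tokens) - w:
        starts.append(len(tokens) - w)
    return [(s, tokens[s:s + w]) for s in starts]


tokens = encode("ACGTAC" * 10)             # 60 nucleotides give 10 tokens
windows = sliding_windows(tokens, w=4, stride=3)
```

With a stride smaller than the window size, neighbouring windows share tokens, which is the situation in which overlapping outputs need to be combined by the predictor.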

Paper Structure (15 sections, 10 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Graphical illustration of the gene expression procedure, which consists of four crucial steps: transcription, splicing, translation, and folding. Splicing occurs between transcription and translation, retaining the exons and cutting out the introns.
  • Figure 2: The workflow of our method H-GSI. First, a nucleotide sequence of length $N$ is encoded into an integer vector of length $n$. The sliding window technique is then applied, giving $m$ windows of size $w$ as a mini-batch. The integer vectors are used to retrieve embeddings of dimension $d$, which form the input of the SeqLab model. The outputs are 6-dimensional real vectors of logits. To resolve overlapping outputs, the logits in overlapping segments are averaged. The prediction vectors are computed with the sigmoid function and then thresholded, using either fixed or dynamic thresholds. Finally, the inference results are obtained by flattening the prediction vectors.
  • Figure 3: The difference between the point-wise identification paradigm and the horizon-wise identification paradigm.
  • Figure 4: Illustrations of (a) Encoding and (b) Prediction.
  • Figure 5: The histogram of sequence lengths in the Human dataset. The x-axis is log-transformed, so the distribution looks Gaussian, but it is intrinsically long-tailed.
  • ...and 3 more figures
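
The prediction stage described in the Figure 2 caption (average logits where windows overlap, apply the sigmoid, threshold, flatten) can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's code: the function names are hypothetical, only a fixed threshold is shown because the caption does not define the dynamic one, and treating the 6 logits of a token as one value per nucleotide of the six-mer is an interpretation of "flattened", not something the caption states outright.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def average_overlaps(window_logits, starts, n):
    """Average logit vectors for tokens covered by more than one window.

    window_logits[j][t] is the logit vector for token starts[j] + t.
    Assumption: the windows together cover all n token positions.
    """
    dim = len(window_logits[0][0])
    sums = [[0.0] * dim for _ in range(n)]
    counts = [0] * n
    for start, logits in zip(starts, window_logits):
        for t, vec in enumerate(logits):
            counts[start + t] += 1
            for c, value in enumerate(vec):
                sums[start + t][c] += value
    return [[s / counts[i] for s in sums[i]] for i in range(n)]


def predict(avg_logits, threshold=0.5):
    """Sigmoid, fixed threshold, then flatten token vectors into one list."""
    return [
        1 if sigmoid(value) >= threshold else 0
        for vec in avg_logits
        for value in vec
    ]


# Two windows of size 2 over 3 tokens; the middle token is seen twice.
window_logits = [[[2.0] * 6, [4.0] * 6], [[0.0] * 6, [-2.0] * 6]]
avg = average_overlaps(window_logits, starts=[0, 1], n=3)
labels = predict(avg)
```

In this toy input the middle token receives logits 4.0 and 0.0 from the two windows, so its averaged logit is 2.0 before the sigmoid is applied.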