Entropy-Rank Ratio: A Novel Entropy-Based Perspective for DNA Complexity and Classification

Emmanuel Pio Pastore; Giuseppe Passarino; Peppino Sapia; Francesco De Rango

TL;DR

This work addresses the saturation of Shannon entropy for long DNA sequences by introducing the entropy-rank ratio $R$, which normalizes a sequence's entropy within the complete distribution of all sequences of the same length under fixed block parameters $T$ and $n$. By partitioning sequences into fixed-length blocks and non-overlapping $n$-tuples, the authors derive a distribution $G_{T,n}$ of block entropies and define $R$ as a percentile within the $N$-block mean entropy distribution, yielding a robust, comparable measure in $[0,1]$ even as traditional entropy saturates. They integrate $R$ into data augmentation for CNNs via ratio-guided cropping, comparing it against random, Kolmogorov-based, and entropy-based crops. On two datasets (viral genes and human genes with polynucleotide expansions) they demonstrate substantial gains in classification accuracy with lightweight CNN architectures, highlighting the practical potential of distribution-aware entropy for DNA sequence analysis and compact device deployment. Overall, the entropy-rank framework provides a normalized perspective on sequence complexity that enhances discriminability and paves the way for efficient, deployable genomic classifiers.

Abstract

Shannon entropy is widely used to measure the complexity of DNA sequences but suffers from saturation effects that limit its discriminative power for long uniform segments. We introduce a novel metric, the entropy-rank ratio $R$, which positions a target sequence within the full distribution of all possible sequences of the same length by computing the proportion of sequences that have an entropy value equal to or lower than that of the target. In other words, $R$ expresses the relative position of a sequence within the global entropy spectrum, assigning values close to 0 for highly ordered sequences and close to 1 for highly disordered ones. DNA sequences are partitioned into fixed-length subsequences and non-overlapping $n$-mer groups; frequency vectors become ordered integer partitions, and a combinatorial framework is used to derive the complete entropy distribution. Unlike classical measures, $R$ is a normalized, distribution-aware measure bounded in $[0,1]$ at fixed $(T,n)$, which avoids saturation to $\log_2 4$ and makes values comparable across sequences under the same settings. We integrate $R$ into data augmentation for convolutional neural networks by proposing ratio-guided cropping techniques and benchmark them against random, entropy-based, and compression-based methods. On two independent datasets, viral genes and human genes with polynucleotide expansions, models augmented via $R$ achieve substantial gains in classification accuracy using extremely lightweight architectures.
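To make the definition concrete, the following is a minimal sketch of a rank ratio for a single block, not the authors' implementation. It assumes a block of length $T$ split into $T/n$ non-overlapping $n$-mers, a uniform distribution over all $4^T$ possible blocks, and it ignores the $N$-block mean entropy distribution mentioned in the TL;DR. The function names (`partitions`, `entropy_distribution`, `rank_ratio`) are hypothetical. Rather than enumerating every sequence, it groups frequency vectors by integer partition and counts how many sequences realize each partition.

```python
from collections import Counter
from math import factorial, log2


def partitions(total, max_parts, largest=None):
    """Yield partitions of `total` into at most `max_parts` non-increasing parts."""
    if largest is None:
        largest = total
    if total == 0:
        yield ()
        return
    if max_parts == 0:
        return
    for first in range(min(total, largest), 0, -1):
        for rest in partitions(total - first, max_parts - 1, first):
            yield (first,) + rest


def entropy(counts):
    """Shannon entropy (bits) of a frequency vector given as raw counts."""
    m = sum(counts)
    return -sum(c / m * log2(c / m) for c in counts if c > 0)


def sequences_per_partition(parts, k):
    """Number of length-m symbol strings over k symbols whose sorted counts equal `parts`."""
    # Orderings of the n-mers along the block once the counts are fixed.
    arrangements = factorial(sum(parts))
    for c in parts:
        arrangements //= factorial(c)
    # Ways to assign the parts to distinct symbols of the k-letter alphabet.
    assignments = factorial(k) // factorial(k - len(parts))
    for multiplicity in Counter(parts).values():
        assignments //= factorial(multiplicity)
    return arrangements * assignments


def entropy_distribution(T, n):
    """Return sorted (entropy, sequence_count) pairs for blocks of length T and n-mers."""
    if T % n != 0:
        raise ValueError("T must be a multiple of n in this sketch")
    m, k = T // n, 4 ** n  # n-mers per block, number of distinct n-mers
    return sorted((entropy(p), sequences_per_partition(p, k)) for p in partitions(m, k))


def rank_ratio(block, n, tol=1e-12):
    """Fraction of all blocks of the same length with entropy <= that of `block`."""
    T = len(block)
    dist = entropy_distribution(T, n)
    kmers = [block[i:i + n] for i in range(0, T, n)]
    # Sort counts in non-increasing order so floats match the partition enumeration.
    parts = tuple(sorted(Counter(kmers).values(), reverse=True))
    h = entropy(parts)
    below = sum(count for value, count in dist if value <= h + tol)
    return below / 4 ** T


if __name__ == "__main__":
    for block in ("AAAA", "AACC", "ACGT"):
        print(block, rank_ratio(block, n=1))
```

With $T = 4$ and $n = 1$, the five partitions of 4 account for all $4^4 = 256$ blocks, so a homopolymer such as `AAAA` gets $R = 4/256$ while `ACGT` gets $R = 1$. Grouping by partition keeps the enumeration small for short blocks, but the recursion is not optimized for large $T$.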
