Entropy-Rank Ratio: A Novel Entropy-Based Perspective for DNA Complexity and Classification

Emmanuel Pio Pastore, Giuseppe Passarino, Peppino Sapia, Francesco De Rango

TL;DR

This work addresses the saturation of Shannon entropy for long DNA sequences by introducing the entropy-rank ratio $R$, which normalizes a sequence's entropy within the complete distribution of all sequences of the same length under fixed block parameters $T$ and $n$. By partitioning sequences into fixed-length blocks and non-overlapping $n$-tuples, the authors derive a distribution $G_{T,n}$ of block entropies and define $R$ as a percentile within the $N$-block mean entropy distribution, yielding a robust, comparable measure in $[0,1]$ even as traditional entropy saturates. They integrate $R$ into data augmentation for CNNs via ratio-guided cropping, comparing it against random, Kolmogorov-based, and entropy-based crops. On two datasets (viral genes and human genes with polynucleotide expansions) they demonstrate substantial gains in classification accuracy with lightweight CNN architectures, highlighting the practical potential of distribution-aware entropy for DNA sequence analysis and compact device deployment. Overall, the entropy-rank framework provides a normalized perspective on sequence complexity that enhances discriminability and paves the way for efficient, deployable genomic classifiers.

Abstract

Shannon entropy is widely used to measure the complexity of DNA sequences but suffers from saturation effects that limit its discriminative power for long uniform segments. We introduce a novel metric, the entropy-rank ratio $R$, which positions a target sequence within the full distribution of all possible sequences of the same length by computing the proportion of sequences that have an entropy value equal to or lower than that of the target. In other words, $R$ expresses the relative position of a sequence within the global entropy spectrum, assigning values close to 0 for highly ordered sequences and close to 1 for highly disordered ones. DNA sequences are partitioned into fixed-length subsequences and non-overlapping $n$-mer groups; frequency vectors become ordered integer partitions and a combinatorial framework is used to derive the complete entropy distribution. Unlike classical measures, $R$ is a normalized, distribution-aware measure bounded in $[0,1]$ at fixed $(T,n)$, which avoids saturation to $\log_2 4$ and makes values comparable across sequences under the same settings. We integrate $R$ into data augmentation for convolutional neural networks by proposing ratio-guided cropping techniques and benchmark them against random, entropy-based, and compression-based methods. On two independent datasets, viral genes and human genes with polynucleotide expansions, models augmented via $R$ achieve substantial gains in classification accuracy using extremely lightweight architectures.
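To make the abstract's definition concrete, the following is a minimal Python sketch of the idea for a single block ($N=1$). It enumerates ordered integer partitions of the tuple count, weights each by its multinomial coefficient, and reads $R$ off as the fraction of blocks whose entropy does not exceed the target's. The function names (`compositions`, `block_entropy_distribution`, `entropy_rank_ratio`) are hypothetical, the restriction to one block and to $T$ divisible by $n$ is a simplifying assumption, and this is not the authors' implementation, which works with the $N$-block mean entropy distribution.

```python
from collections import Counter
from math import factorial, log2


def compositions(total, parts):
    """Yield every ordered tuple of `parts` non-negative ints summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def entropy_from_counts(counts):
    """Shannon entropy (bits) of the empirical distribution given by `counts`."""
    m = sum(counts)
    return -sum(c / m * log2(c / m) for c in counts if c > 0)


def block_entropy_distribution(T, n=1):
    """Map each attainable entropy value to the number of length-T blocks having it.

    Assumes T is a multiple of n. Only practical for small n: the number of
    count vectors grows quickly with the tuple alphabet size 4**n.
    """
    if T % n:
        raise ValueError("this sketch assumes T is a multiple of n")
    m = T // n      # non-overlapping n-tuples per block
    lam = 4 ** n    # number of distinct n-tuples
    dist = Counter()
    for counts in compositions(m, lam):
        ways = factorial(m)          # multinomial coefficient m! / (c1! c2! ...)
        for c in counts:
            ways //= factorial(c)
        # Round so that equal entropies from different count vectors share a key.
        dist[round(entropy_from_counts(counts), 12)] += ways
    return dist


def block_entropy(block, n=1):
    """Entropy of the non-overlapping n-tuples of one block."""
    tuples = [block[i:i + n] for i in range(0, len(block) - n + 1, n)]
    return entropy_from_counts(list(Counter(tuples).values()))


def entropy_rank_ratio(block, n=1):
    """Fraction of all blocks of the same length with entropy <= that of `block`."""
    dist = block_entropy_distribution(len(block), n)
    target = block_entropy(block, n)
    at_or_below = sum(w for s, w in dist.items() if s <= target + 1e-9)
    return at_or_below / sum(dist.values())


if __name__ == "__main__":
    print(entropy_rank_ratio("AAAA"))  # 4 of 256 blocks have zero entropy: 0.015625
    print(entropy_rank_ratio("ACGT"))  # maximal entropy, so every block counts: 1.0
```

Because the ratio is a cumulative fraction over all $4^T$ blocks, it stays in $[0,1]$ regardless of how close the raw entropy sits to its maximum, which is the normalization property the abstract emphasizes.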

Paper Structure

This paper contains 18 sections, 4 theorems, 47 equations, 10 figures, 4 tables, 7 algorithms.

Key Result

Theorem 2.1

Fix $T,n$ and let $\lambda=4^n$. For sequences $o,v$ with lengths $L_o,L_v$, write $L_o=N_oT+r_o$ and $L_v=N_vT+r_v$ with $0\le r_o,r_v<T$. Let $S_o$ (resp. $S_v$) be the mean block entropy over the $N_o$ (resp. $N_v$) full $T$-blocks of $o$ (resp. $v$). Form $w=o\|v$ and let $S_w$ be the mean over […]. Moreover, with […] we have the bound […]. In particular, when $r_o=r_v=0$, the identity $\theta=S_w$ holds.
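The zero-remainder case of the theorem can be checked numerically. The sketch below assumes that $S_w$ is the mean block entropy over the full $T$-blocks of $w$ and that $\theta$ is the block-count-weighted average $(N_oS_o+N_vS_v)/(N_o+N_v)$. Both are our reading of the statement rather than formulas shown above, and the helper name `mean_block_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def mean_block_entropy(seq, T, n=1):
    """Return (S, N): mean entropy over the N full T-blocks of `seq`.

    Any trailing remainder shorter than T is ignored.
    """
    N = len(seq) // T
    total = 0.0
    for b in range(N):
        block = seq[b * T:(b + 1) * T]
        tuples = [block[i:i + n] for i in range(0, T - n + 1, n)]
        m = len(tuples)
        total += -sum(c / m * log2(c / m) for c in Counter(tuples).values())
    return (total / N if N else 0.0), N


T = 8
o = "ACGTACGT" + "AAAAAAAA"   # two full blocks, r_o = 0
v = "AACCGGTT"                # one full block,  r_v = 0

S_o, N_o = mean_block_entropy(o, T)
S_v, N_v = mean_block_entropy(v, T)
S_w, N_w = mean_block_entropy(o + v, T)

# Assumed definition of theta: block-count-weighted average of S_o and S_v.
theta = (N_o * S_o + N_v * S_v) / (N_o + N_v)
print(S_w, theta)

# With a non-zero remainder the blocks of w no longer align with those of v,
# so S_w and theta can differ.
S_w_shifted, _ = mean_block_entropy(o + "AC" + v, T)
print(S_w_shifted)
```

With $r_o=r_v=0$, the blocks of $w$ are exactly the blocks of $o$ followed by the blocks of $v$, which is why the two printed values in the first line agree.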

Figures (10)

  • Figure 1: Average entropy $S_w$ of a 1000-base random DNA sequence versus the number of non-overlapping subsequences $N$, computed with single bases ($n=1$). Each point is the mean over 50 independent random sequences; the decay is nonlinear and concave, and $S_w\to0$ once $T=1$ base.
  • Figure 2: Average entropy $S_w$ for the same 1000-base sequences, using triplets ($n=3$). Points are averaged over 50 sequences; the entropy falls steeply and reaches zero when blocks no longer contain a full triplet ($T<3$).
  • Figure 3: Shannon entropy $S$ of a 1000-base random DNA sequence versus the $n$-tuple length $n$ (no blocking, $N=1$), averaged over 50 sequences. The curve peaks near $n\approx6$ and then falls off as $4^n$ approaches the sample size.
  • Figure 4: Distribution $G$ of occurrences $O$ as a function of entropy values $S$ (plotted on the horizontal axis, arranged in increasing order from 0 to 2.0), using $T=20$ and $n=1$. As described earlier, $G$ is a discrete distribution.
  • Figure 5: Distribution of the six viral classes (training_set.csv) on the GC content–Kolmogorov complexity plane, with complexity estimated via zlib compression.
  • ...and 5 more figures
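The blocking behaviour described in the captions of Figures 1 and 2 can be explored with a small simulation. This is a sketch under stated assumptions, not the authors' experiment: it uses one seeded random sequence instead of an average over 50, the helper name `mean_entropy_for_block_count` is hypothetical, and it adopts the convention that a block too short to hold a full $n$-tuple contributes zero entropy.

```python
import random
from collections import Counter
from math import log2


def mean_entropy_for_block_count(seq, N, n=1):
    """Mean Shannon entropy (bits) over N equal, non-overlapping blocks of `seq`."""
    T = len(seq) // N   # block length
    m = T // n          # non-overlapping n-tuples per block
    if m == 0:
        return 0.0      # assumed convention: no full n-tuple fits in a block
    total = 0.0
    for b in range(N):
        block = seq[b * T:(b + 1) * T]
        counts = Counter(block[i:i + n] for i in range(0, m * n, n))
        total += -sum(c / m * log2(c / m) for c in counts.values())
    return total / N


rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(1000))

# Each block count divides the next, so every partition refines the previous one.
block_counts = [1, 2, 4, 8, 40, 200, 1000]
curve = [mean_entropy_for_block_count(seq, N) for N in block_counts]

for N, S in zip(block_counts, curve):
    print(f"N={N:5d}  T={1000 // N:5d}  mean block entropy={S:.3f}")
```

Because each partition here refines the previous one, concavity of entropy guarantees that the mean block entropy cannot increase along this list, and it reaches zero at $T=1$, matching the endpoint described for Figure 1.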

Theorems & Definitions (23)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Definition 2.6
  • Definition 2.7
  • Definition 2.8
  • Definition 2.9
  • Definition 2.10
  • ...and 13 more