A Tsallis-Entropy Lens on Genetic Variation
Margarita Geleta, Daniel Mas Montserrat, Alexander G. Ioannidis
TL;DR
The paper tackles the bias of variance-based differentiation metrics toward common alleles by introducing the Tsallis-order $q$ F-statistic, $F_q$, which uses Tsallis entropy to interpolate smoothly between a Shannon-entropy differentiation measure ($q=1$) and the classical heterozygosity-based $F_{\text{ST}}$ ($q=2$). It defines per-locus and genome-wide differentiation via an absolute quantity $\Delta_q$ and a relative quantity $F_q$, proves non-negativity and bounds, and shows that $q=2$ recovers $F_{\text{ST}}$ while $q=1$ yields a mutual-information-based differentiation. The authors further develop One-vs-Rest (OVR) and Leave-One-Out (LOO) diagnostics to attribute regional structure and quantify each population's contribution, then validate the framework on Oceanian haplotypes and on simulations seeded from HGDP/1000 Genomes founders. Across both real data and simulations, $F_q$ captures how differentiation is shaped by rare vs. common variants and provides a finer-grained, auditable view of demographic events such as founder effects, drift, and isolation–reconnection dynamics. The approach offers a practical, more nuanced complement to $F_{\text{ST}}$ for population-structure summaries and simulator audits, especially when allele-frequency spectra are skewed.
Abstract
We introduce an information-theoretic generalization of the fixation statistic, the Tsallis-order $q$ F-statistic, $F_q$, which measures the fraction of Tsallis $q$-entropy lost within subpopulations relative to the pooled population. The family nests the classical variance-based fixation index $F_{\text{ST}}$ at $q{=}2$ and a Shannon-entropy analogue at $q{=}1$, whose absolute form equals the mutual information between alleles and population labels. By varying $q$, $F_q$ acts as a spectral differentiator that up-weights rare variants at low $q$, while $q{>}1$ increasingly emphasizes common variants, providing a finer-grained view of differentiation than $F_{\text{ST}}$ when allele-frequency spectra are skewed. On real data (865 Oceanian genomes with 1,823,000 sites) and controlled genealogical simulations (seeded from 1,432 founders from the HGDP and 1000 Genomes panels, with 322,216 sites), we show that $F_q$ in One-vs-Rest (OVR) and Leave-One-Out (LOO) modes provides clear attribution of which subpopulations drive regional structure, and sensitively timestamps isolation-migration events and founder effects. $F_q$ serves as a finer-resolution complement to $F_{\text{ST}}$ for simulation audits and population-structure summaries.
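To make the verbal definition concrete, the following is a minimal per-locus sketch and not the authors' implementation. It assumes the standard Tsallis entropy $S_q(p) = (1 - \sum_a p_a^q)/(q-1)$ with the Shannon limit (natural log) at $q=1$, a pooled frequency vector formed as the weighted mixture of subpopulation frequencies, and equal subpopulation weights by default. The function names `tsallis_entropy` and `f_q` and the `weights` argument are hypothetical choices made for illustration.

```python
import numpy as np


def tsallis_entropy(p, q):
    """Tsallis q-entropy of a probability vector (Shannon entropy in nats at q = 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # alleles with zero frequency contribute nothing
    if np.isclose(q, 1.0):
        return float(-np.sum(p * np.log(p)))  # Shannon limit of the Tsallis family
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))


def f_q(freqs, q, weights=None):
    """Return (delta_q, F_q) for a single locus.

    freqs   : array of shape (K, A); row k holds the allele frequencies of
              subpopulation k and sums to 1.
    weights : subpopulation weights (an assumption of this sketch);
              defaults to equal weights and is normalized to sum to 1.
    """
    freqs = np.asarray(freqs, dtype=float)
    k = freqs.shape[0]
    if weights is None:
        w = np.full(k, 1.0 / k)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    pooled = w @ freqs  # pooled allele frequencies as a weighted mixture
    s_total = tsallis_entropy(pooled, q)
    s_within = sum(wi * tsallis_entropy(pi, q) for wi, pi in zip(w, freqs))
    delta = s_total - s_within  # absolute entropy lost within subpopulations
    rel = delta / s_total if s_total > 0 else 0.0  # relative form F_q
    return delta, rel


# Biallelic locus with allele frequencies 0.1 and 0.5 in two subpopulations.
freqs = np.array([[0.1, 0.9], [0.5, 0.5]])
for q in (0.5, 1.0, 2.0):
    delta, rel = f_q(freqs, q)
    print(f"q={q}: delta_q={delta:.4f}, F_q={rel:.4f}")
```

Under these assumptions, the $q=2$ value for this locus is $0.08/0.42 \approx 0.1905$, which matches the heterozygosity-based ratio $(H_T - H_S)/H_T$, and the $q=1$ absolute value equals the mutual information between allele and population label when the weights are read as label probabilities. Sample-size weighting and genome-wide aggregation across loci are left out of the sketch.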
