A Tsallis-Entropy Lens on Genetic Variation

Margarita Geleta, Daniel Mas Montserrat, Alexander G. Ioannidis

TL;DR

The paper tackles the bias of variance-based differentiation metrics toward common alleles by introducing the Tsallis-order $q$ F-statistic, $F_q$, which leverages Tsallis entropy to smoothly interpolate between a Shannon-differentiation ($q=1$) and the classical heterozygosity-based $F_{\text{ST}}$ ($q=2$). It defines per-locus and genome-wide differentiation via absolute and relative quantities $\Delta_q$ and $F_q$, proves non-negativity and bounds, and shows that $q=2$ recovers $F_{\text{ST}}$ while $q=1$ yields mutual information-based differentiation. The authors further develop One-vs-Rest (OVR) and Leave-One-Out (LOO) diagnostics to attribute regional structure and quantify each population's contribution, then validate the framework on Oceanian haplotypes and on simulations seeded from HGDP/1000 Genomes founders. Across both real data and simulations, $F_q$ captures how differentiation is shaped by rare vs. common variants and provides a finer-grained, auditable view of demographic events like founder effects, drift, and isolation–reconnection dynamics. This approach offers a practical, more nuanced complement to $F_{\text{ST}}$ for population-structure summaries and simulator audits, especially when allele-frequency spectra are skewed.

Abstract

We introduce an information-theoretic generalization of the fixation statistic, the Tsallis-order $q$ F-statistic, $F_q$, which measures the fraction of Tsallis $q$-entropy lost within subpopulations relative to the pooled population. The family nests the classical variance-based fixation index $F_{\text{ST}}$ at $q{=}2$ and a Shannon-entropy analogue at $q{=}1$, whose absolute form equals the mutual information between alleles and population labels. By varying $q$, $F_q$ acts as a spectral differentiator that up-weights rare variants at low $q$, while $q{>}1$ increasingly emphasizes common variants, providing a more fine-grained view of differentiation than $F_{\text{ST}}$ when allele-frequency spectra are skewed. On real data (865 Oceanian genomes with 1,823,000 sites) and controlled genealogical simulations (seeded from 1,432 founders from HGDP and 1000 Genomes panels, with 322,216 sites), we show that $F_q$ in One-vs-Rest (OVR) and Leave-One-Out (LOO) modes provides clear attribution of which subpopulations drive regional structure, and sensitively timestamps isolation-migration events and founder effects. $F_q$ serves as a finer-resolution complement for simulation audits and population-structure summaries.
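To make the definition concrete, the following is a minimal per-locus sketch of $\Delta_q$ and $F_q$ as the abstract describes them: the Tsallis $q$-entropy of the pooled allele frequencies minus the weighted mean entropy within subpopulations, divided by the pooled entropy. The function names (`tsallis_entropy`, `f_q`), the array layout, and the weighting by user-supplied subpopulation weights are assumptions made for illustration; the paper's exact estimator and genome-wide averaging may differ.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a probability vector p (Shannon entropy in nats at q = 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-frequency alleles contribute nothing
    if np.isclose(q, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def f_q(freqs, weights, q):
    """Return (delta_q, F_q) for a single locus.

    freqs:   (K, A) array of allele frequencies, K subpopulations by A alleles
    weights: (K,) subpopulation weights (an assumption here; normalized to sum to 1)
    """
    freqs = np.asarray(freqs, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    pooled = w @ freqs  # pooled allele frequencies
    h_total = tsallis_entropy(pooled, q)
    h_within = sum(wk * tsallis_entropy(pk, q) for wk, pk in zip(w, freqs))
    delta = h_total - h_within  # absolute differentiation
    rel = delta / h_total if h_total > 0 else 0.0  # monomorphic locus: define as 0
    return delta, rel
```

Under these assumptions, `f_q(freqs, weights, 2.0)` at a biallelic locus reduces to the familiar variance ratio $\mathrm{Var}(p)/[\bar{p}(1-\bar{p})]$, and the first return value at `q=1.0` is the mutual information between allele and population label.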

Paper Structure

This paper contains 23 sections, 12 equations, 2 figures.

Figures (2)

  • Figure 1: Regional differentiation profiles $\bm{F_q}$ (OVR) and $\bm{\Delta F_q}$ (LOO) across the Pacific and Southeast Asia. (A, B, D, E): One–vs–Rest $F_q$ (left of each pair) and leave–one–out influence $\Delta F_q^{\text{LOO}}$ (right of each pair) for Polynesia, Micronesia, Melanesia, and Southeast Asia. Lines show Tsallis $q$-entropy colored by population; shaded ribbons are bootstrap 95% CIs from resampling. Equal–country weighting is used within each macroregion to reduce sample–size imbalance. (C): Locator map with sampling sites (black points) and region polygons used for grouping.
  • Figure 2: Time–series behavior of $\bm{F_q}$ (OVR) and $\bm{\Delta F_q}$ (LOO) under controlled mating policies. (A.1, B.1): One–vs–Rest (OVR) $F_q$ across generations for three demes (WA, EA, CSN); solid lines: $q{=}2$ (heterozygosity/second–order), dashed: $q{=}1$ (Shannon/first–order). (A.2, B.2): Leave–one–out (LOO) influence $\Delta F_q^{\text{LOO}}$, measuring each deme's contribution to between–deme differentiation. (A.3, B.3): Haplotype sample counts per deme generated at each generation. (A.3): Baseline drift with deme–specific random–mating probabilities $\rho$ changed at generation 8 (red dashed line) from $(\rho_{\textsf{WA}},\rho_{\textsf{EA}},\rho_{\textsf{CSN}})=(0.3,0.5,0.1)$ to $(0.1,0.6,0.3)$. (B.3): Isolation–reconnection with $\rho$ set to $(0.5,0.5,0.5)$ initially, then near–isolation $(0.05,0.05,0.05)$ at generation 8 and strong exogamy $(0.9,0.9,0.9)$ at generation 14 (red dashed lines). Curves are genome–wide micro–averages with equal weights across demes.
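
The two diagnostics named in the captions can be sketched for a single locus as follows. This page does not give the paper's formulas, so the definitions below are assumptions: OVR compares deme $k$ with the weighted pool of all other demes, and the LOO influence is taken as $F_q$ over all demes minus $F_q$ with deme $k$ removed. The helper names (`f_q_rel`, `ovr_f_q`, `loo_delta_f_q`) are hypothetical, and the sign convention and weighting used by the authors may differ.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a probability vector (Shannon entropy in nats at q = 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def f_q_rel(freqs, weights, q):
    """Relative differentiation F_q at one locus for (K, A) frequencies and (K,) weights."""
    freqs = np.asarray(freqs, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    h_total = tsallis_entropy(w @ freqs, q)
    h_within = sum(wk * tsallis_entropy(pk, q) for wk, pk in zip(w, freqs))
    return (h_total - h_within) / h_total if h_total > 0 else 0.0

def ovr_f_q(freqs, weights, k, q):
    """One-vs-Rest (assumed form): deme k against the weighted pool of the other demes."""
    freqs = np.asarray(freqs, dtype=float)
    w = np.asarray(weights, dtype=float)
    rest = np.arange(len(w)) != k
    rest_freq = (w[rest] @ freqs[rest]) / w[rest].sum()
    two_groups = np.vstack([freqs[k], rest_freq])
    return f_q_rel(two_groups, [w[k], w[rest].sum()], q)

def loo_delta_f_q(freqs, weights, k, q):
    """Leave-One-Out influence (assumed form): F_q(all demes) - F_q(all demes except k)."""
    freqs = np.asarray(freqs, dtype=float)
    w = np.asarray(weights, dtype=float)
    keep = np.arange(len(w)) != k
    return f_q_rel(freqs, w, q) - f_q_rel(freqs[keep], w[keep], q)
```

With this convention, a deme whose removal collapses the remaining structure gets a large positive LOO value, while a deme that resembles others can receive a value near zero or below it.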