Genomics, bioinformatics, and quantitative methods in biology
Stochastic chemical reaction networks (SRNs) in cellular systems are commonly modeled as continuous-time Markov chains (CTMCs) describing the dynamics of molecular copy numbers. The exact evaluation of transient copy number statistics is, however, often hindered by a non-closed hierarchy of moment equations. In this paper, we propose a method for computing theoretically guaranteed upper and lower bounds on transient moments based on the Kolmogorov's backward equation, which provides a dual representation of the CME, the governing equation for the probability distribution of the CTMC. This dual formulation avoids the moment closure problem by shifting the source of infinite dimensionality to the dependence on the initial state. We show that, this dual formulation, combined with the monotonicity of the CTMC generator, leads to a finite-dimensional linear time-invariant system that provides bounds on transient moments. The resulting system enables efficient evaluation of moment bounds across multiple initial conditions by simple inner-product operations without recomputing the bounding system. Further, for certain classes of SRNs, the bounding ODEs admit explicit construction from the reaction model, providing a systematic and constructive framework for computing provable bounds.
Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, and SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) being negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h^2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2, with both being non-significant. Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.
Effectively modeling irregularly sampled longitudinal data is essential for understanding disease progression and improving risk prediction. We propose a two-view mixture model that integrates static baseline covariates and longitudinal biomarker trajectories within a unified probabilistic clustering framework. Temporal patterns are modeled using Neural Ordinary Differential Equations. Model training uses an EM algorithm with a sparsity-inducing log-penalty for interpretable subgroup discovery. Application of the model to an Irish cohort of ANCA-associated vasculitis patients reveals subgroups with heterogeneous serum creatinine trajectories and variation in end-stage kidney disease outcomes.
The evolutionary and ecological dynamics of tumors under immune responses and therapeutic interventions pose major challenges to long-term treatment success. Although treatment may initially achieve short-term disease control, resistant cancer cell subpopulations often arise, leading to relapse with more aggressive and treatment-resistant forms of the disease. Here, we develop and analyze mathematical models describing the interactions among effector cells, chemo-resistant tumor cells, and immuno-resistant tumor cells under distinct immune-evasion strategies. The models incorporate competition and cooperation between resistant and sensitive tumor subpopulations. We identify threshold conditions governing tumor persistence, elimination, and phenotype dominance under varying therapeutic intensities. These findings provide a theoretical framework for designing targeted and combination therapies and offer insights into strategies for mitigating the treatment resistance.
Trajectory inference is a critical problem in single-cell transcriptomics, which aims to reconstruct the dynamic process underlying a population of cells from sequencing data. Of particular interest is the reconstruction of differentiation trees. One way of doing this is by estimating the path distance between nodes -- labeled by cells -- based on cell similarities observed in the sequencing data. Recent sequencing techniques make it possible to measure two types of data: gene expression levels, and RNA velocity, a vector that quantifies variation in gene expression. The sequencing data then consist in a discrete vector field in dimension the number of genes of interest. In this article, we present a novel method for inferring differentiation trees from RNA velocity fields using a distance-based approach. In particular, we introduce a cell dissimilarity measure defined as the squared varifold distance between the integral curves of the RNA velocity field, which we show is a robust estimate of the path distance on the target differentiation tree. Upstream of the dissimilarity measure calculation, we also implement comprehensive routines for the preprocessing and integration of the RNA velocity field. Finally, we illustrate the ability of our method to recover differentiation trees with high accuracy on several simulated and real datasets, and compare these results with the state of the art.
Prion-like propagation of misfolded proteins is a key mechanism underlying the progression of neurodegenerative diseases such as Alzheimer's disease. In previous work, we introduced the HeMiTo framework, describing these prion-like dynamics for a class of heterodimer models in terms of three phases: the healthy (He), mixed (Mi), and toxic (To) phases. While the He-phase was characterised analytically, the Mi-phase was described numerically and the To-phase was inferred from linear stability arguments. In this work, we provide a complete analytical characterisation of the Mi- and To-phases for our class of heterodimer models. We derive exact inner solutions governing the Mi-phase and match them with outer solutions from the He-phase, explaining the concave-like behaviour of the healthy species and establishing explicit conditions for exponential growth of the toxic species with a mechanistically interpretable growth rate. Furthermore, we formalise a quasi steady-state reduction near the toxic steady state and show that the dynamics reduce to a logistic growth equation, linking exponential growth to saturation. Together, these results provide a unified and mechanistic description of prion-like dynamics across all phases of disease progression and establish a foundation for predictive modelling of biomarker trajectories.
Mammalian sleep is characterized by multiple alternations between episodes of rapid-eye-movement sleep (REMS) and non-REM sleep (NREMS). While the mechanisms governing the timing of these ultradian NREMS-REMS cycles remain poorly understood, the phenomenon of REMS pressure, namely a drive for REMS that builds up between REMS episodes, is thought to be a contributing factor. Prior analyses of NREMS-REMS cycles in mice has suggested that time in NREMS is a primary contributor to REMS pressure. Building on that finding, we previously introduced a REMS propensity measure defined as the probability to enter REMS before the accumulation of an additional amount of NREMS. Analyzing mouse ultradian cycle data, we showed that REMS propensity at REMS onset was positively correlated with REMS bout duration and with the probability of the occurrence of a REMS bout followed by a short inter-REMS interval, called a sequential REMS cycle. In this paper, we extend our analyses of REMS propensity to human and rat ultradian NREMS-REMS cycle data. We show that, as in mice, human and rat sleep contain both short NREMS-REMS sequential cycles and longer single NREMS-REMS cycles, though there are some differences in the relative distributions of cycle durations. Although rodents exhibit polyphasic sleep in contrast with the consolidated sleep of humans, the calculated REMS propensity measures in all three species show similar profiles as functions of time spent in NREMS: specifically, REMS propensity increases with time spent in NREMS until it reaches a peak value, and then it decays with additional time in NREMS. Positive correlations of REMS propensity at REMS onset with REMS bout duration were present in both human and rat data as in mouse data, suggesting that time spent in NREMS also influences REMS duration in these species.
Mathematical models of natural and man-made systems often have many adjustable parameters that must be estimated from multiple, potentially conflicting datasets. Rather than reporting a single best-fit parameter vector, it is often more informative to generate an ensemble of parameter sets that collectively map out the trade-offs among competing objectives. This paper presents ParetoEnsembles.jl, an open-source Julia package that generates such ensembles using Pareto Optimal Ensemble Techniques (POETs), a simulated-annealing-based algorithm that requires no gradient information. The implementation corrects the original dominance relation from weak to strict Pareto dominance, reduces the per-iteration ranking cost from $O(n^2 m)$ to $O(nm)$ through an incremental update scheme, and adds multi-chain parallel execution for improved front coverage. We demonstrate the package on a cell-free gene expression model fitted to experimental data and a blood coagulation cascade model with ten estimated rate constants and three objectives. A controlled synthetic-data study reveals parameter identifiability structure, with individual rate constants off by several-fold yet model predictions accurate to 7%. A five-replicate coverage analysis confirms that timing features are reliably covered while peak amplitude is systematically overconfident. Validation against published experimental thrombin generation data demonstrates that the ensemble predicts held-out conditions to within 10% despite inherent model approximation error. By making ensemble generation lightweight and accessible, ParetoEnsembles.jl aims to lower the barrier to routine uncertainty characterization in mechanistic modeling.
Functional evidence is essential for clinical interpretation of genomic variants, but identifying relevant studies and translating experimental results into structured evidence remains labor intensive. We developed a benchmark based on ClinGen curated annotations to evaluate two large language models (LLMs), a non reasoning model (gpt-4o-mini) and a reasoning model (o4-mini), on tasks relevant to functional evidence curation: (1) abstract screening to determine whether a study reports functional experiments directly testing specific variants, and (2) full text evidence extraction and classification from matched variant-paper pairs, including interpretation of evidence direction and generation of evidence summaries. Starting from ClinGen variants annotated with functional evidence, we processed curator comments with an LLM to extract PubMed identifiers, evidence labels, and narrative, and retrieved titles, abstracts, and open access PDFs to construct variant-paper pairs. In abstract screening, both models achieved high recall (0.88-0.90) with moderate specificity (0.59-0.65). For full text evidence classification under an explicit variant matching gate, o4-mini achieved 96% accuracy and higher specificity (0.83 vs. 0.37) while maintaining high F1 (0.98 vs. 0.96) compared with gpt-4o-mini. We also used an LLM-as-judge protocol to compare model generated evidence summaries with expert curator comments. Finally, we developed AcmGENTIC, an end to end pipeline that expands variant identifiers, retrieves literature via LitVar2, filters abstracts with LLMs, acquires PDFs, performs multimodal evidence extraction, and generates evidence reports for curator review, with optional agentic parsing of figures and tables. Together, this benchmark and pipeline provide a practical framework for scaling functional evidence curation with human in the loop LLM assistance.
FcsIT is a platform-independent, open-source tool for calculating the correlation and fitting fluorescence correlation spectroscopy data. The software is written in Python and uses a powerful Dear PyGUI engine for its interface. It provides reading and correlating the TTTR data, as well as TCSPC filtering of the photon time-trace data. The circular-block bootstrap method applied to the calculation of correlation data and its variance results in data quality comparable to that obtained with commercially available software. An intuitive fitting interface provides efficient analysis of large datasets and includes nine predefined mathematical models for fitting correlation curves. Moreover, it allows users to add their own models in a user-friendly manner. Validation of the FcsIT tool against simulated FCS data and real FCS experiments confirms its usability and potential appeal to a wide variety of FCS users.
Multi-omic datasets offer opportunities for improved biomarker discovery in cancer research, but their high dimensionality and limited sample sizes make identifying compact and effective biomarker panels challenging. Feature selection in large-scale omics can be efficiently addressed by combining machine learning with genetic algorithms, which naturally support multi-objective optimization of predictive accuracy and biomarker set size. However, genetic algorithms remain relatively underexplored for multi-omic feature selection, where most approaches concatenate all layers into a single feature space. To address this limitation, we introduce Sweeping*, a multi-view, multi-objective algorithm alternating between single- and multi-view optimization. It employs a nested single-view multi-objective optimizer, and for this study we use the genetic algorithm NSGA3-CHS. It first identifies informative biomarkers within each layer, then jointly evaluates cross-layer interactions; these multi-omic solutions guide the next single-view search. Through repeated sweeps, the algorithm progressively identifies compact biomarker panels capturing cross-modal complementary signals. We benchmark five Sweeping* strategies, including hierarchical and concatenation-based variants, using survival prediction on three TCGA cohorts. Each strategy jointly optimizes predictive accuracy and set size, measured via the concordance index and root-leanness. Overall performance and estimation error are assessed through cross hypervolume and Pareto delta under 5-fold cross-validation. Our results show that Sweeping* can improve the accuracy-complexity trade-off when sufficient survival signal is present and that integrating omic layers can enhance survival prediction beyond clinical-only models, although benefits remain cohort-dependent.
Cardiovascular-Kidney-Metabolic (CKM) syndrome represents a growing public health crisis, yet the subclinical heterogeneity of its component systems remains underexplored. Early detection of physiological deviation is critical for preventing irreversible organ damage and mortality. Here, we characterize the prevalence and interplay of CKM impairment in a US cohort (N=841) by integrating continuous wearable data with clinical biomarkers. We assessed cardiovascular, kidney via clinical biomarkers, namely Chol/HDL, eGFR, as well as metabolic health risk through Homeostatic Model Assessment of Insulin Resistance (HOMA-IR). We show that while metabolic and cardiovascular disruptions are significantly associated (r=0.26, p<0.001), early-stage kidney impairment manifests independently. Utilizing a normalized deviance score, we identified significant health impairments in 29.0% of the cohort. Cardiovascular deviation was the most prevalent singular phenotype (13.3%), followed by metabolic (9.1%) and renal (6.25%) deviations, with dual metabolic-cardiovascular impairment occurring in only 2.2% of participants. These findings suggest that high system-specific deviance may serve as an indicator for accelerated physiological aging within the respective organ system. Furthermore, feature ablation analysis revealed that step count, Active Zone Minutes, and resting heart rate are the most potent wearable-derived predictors of cardiovascular and metabolic decline. These findings underscore the necessity of a multi-system subtyping approach, demonstrating that wearable-derived phenotypes can facilitate the early, targeted interventions required to manage the complex landscape of CKM syndrome.
Quantitatively mapping three-dimensional (3D) flow, diffusion, and particle density in crowded living cells remains challenging because most dynamic optical microscopy measurements are effectively planar and existing analysis methods struggle with dense, noisy volumetric data. We introduce volumetric spatio-temporal image correlation spectroscopy (vSTICS), a framework that recovers voxel-resolved flow, diffusion coefficients, and particle densities from 3D fluorescence time series. Growing Camellia japonica pollen tubes were imaged with field-synthesis lattice light-sheet microscopy, and localized 3D spatio-temporal correlation analysis was applied to overlapping volumetric samples to generate maps of velocity, diffusion, and density. Validation with synthetic flow-diffusion simulations showed accurate recovery of seeded transport parameters, including velocities near $3$ $μ$m s$^{-1}$ and diffusion near $10^{-3}$ $μ$m$^2$ s$^{-1}$. Fluorescent microsphere experiments verified particle number and point spread function readouts and measured diffusion coefficients of $0.3 \pm 0.1$ $μ$m$^2$ s$^{-1}$ in gel, consistent with imaging-FCS measurements of $0.5 \pm 0.2$ $μ$m$^2$ s$^{-1}$. Applied to mitochondria in pollen tubes, vSTICS resolved a bidirectional reverse-fountain pattern with slower anterograde transport ($0.1$-$1$ $μ$m s$^{-1}$) and faster retrograde motion peaking near $3$ $μ$m s$^{-1}$, plus a retrograde corridor about $2$ $μ$m wide. Density and diffusion maps indicated a denser, more advective core and higher peripheral diffusion. High-density sub-diffraction vesicle mapping produced similar velocity landscapes with about ten-fold higher particle densities. These results establish vSTICS as a practical method for quantitative 3D mapping of intracellular transport and refines the reverse-fountain model by revealing asymmetric, predominantly transverse circulation.
Genomic foundation models trained on DNA sequences have demonstrated remarkable capabilities across diverse biological tasks, from variant effect prediction to genome design. These models are typically trained on massive, publicly sourced genomic datasets comprising trillions of nucleotide tokens, which renders them intrinsically susceptible to errors, artifacts, and adversarial issues embedded in the training data. Unlike natural language, DNA sequences lack the semantic transparency that might allow model makers to filter out corrupted entries, making genomic training corpora particularly susceptible to undetected manipulation. While training data poisoning has been established as a credible threat to large language models, its implications for genomic foundation models remain unexplored. Here, we present the first systematic investigation of training data poisoning in genomic language models. We demonstrate two complementary attack vectors. First, we show that adversarially crafted sequences can selectively degrade generative behavior on targeted genomic contexts, with backdoor activation following a sigmoidal dose-response relationship and full implantation achieved at 1 percent cumulative poison exposure. Second, targeted label corruption of downstream training data can selectively compromise clinically relevant variant classification, demonstrated using BRCA1 variant effect prediction. Our results reveal that genomic foundation models are vulnerable to targeted data poisoning attacks, underscoring the need for data provenance tracking, integrity verification, and adversarial robustness evaluation in the genomic foundation model development pipeline.
As immunotherapies become standard cancer treatments, it is increasingly important to identify a patient's immune profile, which encompasses the activity of immune cells within the tumor microenvironment and the presence of specific biomarkers. However, we lack mechanistic explanations drivers of immune phenotypes. Despite advances in immune profiling with high-throughput sequencing, the mechanisms driving them remain unclear. This study aimed to identify novel, robust immune-related gene clusters (metagenes) and evaluate their prognostic significance and functional relevance across various pan-cancer types using a comprehensive computational pipeline. We acquired pan-cancer bulk RNA-seq and established immune subtypes from The Cancer Genome Atlas (TCGA). Using expression-based filtering and clustering of genes with ANOVA and Gaussian Mixture Model (GMM), we identified 48 unique metagenes. These metagenes achieved 87% accuracy in predicting the established subtypes. SHAP analysis revealed the most predictive metagenes per subtype, while functional enrichment analysis identified their associated pathways. Genes were ranked by differential expression between high- and low-expression groups. The metagenes revealed insights, including co-expression of immune activation and regulatory factors, links between cell cycle regulation and immune evasion, and dynamic microenvironment remodeling signatures. Kaplan-Meier survival analysis and multivariate Cox Regression revealed that many metagenes had prognostic value for overall survival. Overall, the metagenes represent coordinated biological programs across diverse cancer types, providing a foundation for developing robust, broadly applicable immuno-oncology biomarkers that extend beyond single-gene markers. They demonstrate prognostic value across cancer types and hold potential to guide immunotherapy treatment decisions.
Body Mass Index (BMI) is a widely accessible but imprecise proxy of cardiometabolic health. While assessing true body composition is superior, gold-standard methods like Dual-Energy X-ray Absorptiometry (DXA) are not scalable. We address this gap by developing and validating "PhotoScan," a method to estimate body composition from smartphone imagery. We pretrained a deep learning model on UK Biobank participants (N=35,323) and fine-tuned on a newly recruited clinical cohort (PhotoBIA cohort, N=677) with diverse ethnicity, age, and body fat distribution, achieving high accuracy against DXA for total body fat percentage (BF%, MAE = 2.15%), Android-to-Gynoid fat ratio (A/G, MAE = 0.11), and visceral-to-subcutaneous fat area ratio (V/S, MAE = 0.09). Generalizability of the model was demonstrated on an independent metabolic health study cohort (MetabolicMosaic cohort, N=132 participants), achieving MAEs of 2.13% for BF%, 0.09 for A/G, and 0.09 for V/S. We then evaluated the clinical utility of these metrics in the MetabolicMosaic cohort by predicting insulin resistance (IR). Adding PhotoScan-derived body composition metrics to baseline demographics model (Age, Sex, BMI) significantly improved insulin resistance classification (Area Under the Receiver Operating Characteristic Curve "AUROC" 76.0% vs 69.2%, DeLong test p=0.002, Net Reclassification Index "NRI" 0.593). Crucially, this accessible smartphone method achieved performance nearly equivalent to adding clinical-grade DXA data to baseline demographics model (AUROC 77.3% vs 69.2%, DeLong test p=0.004, NRI 0.748). These findings demonstrate that smartphone-based phenotyping captures clinically meaningful risk signals missed by BMI and anthropometrics, offering a scalable alternative to DXA for cardiometabolic risk stratification.
2603.26110The rapid scaling of Protein Language Models (PLMs) has unlocked unprecedented accuracy in protein structure prediction and design, but the quadratic memory growth of the Key-Value (KV) cache during inference remains a prohibitive barrier for single-GPU deployment and high-throughput generation. While 8-bit quantization is now standard, 3-bit quantization remains elusive due to severe numerical outliers in activations. This paper presents TurboESM, an adaptation of Google's TurboQuant to the PLM domain. We solve the fundamental incompatibility between Rotary Position Embeddings (RoPE) and orthogonal transformations by deriving a RoPE-first rotation pipeline. We introduce a head-wise SVD calibration method tailored to the amino acid activation manifold, a dual look-up table (LUT) strategy for asymmetric K/V distributions, and a 1-bit Quantized Johnson-Lindenstrauss (QJL) residual correction. All experiments are conducted on ESM-2 650M, where our implementation achieves a 7.1x memory reduction (330 MB to 47 MB) while maintaining cosine similarity > 0.96 in autoregressive decoding across diverse protein families, including short peptides, transmembrane helices, enzyme active site fragments, and intrinsically disordered regions. We further implement a Triton-based fused decode attention kernel that eliminates intermediate dequantization memory allocations, achieving a 1.96x speedup over the PyTorch two-step path for the KV fetch operation alone; however, TurboESM incurs a prefill overhead of 21-27 ms relative to the original model due to KV quantization and packing, making it most suitable for memory-bound scenarios rather than latency-critical short-sequence workloads. Analysis reveals that PLMs exhibit sharper outlier profiles than large language models (LLMs) due to amino acid vocabulary sparsity, and our method effectively addresses these distributions.
Protein structural ensembles from NMR spectroscopy capture biologically important conformational heterogeneity, but it remains difficult to determine whether observed variation reflects coordinated motion or noise-like artifacts. We evaluate the Spectral Coherence Index (SCI), a model-free, rotation-invariant summary derived from the participation-ratio effective rank of the inter-model pairwise distance-variance matrix. Under grouped primary analysis of a Main110 cohort of 110 NMR ensembles (30--403 residues; 10--30 models per entry), SCI separated experimental ensembles from matched synthetic incoherent controls with AUC-ROC $= 0.973$ and Cliff's $δ= -0.945$. Relative to an internal 27-protein pilot, discrimination softened modestly, showing that pilot-era thresholds do not transfer perfectly to a larger, more heterogeneous cohort: the primary operating point $τ= 0.811$ yielded 95.5\% sensitivity and 89.1\% specificity. PDB-level sensitivity remained nearly unchanged (AUC $= 0.972$), and an independent 11-protein holdout reached AUC $= 0.983$. Across 5-fold grouped stratified cross-validation and leave-one-function-class-out testing, SCI remained strong (AUC $= 0.968$ and $0.971$), although $σ_{R_g}$ was the stronger single-feature discriminator and a QC-augmented multifeature model generalized best (AUC $= 0.989$ and $0.990$). Residue-level validation linked SCI-derived contributions to experimental RMSF across 110 proteins and showed broad concordance with GNM-based flexibility patterns. Rescue analyses showed that Main110 softening arose mainly from size and ensemble normalization rather than from loss of spectral signal. Together, these results establish SCI as an interpretable, bounded coherence summary that is most useful when embedded in a multimetric QC workflow for heterogeneous protein ensembles.
Prediction of genetic biomarkers, e.g., microsatellite instability in colorectal cancer is crucial for clinical decision making. But, two primary challenges hamper accurate prediction: (1) It is difficult to construct a pathology-aware representation involving the complex interconnections among pathological components. (2) WSIs contain a large proportion of areas unrelated to genetic biomarkers, which make the model easily overfit simple but irrelative instances. We hereby propose a Dictionary-based hierarchical pathology mining with hard-instance-assisted classifier Debiasing framework to address these challenges, dubbed as D2Bio. Our first module, dictionary-based hierarchical pathology mining, is able to mine diverse and very fine-grained pathological contextual interaction without the limit to the distances between patches. The second module, hard-instance-assisted classfier debiasing, learns a debiased classifier via focusing on hard but task-related features, without any additional annotations. Experimental results on five cohorts show the superiority of our method, with over 4% improvement in AUROC compared with the second best on the TCGA-CRC-MSI cohort. Our analysis further shows the clinical interpretability of D2Bio in genetic biomarker diagnosis and potential clinical utility in survival analysis. Code will be available at https://github.com/DeepMed-Lab-ECNU/D2Bio.
We present efficient approaches for extracting spaced k-mers from nucleotide sequences. They are based on bit manipulation instructions at CPU level, making them both simpler to implement and up to an order of magnitude faster than existing methods. We further evaluate common pitfalls in k-mer processing, which can cause major inefficiencies. Combined, our approaches allow the utilization of spaced k-mers in high-performance bioinformatics applications without major performance degradation, offering a throughput of up to 750MB of sequence data per second per core. Availability: The implementation in C++20 is published under the MIT license, and freely available at https://github.com/lczech/fisk