Flexible inference of evolutionary accumulation dynamics using uncertain observational data

Jessica Renz, Morten Brun, Iain G. Johnston

TL;DR

This article introduces HyperLAU, a new algorithm for hypercubic inference that can learn evolutionary pathways from datasets containing uncertainties. A case study on multidrug resistance in tuberculosis shows that HyperLAU accepts more flexible data and provides new information about evolutionary pathways compared to existing approaches.

Abstract

Understanding and predicting evolutionary accumulation pathways is a key objective in many fields of research, ranging from classical evolutionary biology to diverse applications in medicine. In this context, we are often confronted with the problem that data is sparse and uncertain. To make the best possible use of the available data, inference approaches that can handle this uncertainty are required. One approach that can use not only cross-sectional data, but also phylogenetically related and longitudinal data, is 'hypercubic inference' modelling. In this article we introduce HyperLAU, a new algorithm for hypercubic inference that makes it possible to use datasets including uncertainties for learning evolutionary pathways. Expanding the flexibility of accumulation modelling, HyperLAU allows us to infer dynamic pathways and interactions between features, even when large sets of particular features are unobserved across the source dataset. We show that HyperLAU is able to highlight the main pathways found by other tools, even when up to 50% of the features in the input data are uncertain. Additionally, we demonstrate how it can help to overcome possible biases that can occur when the data used are reduced by excluding uncertain parts. We illustrate the approach with a case study on multidrug resistance in tuberculosis, showing that HyperLAU allows more flexible data and provides new information about evolutionary pathways compared to existing approaches.

Paper Structure

This paper contains 9 sections, 18 equations, 10 figures, 1 algorithm.

Figures (10)

  • Figure 1: HyperLAU workflow. Learning evolutionary trajectories on a hypercube, based on data that contains uncertainties. (A) Dataset (structure can be cross-sectional or longitudinal) that contains information about the presence (red/dark) or absence (green/gradient) of certain features. White boxes indicate missing/uncertain information. (B) Translation of the data into binary barcodes, 1 = presence of the feature, 0 = absence. The uncertain positions are represented by a '?'. (C) The dataset in the form of barcode pairs is given to the HyperLAU algorithm, which consists of the optimization of a likelihood function, whose calculation is based on linear algebra. (D) The HyperLAU algorithm learns the evolutionary pathways and outputs the transition probabilities and fluxes, which can be used to make predictions.
  • Figure 2: Visualization of evolutionary pathways inferred by HyperLAU based on some toy examples. Plots show inferred transition networks through the evolutionary state space from 000... (top) to 111... (bottom) (as in Figure \ref{['graph_abstr']}D). In all plots, the thickness of the edges represents the probability flux between the corresponding state nodes (all coloured edges have minimum 0.05). Coefficient of variation (CV) is illustrated by the colour. A: Toy dataset including the data points 0?0 - ?00, 10? - 1?0 and 11? - ?11, under model $F$ (i), model 1 (ii) and model 2 (iii), showing a clear path 000-100-110-111 supported in all cases. B: Artificially generated data, where pairs of features influence the occurrence of others (as used in aga_hypertraps-ct_2024). HyperLAU model $F$ (i) reproduces established evolutionary pathways from HyperTraPS-CT (aga_hypertraps-ct_2024) (ii). The structure of these inferred pathways under model $F$ remains robust when around 40% of the data for a particular feature are made uncertain (iii). Inference using model 2 (more restricted, pairwise interaction) is forced to approximate the higher-order true dynamics (iv), and also remains robust when 40% of the data for a particular feature are made uncertain (v). Results from other artificial-obscuring protocols are shown in Figures \ref{['other_features_model-1']} and \ref{['other_features_model2']}. The likelihood traces throughout the optimisation processes are shown in Figures \ref{['lik_toy1']}-\ref{['lik_features_model2']}.
  • Figure 3: Visualisation of evolutionary pathways learned by HyperLAU based on the tuberculosis dataset (casali_evolution_2014). Plots show inferred transition networks through the evolutionary state space from 000... (top) to 111... (bottom) (as in Figure \ref{['graph_abstr']}D). In all plots, the thickness of the edges represents the probability flux between the corresponding state nodes (all coloured edges have minimum 0.05). Coefficient of variation (CV) is illustrated by the colour. Key states are labelled by the decimal representation of their binary labels (see below). A: Inference using original dataset with no uncertainties. B: Inference using artificially obscured dataset, where every position of the original dataset was replaced by a '?' with probability 0.5. Some key nodes: 256 = 0100000000 (RIF), 512 = 1000000000 (INH), 768 = 1100000000 (INH+RIF), 800 = 1100100000 (INH+RIF+STR), 864 = 1101100000 (INH+RIF+EMB+STR), 992 = 1111100000 (INH+RIF+PZA+EMB+STR).
  • Figure 4: Inference of anti-microbial evolution in tuberculosis based on a full dataset including uncertainties. (A) Visualisation of the used dataset from casali_evolution_2014 embedded in a phylogeny. Each row in the matrix corresponds to a bacterial isolate that is a tip in the phylogeny. Each column in the matrix describes resistance to a different drug: red fields in the profile represent missing data, white fields indicate the absence of resistance, black fields indicate the presence of resistance. (B) Illustration of the transitions (not the individual isolate profiles) present in this dataset. For each drug, pie segment colour describes the 'before' state of that drug in a transition, and the labels on the circumference describe the 'after' state. The size of each pie section gives the proportion of transitions for that before-after combination. (C) Transition network visualisation of the fluxes learned by HyperLAU through the evolutionary state space from 000... (top) to 111... (bottom) (as in Figure \ref{['graph_abstr']}D). The thickness of the edges represents the strength of the flux. The coefficient of variation (CV) is illustrated by the colour. Edges are labelled by the drug resistance that is gained in that step. Only edges with a flux $>0.01$ are shown. Illustration of the bootstrap uncertainty for the first steps is given in Figure \ref{['tb_features']}.
  • Figure 5: HyperLAU case studies in photosynthesis and reductive mitochondrial evolution. Transition networks inferred using HyperLAU, styled as in Fig. \ref{['results_independent']}C. A: $C_4$ photosynthesis case study, with data from williams2013phenotypic (and references therein). Features (where 'specificity' means cell-type specificity): (1) large bundle sheath (BS) cells; (2) GDC cell specificity; (3) vein spacing; (4) decarboxylase specificity; (5) PPDK specificity; (6) BS chloroplast number; (7) GDC abundance; (8) PPDK abundance; (9) decarboxylase abundance; (10) RuBisCO abundance. B: Mitochondrial reduction case study, with data from glastad2025convergent (and references therein). Features: electron transport chain complexes (1) CI; (2) CII; (3) CIII; (4) CIV; (5) CV; (6) pyruvate dehydrogenase; (7) mtDNA; (8) citric acid cycle steps; (9) iron-sulfur metabolism.
  • ...and 5 more figures
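
The barcode conventions used in the captions above (binary strings with '?' for uncertain positions in Figure 1, decimal node labels in Figure 3) can be illustrated with a minimal Python sketch. This is not the HyperLAU implementation; the function names are hypothetical and chosen only for illustration. It converts between decimal labels and fixed-width barcodes, and lists the fully specified states that an uncertain barcode is compatible with.

```python
from itertools import product


def label_to_barcode(label: int, n_features: int) -> str:
    """Decimal node label -> zero-padded binary barcode (hypothetical helper)."""
    return format(label, f"0{n_features}b")


def barcode_to_label(barcode: str) -> int:
    """Fully specified binary barcode -> decimal node label (hypothetical helper)."""
    return int(barcode, 2)


def compatible_states(barcode: str) -> list[str]:
    """List every fully specified state that matches a barcode containing '?'.

    Each '?' may stand for either 0 or 1, so a barcode with k uncertain
    positions is compatible with 2**k states.
    """
    unknown = [i for i, c in enumerate(barcode) if c == "?"]
    states = []
    for fill in product("01", repeat=len(unknown)):
        chars = list(barcode)
        for i, bit in zip(unknown, fill):
            chars[i] = bit
        states.append("".join(chars))
    return states


if __name__ == "__main__":
    # Node labels from the Figure 3 caption, using 10 features.
    for label in (256, 512, 768, 800, 864, 992):
        print(label, label_to_barcode(label, 10))
    # An uncertain observation from the Figure 2 toy dataset.
    print(compatible_states("0?0"))
```

For example, `label_to_barcode(800, 10)` gives `1100100000`, matching the INH+RIF+STR node in Figure 3, and `compatible_states("0?0")` gives the two states `000` and `010`. Because the number of compatible states doubles with each '?', explicit enumeration of this kind is only practical for small numbers of uncertain positions.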

Theorems & Definitions (1)

  • Definition 2.1