Orchard: building large cancer phylogenies using stochastic combinatorial search

E. Kulman; R. Kuang; Q. Morris

TL;DR

Orchard addresses the challenge of reconstructing large cancer phylogenies from bulk sequencing data by framing the mixed sample perfect phylogeny problem (MSPP) in terms of a tractable factorized approximate posterior $Q^{\pi}(B|D)$ and sampling mutation trees sequentially using Ancestral Gumbel-Top-k. It introduces a supervariant approximation to handle large mutation counts and a phylogeny-aware clustering step to infer subclones from the reconstructed trees. Empirical results on 90 simulated cancers and 14 B-ALL data sets show that Orchard consistently achieves better data fit and more plausible tree topologies, scaling to as many as 1,000 mutations and enabling downstream analysis of somatic mutation patterns. The approach helps identify subclonal structure and informs downstream clinical and evolutionary analysis; the authors acknowledge runtime challenges at very large scales and suggest local-update improvements and probabilistic refinement of clustering as future directions.

Abstract

Phylogenies depicting the evolutionary history of genetically heterogeneous subpopulations of cells from the same cancer, i.e., cancer phylogenies, offer valuable insights about cancer development and guide treatment strategies. Many methods exist that reconstruct cancer phylogenies using point mutations detected with bulk DNA sequencing. However, these methods become inaccurate when reconstructing phylogenies with more than 30 mutations, or, in some cases, fail to recover a phylogeny altogether. Here, we introduce Orchard, a cancer phylogeny reconstruction algorithm that is fast and accurate using up to 1000 mutations. Orchard samples without replacement from a factorized approximation of the posterior distribution over phylogenies, a novel result derived in this paper. Each factor in this approximate posterior corresponds to a conditional distribution for adding a new mutation to a partially built phylogeny. Orchard optimizes each factor sequentially, generating a sequence of incrementally larger phylogenies that ultimately culminate in a complete tree containing all mutations. Our evaluations demonstrate that Orchard outperforms state-of-the-art cancer phylogeny reconstruction methods in reconstructing more plausible phylogenies across 90 simulated cancers and 14 B-progenitor acute lymphoblastic leukemias (B-ALLs). Remarkably, Orchard accurately reconstructs cancer phylogenies using up to 1,000 mutations. Additionally, we demonstrate that the large and accurate phylogenies reconstructed by Orchard are useful for identifying patterns of somatic mutations and genetic variations among distinct cancer cell subpopulations.
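Sampling without replacement from a discrete distribution, as the abstract describes, can be done with the Gumbel-Top-k trick: add independent Gumbel noise to each log probability and keep the k largest perturbed values. The snippet below is a minimal sketch of that general trick for a single categorical distribution, not Orchard's implementation. The function name `gumbel_top_k`, its arguments, and the toy probabilities are assumptions made for illustration; Orchard applies the ancestral variant across the sequence of conditional factors, which this sketch does not attempt.

```python
import math
import random


def gumbel_top_k(log_probs, k, rng):
    """Return k distinct indices sampled without replacement.

    Hypothetical helper: perturbs each log probability with
    independent Gumbel(0, 1) noise and keeps the k largest values.
    """
    perturbed = []
    for index, log_p in enumerate(log_probs):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() can return exactly 0.0
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((log_p + gumbel_noise, index))
    # Largest perturbed log probability first.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


# Toy example (assumed values): three candidate placements for a mutation.
rng = random.Random(0)
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
picks = gumbel_top_k(log_probs, 2, rng)
print(picks)  # two distinct indices out of {0, 1, 2}
```

With k = 1 this reduces to the Gumbel-Max trick, so the first returned index is an ordinary sample from the categorical distribution.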

Paper Structure (46 sections, 62 equations, 13 figures, 5 tables, 1 algorithm)

The cancer-specific mixed sample perfect phylogeny problem
The input to Orchard
Mutation tree representations of perfect phylogenies
Motivating the approximate posterior
Orchard algorithm
Background on Ancestral Gumbel-Top-k trick
Computing the "shifted" perturbed log probabilities of extensions
Stochastic beam search
Approximating the MLE of the clonal proportion matrix
Approximating the complete data likelihood
Phylogeny-aware clustering
Evaluation Overview
Orchard produces better reconstructions than competing methods on 90 simulated cancers
Orchard reconstructs more plausible trees on 14 B-progenitor acute lymphoblastic leukemias
Calculating Orchard's Inputs
...and 31 more sections

Figures (13)

Figure 1: Example of Orchard's mutation tree search with $k=2$, $f=\infty$. Mutation trees are depicted using genotype matrices. Search begins with a genotype matrix $B^{(1)}$ containing the first mutation in $\pi$. During each iteration, the best tree $t^{(\ell)}$ is popped from the queue and extended. The extensions are scored and reintroduced into the queue. Only the $k$ trees with the highest scores in the queue are kept, while others are discarded. The bars next to each genotype matrix indicate its perturbed log probability, $G_{\phi}$. Bars with grey fill correspond to the top-$k$ trees that are retained and extended. Genotype matrices within dashed boxes denote parts of the search space that are not explored further. Orchard's best reconstructed tree can then be input into the phylogeny-aware clustering algorithm. This algorithm conducts agglomerative clustering on the mutation trees to produce a set of clone trees. Each clone tree's set of clones is scored, and the algorithm yields the clone tree that minimizes the Generalized Information Criterion (GIC). See the section "Phylogeny-aware clustering" and the appendix on phylogeny-aware clustering for complete details.
Figure 2: Evaluation of reconstructions for 90 simulated mutation trees. Results are grouped by the size of the simulated mutation trees (rows), i.e., the problem size. a. Bar plots show the percentage of data sets where a method produces at least one valid tree. A red $x$ means the method did not succeed on any of the data sets for that problem size. A red arrow means the results for the method on a problem size occur beyond the x-axis limit. The distributions, represented by box plots, in (b,c,d) only include data sets where the method was successful. b. The distribution of log perplexity ratios, a measure of VAF data fit. Ratios are relative to the perplexity of the ground truth mutation frequency matrix $F^{(\text{true})}$, and can be negative. Lower log perplexity ratios indicate better reconstructions. c. The distributions of relationship reconstruction loss for each method on a problem size. This loss can range between zero bits (complete match of pairwise relationships) and one bit (complete mismatch of pairwise relationships). d. The distributions of wall clock run time.
Figure 3: Evaluation of reconstructions by Orchard and Pairtree for SJBALL022611. a. Log perplexity ratio for the trees reconstructed by Orchard and Pairtree as a function of the number of samples. Orchard's reconstructions are accurate regardless of the number of samples provided, while Pairtree's reconstructions worsen with more samples. b,c. Absolute difference between the VAFs inferred by Orchard and Pairtree and the VAFs implied by the bulk data. Large values indicate divergence between the VAFs inferred by a method and the VAFs implied by the data. VAFs inferred by Orchard adhere very closely to the data, while also adhering to the ISA. Pairtree's poorly reconstructed tree results in inaccurate VAF estimates for many mutations. The same row in each heatmap corresponds to the same unique mutation, and each column corresponds to the same unique sample.
Figure A4: Counterexample where $Q^{\pi}(B|D)$ may fail to find an optimal tree even if mutations are added before their ancestors. a. The mutation order $\pi$ guarantees that mutations are added before their ancestors. b. Adding mutation $2$ as a descendant of $1$ adheres perfectly to the ISA, but if the data for mutation $3$ is observed then it must be the case that mutations $1$ and $2$ are on separate branches. c. The breakdown of the plausible pairwise relationships between each pair of mutations $\left\{(1,2), (1,3), (2,3)\right\}$. There is only one possible mutation tree structure implied by these pairwise relationships, and $Q^{\pi}(B|D)$ may fail to recover it.
Figure A5: The percentage of extensions of each rank that led to the tree with the largest likelihood across all data sets in the validation set. The validation set consisted of 200 simulated cancers originally from myers_calder_2019. The grey dotted line represents the percentage of extensions that we would expect for each rank if they were randomly chosen, assuming only a linear tree structure. The red dotted line is a Gaussian distribution fit to the rank data.
...and 8 more figures
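
The search loop described in the Figure 1 caption (extend trees with the next mutation, score the extensions, keep only the top $k$) can be sketched as a plain beam search. Everything in the snippet below is an illustrative assumption rather than Orchard's code: trees are stored as parent vectors instead of genotype matrices, `isa_score` is a toy single-sample score that penalizes violations of the constraint that a parent's frequency is at least the sum of its children's frequencies, and the search is deterministic. It omits the Gumbel perturbation, the branching factor $f$, and Orchard's actual likelihood.

```python
ROOT = -1  # hypothetical marker for the germline root


def isa_score(parents, freqs):
    """Toy score: negative total amount by which children's
    frequencies exceed their parent's frequency (0.0 is best)."""
    child_sum = {ROOT: 0.0}
    for node in range(len(parents)):
        child_sum[node] = 0.0
    for node, parent in enumerate(parents):
        child_sum[parent] += freqs[node]
    # The root is treated as having frequency 1.0.
    violation = max(0.0, child_sum[ROOT] - 1.0)
    for node in range(len(parents)):
        violation += max(0.0, child_sum[node] - freqs[node])
    return -violation


def beam_search(freqs, k):
    """Add mutations one at a time, keeping the k best partial trees."""
    beam = [((ROOT,), isa_score((ROOT,), freqs))]
    for new in range(1, len(freqs)):
        candidates = []
        for parents, _ in beam:
            # Attach the new mutation under the root or any placed mutation.
            for parent in [ROOT] + list(range(new)):
                extension = parents + (parent,)
                candidates.append((extension, isa_score(extension, freqs)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]  # discard everything outside the top k
    return beam


# Assumed toy frequencies for four mutations in one sample.
best_parents, best_score = beam_search([0.9, 0.5, 0.3, 0.2], k=2)[0]
print(best_parents, best_score)
```

Because only leaf attachments are enumerated here, the sketch cannot insert a new mutation between an existing parent and child; a fuller version would need those placements as well.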

Theorems & Definitions (2)

Definition A4.1: $F$ Properties
Definition A6.1: Pairwise Evolutionary Relationships
