Determining distances and consensus between mutation trees

Luís Cunha, Jack Kuipers, Thiago Lopes

TL;DR

The paper proves that the median and closest problems for mutation trees are both NP-complete, even with only three input trees, and shows a fast way to find consensus trees that give better results than any tree in the input set while still preserving all internal structure.

Abstract

The mutational heterogeneity of tumours can be described with a tree representing the evolutionary history of the tumour. With noisy sequencing data there may be uncertainty in the inferred tree structure, and we may also wish to study patterns in the evolution of cancers in different patients. In such situations, understanding tree similarities is a key challenge, and we therefore present an approach to determine distances between trees. Considering trees of bounded height, we determine the distances associated with swap operations over strings. In general, by solving the Maximum Common Almost $v$-tree problem between two trees, we describe an efficient approach to determine the minimum number of operations to transform one tree into another. The inherent noise in current statistical methods for constructing mutation evolution trees of cancer cells presents a significant challenge: handling such collections of trees to determine a consensus tree that accurately represents the set, and evaluating the extent of their variability or dispersion. Given a set of mutation trees and a notion of distance, there are at least two natural ways to define the "target" tree: a min-sum (median tree) or a min-max (closest tree) of the set. Thus, taking a set of trees as input and addressing the Median and Closest problems, we prove that both problems are NP-complete, even with only three input trees. In addition, we develop algorithms to obtain upper bounds on the median and closest solutions, which we analyse in experiments on generated and on real databases. We show a fast way to find consensus trees with better results than any tree in the input set while still preserving all internal structure.
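To make the min-sum and min-max objectives concrete, here is a minimal sketch that scores each input tree against the whole set and returns the best one under each criterion. Everything in it is an assumption for illustration: the parent-map encoding `Tree`, the placeholder distance `parent_diff`, and the helper `best_input_tree` are hypothetical and are not the distance or the algorithms of the paper. Restricting the search to the input set only yields an upper bound on the true median or closest score, since a tree outside the set may do better.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

# Hypothetical encoding: node label -> parent label, with 0 standing for the root.
Tree = Dict[int, int]


def parent_diff(a: Tree, b: Tree) -> int:
    """Placeholder distance (NOT the paper's): labels whose parent differs.

    Assumes both trees carry exactly the same set of labels.
    """
    return sum(1 for node in a if a[node] != b[node])


def best_input_tree(
    trees: Sequence[Tree],
    dist: Callable[[Tree, Tree], int],
    aggregate: Callable[[Iterable[int]], int],
) -> Tuple[Tree, int]:
    """Return the input tree minimising aggregate(dist(candidate, t) for t in trees).

    aggregate=sum gives the min-sum (median) objective,
    aggregate=max gives the min-max (closest) objective.
    Ties are broken by position in the input sequence.
    """
    scored = [
        (aggregate(dist(candidate, t) for t in trees), index)
        for index, candidate in enumerate(trees)
    ]
    score, index = min(scored)
    return trees[index], score


# Three toy trees over the labels 1..4.
trees = [
    {1: 0, 2: 1, 3: 1, 4: 2},
    {1: 0, 2: 1, 3: 2, 4: 2},
    {1: 0, 2: 0, 3: 1, 4: 2},
]

median_tree, median_score = best_input_tree(trees, parent_diff, sum)
closest_tree, closest_score = best_input_tree(trees, parent_diff, max)
```

In this toy set the first tree wins under both objectives, but in general the min-sum and min-max choices can differ.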

Paper Structure

This paper contains 27 sections, 14 theorems, 3 equations, 11 figures, 2 tables.

Key Result

Proposition 1.2.1

The number of internal nodes of a tree is equal to the number of bracket pairs.
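The proposition can be checked on small examples with a short sketch. It assumes the bracket notation used in the Figure 2 caption, read so that every label is a leaf, every bracket pair encloses the children of one internal node, and commas are optional separators. That reading, and the helper names `parse_brackets` and `count_internal`, are assumptions of this sketch rather than definitions taken from the paper.

```python
import re
from typing import Tuple, Union

# A node is either a leaf label (str) or a tuple of child nodes.
Node = Union[str, Tuple["Node", ...]]


def parse_brackets(text: str) -> Node:
    """Parse a bracket string such as '((1,2)((4,3)5))' into nested tuples."""
    stack = [[]]  # each entry collects the children of a currently open bracket
    for token in re.findall(r"\(|\)|[^(),\s]+", text):
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced closing bracket")
            children = tuple(stack.pop())
            stack[-1].append(children)
        else:
            stack[-1].append(token)  # a leaf label
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced top-level tree")
    return stack[0][0]


def count_internal(node: Node) -> int:
    """Count internal (non-leaf) nodes in the parsed tree."""
    if isinstance(node, str):
        return 0
    return 1 + sum(count_internal(child) for child in node)


example = "((1,2)((4,3)5))"
tree = parse_brackets(example)
internal_nodes = count_internal(tree)
bracket_pairs = example.count("(")
```

For this example both counts are 4: the root, the node above 1 and 2, the node above 3 and 4, and the node joining that pair with 5.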

Figures (11)

  • Figure 1: (a) Representation of tumour evolution. Each star represents a new mutation and an expansion of a subclone. The circles represent single cells sequenced after tumour removal, and the stars inside them indicate which mutations are present in each cell. (b) Mutation matrix representing the mutation status of the sequenced tumour cells. A zero entry denotes the absence of a mutation in the respective cell, while a one denotes its presence. $0$, $1$ and ? denote false negative, false positive and missing data, respectively, that may occur in a real scenario. (c) Ideal mutation matrix representing the mutation status of the sequenced tumour cells.
  • Figure 2: Rooted trees and their Newick format strings of length $5$. (a) Tree associated with $((1,2)((4,3)5))$, which is equal to the tree depicted in (b) associated with $((1,2)((3,4)5))$. Both trees are distinct from the tree depicted in (c) associated with $((1,5)((3,4)2))$.
  • Figure 3: (a) Illustration from $(i)$ to $(iv)$ of moves associated with elements $a$ and $b$ from $B_1$ to $B_2$. The moves of $c$ and $d$ follow similarly. (b) Illustration of some moves associated with layers ($l_i$ is the $i$th layer for $i = 1, \ldots, J$), where in a bottom-up process we combine leaves from $v_1$ to $v_k$ and move them all together from $B_1$ and $B_2$.
  • Figure 4: Illustration of some moves associated with layers, where in a bottom-up process we combine leaves from $v_1$ to $v_k$ and move them all together from $B_1$ and $B_2$.
  • Figure 5: $(i)$ The solution tree for MCAT is shown in blue, with its root as node $v$, and $v_1$ and $v_2$ as its children. Note that there is no node $u$ in $T$ that lies outside the solution and whose path to $v$ passes through a descendant of $v$. Therefore, the node $u$ shown in the figure does not exist, so it is safe to contract the subtree in blue. $(ii)$ The node in blue is the representative node of $H_v^{v_1, \cdots, v_j}$ after its contraction. Moreover, each descendant of $v$ that is not part of the MCAT solution remains the same as in $(i)$, as does the node $z$.
  • ...and 6 more figures
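The equality stated in the Figure 2 caption (strings (a) and (b) describe the same tree, string (c) a different one) can be tested by rewriting each string into a canonical form in which the children of every node are sorted. The sketch below assumes that sibling order is irrelevant, that labels are leaves, and that commas are optional separators. The function name `canonical` is hypothetical, and this is a standard canonical-form check, not the distance computation of the paper.

```python
import re


def canonical(newick: str) -> str:
    """Rewrite a bracket string with the children of every node sorted.

    Two strings describe the same tree up to sibling order exactly when
    their canonical forms are identical.
    """
    stack = [[]]  # canonical strings of the children of each open bracket
    for token in re.findall(r"\(|\)|[^(),\s]+", newick):
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced closing bracket")
            children = sorted(stack.pop())
            stack[-1].append("(" + ",".join(children) + ")")
        else:
            stack[-1].append(token)  # a leaf label is its own canonical form
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced top-level tree")
    return stack[0][0]


tree_a = "((1,2)((4,3)5))"
tree_b = "((1,2)((3,4)5))"
tree_c = "((1,5)((3,4)2))"

same_ab = canonical(tree_a) == canonical(tree_b)
same_ac = canonical(tree_a) == canonical(tree_c)
```

Here `same_ab` is true and `same_ac` is false, matching the caption. Sorting uses plain string order, which is sufficient because any consistent total order on the child strings yields a valid canonical form.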

Theorems & Definitions (37)

  • Proposition 1.2.1
  • Proposition 1.3.1: $\star$
  • proof
  • Lemma 1.3.1: $\star$
  • proof
  • Theorem 1.3.1: $\star$
  • proof
  • Theorem 1.3.2: $\star$
  • proof
  • Lemma 1.3.2
  • ...and 27 more