Solving the Tree Containment Problem Using Graph Neural Networks

Arkadiy Dushatskiy; Esther Julien; Leen Stougie; Leo van Iersel

TL;DR

This paper tackles the NP-complete Tree Containment problem in phylogenetics by introducing Combine-GNN, a method that merges a phylogenetic network and a tree into a single display graph and processes it with a direction-aware Graph Neural Network (Dir-GNN). The approach enables inductive learning, generalizing to larger leaf sets than those seen during training, and delivers high accuracy (over $95\%$) on instances with up to 100 leaves. Empirical results show Combine-GNN outperforms baselines, including a feature-based XGBoost and a Siamese GNN that uses leaf labels, while maintaining robust performance in inductive scenarios and offering favorable runtime characteristics compared to exact algorithms. The work highlights the potential of GNNs to address complex phylogenetic problems and outlines avenues for extending to non-binary networks and related containment tasks.

Abstract

Tree Containment is a fundamental problem in phylogenetics useful for verifying a proposed phylogenetic network, representing the evolutionary history of certain species. Tree Containment asks whether the given phylogenetic tree (for instance, constructed from a DNA fragment showing tree-like evolution) is contained in the given phylogenetic network. In the general case, this is an NP-complete problem. We propose to solve it approximately using Graph Neural Networks. In particular, we propose to combine the given network and the tree and apply a Graph Neural Network to this network-tree graph. This way, we achieve the capability of solving the tree containment instances representing a larger number of species than the instances contained in the training dataset (i.e., our algorithm has the inductive learning ability). Our algorithm demonstrates an accuracy of over $95\%$ in solving the tree containment problem on instances with up to 100 leaves.
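The combination step described above, where the network and the tree are merged into one graph that shares the leaves, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `combine_network_and_tree`, the edge-list input format, and the per-node `origin` tag are all assumptions made for the example. Leaf labels are used only to identify which nodes to merge and are not emitted as node features.

```python
def combine_network_and_tree(network_edges, tree_edges, leaves):
    """Merge a network and a tree into one directed graph with shared leaves.

    Hypothetical helper: edges are (parent, child) pairs and `leaves` is the
    common set of leaf labels. Internal nodes are prefixed so that equal
    names in the network and the tree do not collide.
    """
    leaves = set(leaves)

    def relabel(node, prefix):
        # Leaves get one shared identity; internal nodes stay separate.
        return ("leaf", node) if node in leaves else (prefix, node)

    edges = [(relabel(u, "N"), relabel(v, "N")) for u, v in network_edges]
    edges += [(relabel(u, "T"), relabel(v, "T")) for u, v in tree_edges]

    nodes = sorted({n for edge in edges for n in edge})
    index = {n: i for i, n in enumerate(nodes)}

    # Integer edge list (direction preserved) for a direction-aware GNN.
    edge_index = [(index[u], index[v]) for u, v in edges]
    # Assumed illustrative tag: where each node came from (no leaf labels).
    origin = [n[0] for n in nodes]
    return nodes, edge_index, origin


# Toy instance with leaves a, b, c; node "h" is a reticulation (in-degree 2).
network = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
           ("v", "h"), ("v", "c"), ("h", "b")]
tree = [("r", "x"), ("r", "c"), ("x", "a"), ("x", "b")]
nodes, edge_index, origin = combine_network_and_tree(network, tree, "abc")
```

In the combined graph, every leaf has one incoming edge from the network and one from the tree, which is what ties the two structures together.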

Paper Structure (19 sections, 3 equations, 10 figures, 4 tables)

Introduction
Related work
Preliminaries
Phylogenetic networks
Tree containment problem
Solving the tree containment problem using a GNN
The summary of our approach
Node features
Time complexity
Experiments
Data
Baselines
GNN architecture and training hyperparameters tuning
Performance evaluation
Results
...and 4 more sections

Figures (10)

Figure 1: Examples of tree containment problem instances: a tree (a) and two networks, one of which (b) contains the tree and one of which (c) does not. Leaves are labeled with letters (representing a set of taxa). The mapping between the edges of the tree (a) and the paths of the network (b) (as explained in the "Tree containment problem" section) is depicted with dashed curves of corresponding colors. For the second network (c), no such mapping exists.
Figure 2: The schematic illustration of our approach (Combine-GNN). To solve a tree containment problem for the given network and tree, we start by combining them in a single graph such that the leaves are shared, respecting their labels. Then, a GNN is applied. We note that the input node features do not contain the leaves' labels. The GNN consists of multiple layers, then the obtained embeddings are concatenated, and a node aggregation (graph readout) is applied. Finally, an MLP is used to produce the final prediction.
Figure 3: Main results: performance of our algorithm and the baselines in terms of (balanced) accuracy on different datasets; the left part of the graph shows the results in the inductive setting, the right part shows the results in the transductive setting. For each dataset and each algorithm, we perform five runs with different seeds. Bar heights denote the average values; error bars denote the $95\%$ confidence intervals.
Figure 4: Performance (balanced accuracy) for test datasets containing instances with an increasing number of leaves. The leftmost point in each line shows the performance in the transductive setup; the remaining ones show the performance in the inductive setup. The number of leaves in each test instance is sampled uniformly at random from the specified interval. The results are averaged over five runs with different seeds.
Figure 5: Performance (balanced accuracy) for training datasets of different sizes. The results are averaged over five inductive and five transductive learning setups with different numbers of leaves, and over five runs with different seeds for each of them.
...and 5 more figures
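The figure captions above report balanced accuracy. As a reminder of the standard definition for binary labels (the mean of the true-positive rate and the true-negative rate), here is a minimal sketch; the function name and the 0/1 label encoding (1 = tree is contained) are assumptions for illustration, and the paper's own evaluation code is not shown in this summary.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels encoded as 0/1."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    # Correct predictions within each class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Three contained instances (one missed) and one non-contained instance.
score = balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0])
```

Unlike plain accuracy, this metric is not inflated when one class dominates the test set.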
