On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks

Benjamin K. Rosenzweig, Matthew W. Hahn

TL;DR

The paper tackles learning phylogenetic distances with neural networks when ground-truth data are scarce, focusing on distance-based approaches that feed a neighbor-joining step. It introduces symmetry-preserving, permutation-invariant architectures, including Euclidean embedding and inner-product variants, plus taxa-wise attention and spatially-aware components, designed to scale to large sets of taxa. Through simulations under the JC, HKY, and LG models with indels, the authors show that learned distances can rival or exceed simple pairwise metrics and approach state-of-the-art likelihood methods in many conditions, while maintaining a smaller computational footprint. They also provide theoretical motivations from Bourgain's embedding theory and Mercer's theorem, discuss the tradeoffs of information sharing across taxa, and outline practical limitations and avenues for future enhancements, such as handling concatenated genomic blocks and long-branch attraction effects.

Abstract

Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early success on this task, these have been supplanted by Bayesian and maximum likelihood search procedures based on complex models of molecular evolution. In this work we describe minimal neural network architectures that can approximate classic phylogenetic distance functions and the properties required to learn distances under a variety of molecular evolutionary models. In contrast to model-based inference (and recently proposed model-free convolutional and transformer networks), these architectures have a small computational footprint and are scalable to large numbers of taxa and molecular characters. The learned distance functions generalize well and, given an appropriate training dataset, achieve results comparable to state-of-the-art inference methods.

Paper Structure

This paper contains 28 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A 6-layer, 779-parameter ELU network was trained for 50 epochs on 20-taxa JC alignments with the Hamming distance $d_H$ as the function's argument. Both the network $\hat{d}_{JC}$, trained with $d_{JC}$ as the target, and the network $\hat{d}_{T}$, trained on the expected divergence $d_T$, closely approximate the function $-3\ln(1-4x/3)/4$. Both networks predict a constant value beyond the point $x=0.75$, at which the Jukes-Cantor distance goes to infinity. This is appropriate behavior for a phylogenetic reconstruction algorithm. For comparison, the order 50 Maclaurin series (a function with 100 parameters) provides a similar approximation of the log function but rapidly diverges at $x=0.75$. The fact that the learned transformations outperform maximum likelihood distances suggests that these transformations convey additional information about the evolutionary process, such as a 'prior' on the BD process generating trees in the training set. In this case, the ceiling learned by the $\hat{d}_T$ network is $4.840$, higher than the mean diameter of trees in the training dataset ($3.697$) but considerably less than their maximum diameter ($19.145$).
  • Figure 2: Scaling of performance with input size for LG networks. Top: $d_{RF}$ decreases with sequence length. Bottom: $d_{RF}$ increases as the number of taxa increases. In both panels, results are averaged across 500 trees; error bars represent the interquartile range (the middle $50\%$).
  • Figure 3: Distribution of tree diameters.
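
The classic function that the networks in Figure 1 are trained to approximate can be sketched in a few lines. The block below is an illustration only, not the authors' code: it computes the standard Jukes-Cantor distance $-\tfrac{3}{4}\ln(1-\tfrac{4}{3}x)$ from a Hamming fraction $x$, alongside a truncated Maclaurin series of the same function. The function names (`hamming_fraction`, `jc_distance`, `jc_maclaurin`) and the choice to return infinity at saturation are assumptions made for this sketch.

```python
import math


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc_distance(x: float) -> float:
    """Jukes-Cantor distance -(3/4) * ln(1 - 4x/3).

    Assumption for this sketch: saturated inputs (x >= 0.75) map to
    infinity, whereas the learned networks in Figure 1 plateau instead.
    """
    if x >= 0.75:
        return math.inf
    return -0.75 * math.log1p(-4.0 * x / 3.0)


def jc_maclaurin(x: float, order: int = 50) -> float:
    """Truncated Maclaurin series of the JC distance.

    Uses -ln(1 - u) = sum_{k>=1} u^k / k with u = 4x/3.
    """
    u = 4.0 * x / 3.0
    return 0.75 * sum(u**k / k for k in range(1, order + 1))


if __name__ == "__main__":
    x = hamming_fraction("ACGTACGT", "ACGAACGA")
    print(x, jc_distance(x), jc_maclaurin(x))
```

For small $x$ the series and the closed form agree closely; near $x=0.75$ the truncated series underestimates the closed form, which grows without bound, and this is the region where a learned ceiling changes the behavior of the distance.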