On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks
Benjamin K. Rosenzweig, Matthew W. Hahn
TL;DR
The paper tackles learning phylogenetic distances with neural networks when ground-truth data are scarce, focusing on distance-based approaches that feed a neighbor-joining step. It introduces symmetry-preserving, permutation-invariant architectures, including Euclidean embedding and inner-product variants, plus taxa-wise attention and spatially aware components, designed to scale to large taxa sets. Through simulations under the JC, HKY, and LG models with indels, the authors show that learned distances can rival or exceed simple pairwise metrics and approach state-of-the-art likelihood methods in many conditions, while maintaining a smaller computational footprint. They also provide theoretical motivations from Bourgain's embedding theorem and Mercer's theorem, discuss the tradeoffs of information sharing across taxa, and outline practical limitations and avenues for future enhancements, such as handling concatenated genomic blocks and long-branch attraction effects.
Abstract
Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early success on this task, these have been supplanted by Bayesian and maximum likelihood search procedures based on complex models of molecular evolution. In this work we describe minimal neural network architectures that can approximate classic phylogenetic distance functions and the properties required to learn distances under a variety of molecular evolutionary models. In contrast to model-based inference (and recently proposed model-free convolutional and transformer networks), these architectures have a small computational footprint and are scalable to large numbers of taxa and molecular characters. The learned distance functions generalize well and, given an appropriate training dataset, achieve results comparable to state-of-the-art inference methods.
