An efficient solution to Hidden Markov Models on trees with coupled branches

Farzan Vafa, Sahand Hormoz

TL;DR

The paper addresses Hidden Markov Models on trees with coupled branches, a realistic setting for biological lineages where sister cells share dependencies. It develops a dynamic-programming framework that extends forward-backward and Viterbi-style methods to trees with coupling. The cost is $O(|T|N^{n+1})$, where $|T|$ is the number of nodes, $N$ the number of hidden states, and $n$ the number of children per node, which gives $O(|T|N^3)$ for binary trees. Scaling is built in to avoid underflow. An EM-based learning procedure estimates $a$, $b$, and $\pi$ with explicit, numerically stable update formulas, and a set of self-consistency checks tests the model assumptions. The work includes a Python implementation and simulations that show reliable parameter recovery on lineage-like data.

Abstract

Hidden Markov Models (HMMs) are powerful tools for modeling sequential data, where the underlying states evolve in a stochastic manner and are only indirectly observable. Traditional HMM approaches are well-established for linear sequences, and have been extended to other structures such as trees. In this paper, we extend the framework of HMMs on trees to address scenarios where the tree-like structure of the data includes coupled branches -- a common feature in biological systems where entities within the same lineage exhibit dependent characteristics. We develop a dynamic programming algorithm that efficiently solves the likelihood, decoding, and parameter learning problems for tree-based HMMs with coupled branches. Our approach scales polynomially with the number of states and nodes, which makes it computationally feasible for a wide range of applications, and it does not suffer from the underflow problem. We demonstrate our algorithm by applying it to simulated data and propose self-consistency checks for validating the assumptions of the model used for inference. This work not only advances the theoretical understanding of HMMs on trees but also provides a practical tool for analyzing complex biological data where dependencies between branches cannot be ignored.
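
To make the likelihood problem concrete, here is a minimal sketch of a scaled upward recursion on a binary tree with coupled children. It is not the authors' implementation. It assumes that $\pi$ is the root distribution, that $a$ is a joint transition table with `a[i][j][k]` the probability that the two children are in states `j` and `k` given a parent in state `i`, and that emissions are Gaussian (as in the simulations). The function names, the dict-based tree layout, and the argument order are hypothetical. Each interior node costs $N^3$ operations, and normalizing the message at every node while accumulating the log of the normalizers keeps the computation away from underflow.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used here as the emission model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def log_likelihood(children, obs, pi, a, means, stds, root=0):
    """Log-likelihood of all observations on a binary tree.

    children: dict mapping each interior node to its (left, right) children
    obs:      dict mapping each node to its observed value
    pi:       root state probabilities, length N
    a:        a[i][j][k] = P(children in states j, k | parent in state i)
    """
    n_states = len(pi)
    log_scale = 0.0  # running sum of the logs of the per-node normalizers

    def upward(node):
        nonlocal log_scale
        # Start from the emission likelihood of this node for each state.
        beta = [gaussian_pdf(obs[node], means[i], stds[i])
                for i in range(n_states)]
        if node in children:
            left, right = children[node]
            beta_l, beta_r = upward(left), upward(right)
            # Sum over joint child states: N^2 terms for each of N parent states.
            for i in range(n_states):
                beta[i] *= sum(a[i][j][k] * beta_l[j] * beta_r[k]
                               for j in range(n_states)
                               for k in range(n_states))
        # Rescale so the message sums to one, remembering the factor.
        c = sum(beta)
        log_scale += math.log(c)
        return [x / c for x in beta]

    beta_root = upward(root)
    return log_scale + math.log(sum(p * x for p, x in zip(pi, beta_root)))

# Toy example: a root with two perfectly coupled children (illustrative values).
children = {0: (1, 2)}
obs = {0: 0.1, 1: 1.9, 2: 2.2}
pi = [0.3, 0.7]
a = [[[0.8, 0.0], [0.0, 0.2]],
     [[0.1, 0.0], [0.0, 0.9]]]
means, stds = [0.0, 2.0], [1.0, 0.5]
print(log_likelihood(children, obs, pi, a, means, stds))
```

The sketch raises an error if some observation has zero likelihood under every state, and it recurses once per generation, so very deep trees would need an iterative traversal.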

Paper Structure (24 sections, 70 equations, 5 figures)

Figures (5)

  • Figure 1: Diagram of a rooted tree $T$ where 0 denotes the root. A. The leaves $T_L = \{2, 4, 5, 6, 9, 10, 11\}$ and the interior $T_I = \{0, 1, 3, 7, 8\}$. As an example for node $7$, the subtree rooted at node $7$ is $T_R(7) = \{7, 8, 9, 10, 11\}$, its parent is $\operatorname{p}(7) = 3$, its children are $\operatorname{ch}(7) = \{8, 9\}$, its sibling is $\operatorname{s}(7) = 6$, and its grandchildren are $\operatorname{gch}(7) = \{10, 11\}$. The set of its descendants $T_D(7)$ is marked by the blue box and the complement $T_{\overline D}(7)$ by the red box. B. Schematic of an HMT, with nodes labeled 0 to 6, where circles represent hidden states and squares represent observations. For each node $C$, the observation is $O(C) = X_C$ and the hidden state is $h(C) = H_C$.
  • Figure 2: A. The model parameters used to generate the simulated trees. There are two hidden states, shown as orange and blue circles. The root node of each tree is assigned one of the hidden states with the shown probabilities. The observed value of each node is drawn from a Gaussian distribution whose mean and standard deviation are determined by the hidden state of the node. The hidden states of the children are assigned probabilistically, conditional on the state of the parent node, with the transition probabilities shown. We chose the transition probabilities such that sibling nodes are always in the same hidden state (that is, they are perfectly coupled). B. Examples of simulated trees with the hidden state of each node visualized.
  • Figure 3: A. Learned initial probabilities of the hidden state of the root of the trees after 20 iterations of our expectation maximization algorithm applied to 150 simulated trees of 5 generations. The learned shedding probabilities and transition probabilities as a function of the number of iterations are shown in panels B and C respectively. In all panels, the dashed lines show the true parameter value.
  • Figure 4: A. Learned initial probabilities of the hidden state of the root of the trees after 80 iterations of our expectation maximization algorithm with three hidden states applied to 150 simulated trees of 5 generations. The three hidden states 0, 1, and 2 are shown as orange, green, and blue respectively. The learned shedding probabilities as a function of the number of iterations are shown in panels B and C, and the learned transition probabilities as a function of the number of iterations are shown in panels D-F. In all panels, the dashed lines show the true parameter value. Panels D, E, and F show the inferred transition rates from a parent node in states 0, 1, and 2 respectively to all possible combinations of states for the children. All transition rates vanish except for the one allowed transition.
  • Figure 5: A. Learned initial probabilities of the hidden state of the root of the trees after 35 iterations of our expectation maximization algorithm with two hidden states applied to 150 simulated trees of 5 generations. The learned shedding probabilities as a function of the number of iterations are shown in panels B and C, and the learned transition probabilities as a function of the number of iterations are shown in panels D and E. F. Pair-wise linear Pearson correlation coefficients for the true model vs the learned two- and three-state models. Inset: the lineage distance $(m,n)$ for two nodes denotes the distances $m$ and $n$ from each node to the two nodes' most recent common ancestor. For example, sibling nodes are denoted as (1,1) and parent-child nodes as (0,1). The correlations predicted by the 3-state inferred model are consistent with those of the data, whereas those from the 2-state model are not.
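
The tree relations in the Figure 1 caption and the lineage distance $(m,n)$ in the Figure 5 caption can be checked with a short sketch. The parent map below contains only the edges that the Figure 1 caption states explicitly (the edges near the root are not given there, so they are left out), and the helper names are hypothetical, not taken from the paper's code.

```python
# Partial parent map: only the edges stated in the Figure 1 caption.
PARENT = {6: 3, 7: 3, 8: 7, 9: 7, 10: 8, 11: 8}

def ancestors(node, parent):
    """Return [node, p(node), p(p(node)), ...] up to the topmost known node."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def children(node, parent):
    """Set of nodes whose parent is `node`."""
    return {c for c, p in parent.items() if p == node}

def subtree(root, parent):
    """All known nodes that have `root` on their path to the top, root included."""
    nodes = set(parent) | set(parent.values())
    return {n for n in nodes if root in ancestors(n, parent)}

def lineage_distance(u, v, parent):
    """Distances (m, n) from u and v to their most recent common ancestor."""
    path_v = ancestors(v, parent)
    depth_in_v = {node: d for d, node in enumerate(path_v)}
    # Walk up from u; the first node also on v's path is the common ancestor.
    for m, node in enumerate(ancestors(u, parent)):
        if node in depth_in_v:
            return m, depth_in_v[node]
    raise ValueError("nodes do not share a known ancestor")

print(children(7, PARENT))              # {8, 9}
print(sorted(subtree(7, PARENT)))       # [7, 8, 9, 10, 11]
print(lineage_distance(6, 7, PARENT))   # (1, 1): siblings
print(lineage_distance(3, 7, PARENT))   # (0, 1): parent and child
print(lineage_distance(10, 6, PARENT))  # (3, 1)
```

The order of the pair follows the order of the arguments, so swapping the two nodes swaps $m$ and $n$.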