Modeling the mutational dynamics of very short tandem repeats

Amos Onn, Tzipy Marx, Liming Tao, Tamir Biezuner, Ehud Shapiro, Christoph A. Klein, Peter F. Stadler

Abstract

Short tandem repeats (STRs) are low-entropy regions in the genome, consisting of a short (1-6 bp) unit that is consecutively repeated multiple times. They are known for high mutational instability, due to so-called stutter mutations, in which the number of units in the run increases or decreases. In particular, STRs with a repeat unit length of 1-2 bp are prone to mutate even within several cell divisions. The extremely rapid accumulation of variation makes them interesting phylogenetic markers for retrospective single-cell lineage reconstruction. Here we model their mutational dynamics at the level of individual repeat unit types and then aggregate length variations over many STR loci with the aim of obtaining a very fast "molecular clock". We calibrate our model based on several datasets with known lineage structure prepared from cultured cells. We find that the mutational dynamics of STRs are reasonably consistent for a given cell line, but vary among different ones. This suggests that the dynamics are not entirely explained by mutations in caretaker genes; rather, various other factors play a role, possibly tissue origin and differentiation state. Further data and research are necessary to assess their relative effects.

Paper Structure

This paper contains 5 sections, 1 theorem, 8 equations, 4 figures, 1 table.

Key Result

lemma 1

Let $\mathbf{\tilde{R}}$ be a symmetric rate matrix satisfying $\mathbf{\tilde{R}}\mathbf{1}=\mathbf{o}$, let $\mathbf{p}$ be a strictly positive (row) vector and set $\mathbf{R}\mathrel{:=} \mathop{diag}{(\mathbf{p})}^{-1}\mathbf{\tilde{R}}$. Then $\mathbf{p}$ is a (left) eigenvector of $\exp(t\mat
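
The lemma statement is cut off in this excerpt, but its visible hypotheses already imply $\mathbf{p}\mathbf{R} = \mathbf{1}^{\top}\mathbf{\tilde{R}} = \mathbf{o}$, and hence $\mathbf{p}\exp(t\mathbf{R}) = \mathbf{p}$ for every $t$. The following is a minimal numerical sketch of that consequence in plain Python. The three-state chain, the weights, the vector `p`, and all function names are hypothetical illustration values, not taken from the paper, and the matrix exponential is approximated by a truncated Taylor series, which is adequate only for small matrices such as this one.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def vec_mat(v, m):
    """Row vector times matrix."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) for j in range(n)]


def mat_exp(m, t, terms=60):
    """Approximate exp(t * m) by a truncated Taylor series (small matrices only)."""
    n = len(m)
    scaled = [[t * x for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term now holds (t * m)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def rate_matrix(weights, p):
    """Build R = diag(p)^{-1} R~ from symmetric off-diagonal weights.

    R~ is symmetric with zero row sums; `weights` maps an index pair (i, j)
    to the shared off-diagonal entry R~[i][j] = R~[j][i].
    """
    n = len(p)
    sym = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        sym[i][j] = sym[j][i] = w
    for i in range(n):
        sym[i][i] = -sum(sym[i][j] for j in range(n) if j != i)
    return [[sym[i][j] / p[i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical strictly positive vector and stepwise (nearest-neighbour) weights.
    p = [0.5, 0.3, 0.2]
    R = rate_matrix({(0, 1): 0.4, (1, 2): 0.1}, p)
    for t in (0.1, 1.0, 2.0):
        print(t, vec_mat(p, mat_exp(R, t)))
```

Each printed vector should agree with `p` up to floating-point error, which is the sense in which $\mathbf{p}$ is left unchanged by $\exp(t\mathbf{R})$.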

Figures (4)

  • Figure 1: Stationary distribution of STR length for the repeat unit types $\tau\in\{\mathtt{A},\mathtt{AC},\mathtt{AG},\mathtt{AT}\}$ in the reference genome hg38.
  • Figure 2: Distribution of locus-specific rates $\mu$ estimated independently for the eight data sets and grouped by cell line: the three DU145 trees are subsets of the same tree; the three HESC trees were generated separately; the two HCT116 trees are grouped together, despite a genetic modification in the cell line seeding HCT116-MSS. See also Table 1. Top row: raw coefficients; bottom row: coefficients scaled by the slope coefficients of linear regression within each group.
  • Figure 3: Linear regressions of the locus-specific rate parameters $\mu(\ell)$ between pairs of samples: R-values (left) and estimated slopes (right). The slopes are column against row, so that a higher-than-$1$ slope means the tree of the column mutates more quickly than that of the row.
  • Figure 4: Optimised rate-matrix model parameters, estimated separately for each tree and repeat unit type.

Theorems & Definitions (2)

  • lemma 1
  • proof