Poisson process factorization for mutational signature analysis with genomic covariates

Alessandro Zito, Giovanni Parmigiani, Jeffrey W. Miller

TL;DR

Poisson process factorization (PPF) models mutations with an inhomogeneous Poisson point process and infers mutational signatures and their activities as they vary across the genome, addressing the failure of the usual NMF-based analysis to account for heterogeneous mutation rates.

Abstract

Mutational signatures are powerful summaries of the mutational processes altering the DNA of cancer cells. The usual approach to mutational signature analysis consists of decomposing the matrix of mutation counts from a sample of patients using non-negative matrix factorization (NMF). However, this ignores the heterogeneous patterns of mutation rates along the genome. In this paper, we introduce Poisson process factorization (PPF), which addresses this limitation by employing an inhomogeneous Poisson point process model to infer mutational signatures and their activities as they vary across the genome. PPF generalizes the baseline NMF model by representing a patient's exposure to each signature as a locus-specific function that depends on genomic covariates and patient-specific copy numbers via a log-linear model. This quantifies the relationships between genomic features and mutational signatures, and enables attribution of individual mutations to signatures. We develop tractable algorithms for maximum a posteriori estimation and posterior inference via Markov chain Monte Carlo. We demonstrate the method on simulated data and real data from breast cancer, using genomic covariates representing histone modifications, cell replication timing, nucleosome positioning, and DNA methylation.
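As a rough illustration of the log-linear exposure described in the abstract, the sketch below evaluates a binned version of such an intensity and the resulting probabilities of attributing a mutation to each signature. The functional form `R[i,k] * Theta[k,j] * C[j,b] * exp(X[b] @ B[:,k])`, the binning, and all names (`ppf_intensity`, `attribution_probs`, the array shapes) are assumptions made for this example; the page does not reproduce the paper's intensity equation, so this should not be read as the authors' implementation.

```python
import numpy as np

def ppf_intensity(R, Theta, B, X, C):
    """Per-signature intensity on a binned genome (assumed form, see lead-in).

    R:     (I, K) signatures, one column per signature
    Theta: (K, J) baseline activities per signature and patient
    B:     (P, K) regression coefficients per covariate and signature
    X:     (n_bins, P) genomic covariates per bin
    C:     (J, n_bins) patient-specific copy numbers per bin
    Returns an array of shape (I, J, n_bins, K).
    """
    modulation = np.exp(X @ B)  # (n_bins, K), log-linear covariate effect
    # contrib[i, j, b, k] = R[i, k] * Theta[k, j] * C[j, b] * modulation[b, k]
    return np.einsum("ik,kj,jb,bk->ijbk", R, Theta, C, modulation)

def attribution_probs(contrib):
    """Probability that a mutation of channel i, patient j, bin b came from signature k."""
    return contrib / contrib.sum(axis=-1, keepdims=True)

# Synthetic inputs, purely for illustration.
rng = np.random.default_rng(0)
I, K, J, P, n_bins = 6, 2, 3, 2, 5
R = rng.dirichlet(np.ones(I), size=K).T        # columns sum to one
Theta = rng.gamma(2.0, 1.0, size=(K, J))
B = rng.normal(0.0, 0.5, size=(P, K))
X = rng.normal(size=(n_bins, P))
C = np.full((J, n_bins), 2.0)                  # diploid everywhere

contrib = ppf_intensity(R, Theta, B, X, C)
probs = attribution_probs(contrib)
```

Setting `B` to zero removes the dependence on covariates, so every bin has the same intensity, which is the sense in which a model of this kind contains the baseline NMF model as a special case.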

Paper Structure

This paper contains 34 sections, 3 theorems, 48 equations, 10 figures, 4 tables, 3 algorithms.

Key Result

Proposition 1

Consider a collection of inhomogeneous Poisson point processes $Z_{ij}$ on $[0,T)$ with intensity functions $\lambda_{i j}$ as in eq:Intensity_function. If $\vartheta_{kj}(t) = \theta_{kj}/T$ for all $t \in [0, T)$ and all $i,j,k$, then inference for $r_{ik}$ and $\theta_{kj}$ is equivalent under eq
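The statement above is truncated on this page, and the intensity function it refers to is not reproduced here. As a hedged numerical illustration, suppose (an assumption, not the paper's stated equation) that $\lambda_{ij}(t) = \sum_k r_{ik}\,\vartheta_{kj}(t)$. With $\vartheta_{kj}(t) = \theta_{kj}/T$ the intensity is constant, its integral over $[0,T)$ is $(R\Theta)_{ij}$, and the point-process log-likelihood differs from the Poisson NMF count log-likelihood only by a term that does not involve $R$ or $\Theta$. The sketch below checks this on synthetic data; the function names are hypothetical.

```python
import math
import numpy as np

def pp_loglik(R, Theta, counts, T):
    """Log-likelihood of Poisson processes with constant intensity (R @ Theta) / T on [0, T).

    For a constant intensity, only the number of events in each process matters:
    sum_n log(lambda) - integral of lambda = N * log(rate / T) - rate.
    """
    rate = R @ Theta
    return float(np.sum(counts * np.log(rate / T) - rate))

def nmf_loglik(R, Theta, counts):
    """Log-likelihood of independent Poisson counts with mean R @ Theta (Poisson NMF)."""
    rate = R @ Theta
    log_fact = np.vectorize(math.lgamma)(counts + 1.0)  # log(N!)
    return float(np.sum(counts * np.log(rate) - rate - log_fact))

rng = np.random.default_rng(1)
I, K, J, T = 5, 2, 4, 1000.0

def random_params():
    R = rng.dirichlet(np.ones(I), size=K).T    # (I, K)
    Theta = rng.gamma(2.0, 5.0, size=(K, J))   # (K, J)
    return R, Theta

R_a, Theta_a = random_params()
R_b, Theta_b = random_params()
counts = rng.poisson(R_a @ Theta_a)

# The gap between the two log-likelihoods is the same at any parameter value,
# so both are maximized (and yield the same posterior) at the same R and Theta.
gap_a = pp_loglik(R_a, Theta_a, counts, T) - nmf_loglik(R_a, Theta_a, counts)
gap_b = pp_loglik(R_b, Theta_b, counts, T) - nmf_loglik(R_b, Theta_b, counts)
```

Under these assumptions the gap equals $\sum_{ij} \left[\log N_{ij}! - N_{ij}\log T\right]$, which depends on the data and $T$ but not on the parameters.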

Figures (10)

  • Figure 1: (a) Number of single-base substitutions in 1-megabase bins in 113 breast adenocarcinomas from the International Cancer Genome Consortium (ICGC). Alternating colored bands and numbers indicate chromosomes. (b) Top: Number of mutations in 2-kilobase bins for region chrX:88,000,001-89,000,000, highlighted by the red point. Bottom: Average signal (fold-change with respect to assay background) in the same bins for histone modification H3K9me3, in breast epithelium tissue; see Supplementary material.
  • Figure 2: Results of the simulation described in \ref{subsec:Simulation_setup} in terms of log(RMSE) for the parameters of the simulation. Boxplots display the values from 20 randomly generated datasets under both scenarios. The panels display the log(RMSE) of $(\hat{R}, R_0)$, $(\hat{\Theta}, \Theta_0)$, $(\hat{B}, B_0)$, and $(\hat{\Lambda}, \Lambda_0)$.
  • Figure 3: True and predicted total number of mutations from the PPF model at the megabase scale. Left: total mutations (points) and estimated total intensity (line); alternating colors and bands indicate chromosomes. Right: scatterplot of true and predicted values. a. PPF predictions using covariates from \ref{tab:Epi_covariates} and data on individual copy numbers. b. PPF predictions using individual copy numbers only, without covariates.
  • Figure 4: a. Posterior mean for the de novo mutational signatures, with gray bars indicating 95% credible intervals. Numbers in parentheses on the left indicate the highest cosine similarity in the COSMICv3.4 catalog. b. Posterior mean of the relevance weights $\hat{\mu}_k$ associated with each signature, with larger point size indicating larger values. Color intensity denotes the number of mutations in the data assigned to the signature via \ref{eq:AssignentProbs}. c. Posterior mean of the regression coefficients, $\hat{\boldsymbol{\beta}}_k$. Gray cells with no number indicate estimates for which the 95% credible interval contains zero. d. Baseline activities $\hat{\theta}_{kj}$ adjusted by total copy numbers. Columns correspond to patients, split by the clusters inferred from the normalized activities.
  • Figure 5: a. Regression coefficients and relevance weights for the signatures in the analysis. Gray boxes indicate entries where the 95% posterior credible interval contains zero. The $\times$ marks indicate cases where $\hat{\mu}_k \approx \varepsilon$. b. Number of mutations attributed to a signature in the standard NMF case (y-axis) and in the PPF model (x-axis). d. Top panel: mutation intensity for sbs3, sbs5, and sbs8 on substitution type A[C$>$A]C in patient DO220823 in genomic region chr1:15000000-16000000. The three ticks indicate the exact mutation position. Solid and dashed lines indicate the intensity from the PPF and the CompNMF model with fixed signatures, respectively. Bottom: standardized values for the covariates in the region.
  • ...and 5 more figures

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • proof: Proof of \ref{pro:baseline}
  • proof: Proof of \ref{pro:superposition}
  • Lemma 1
  • proof