Identification in source apportionment using geometry

Bora Jin, Abhirup Datta

TL;DR

This work addresses identifiability in source apportionment, modeled as $Y=WH$, by defining a population-level source attribution percentage matrix $\Phi$ that is scale-invariant and identifiable under weak probabilistic separability with stationary ergodic emissions. It develops a geometric, convex-hull-based estimator: as the sample size $n$ grows, the convex hull of the row-normalized data $Y^*$ converges to the hull of the true factor rows $\mathcal{H}^*$, enabling consistent recovery of $H^*$ via a maximum-volume $K$-vertex approach. A consistent estimator of the factor means $\widetilde{\mu}$ is constructed and used to obtain $\widehat{\Phi}$ up to permutation of sources, without requiring sparsity or fixed scaling. Numerical experiments show that the proposed estimator converges to the truth as $n$ increases, for both stationary ergodic and i.i.d. emission processes, underscoring the practical feasibility of geometric identifiability for policy-relevant attribution tasks.
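The geometric idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it replaces any hull computation with a brute-force search over all $K$-subsets of sample points, which is only practical for small $n$ and $K$. The factor matrix `H`, the Dirichlet and gamma emission model, and the function names are hypothetical choices made for illustration.

```python
from itertools import combinations

import numpy as np


def row_normalize(M):
    """Scale each row to sum to one (rows assumed non-negative and nonzero)."""
    return M / M.sum(axis=1, keepdims=True)


def max_volume_vertices(points, K):
    """Return the K sample points spanning the largest (K-1)-simplex.

    Brute force over all K-subsets. The squared simplex volume is
    proportional to the Gram determinant of the edge vectors, so the
    argmax of the determinant identifies the maximum-volume subset.
    """
    idx = np.array(list(combinations(range(len(points)), K)))  # (m, K)
    edges = points[idx[:, 1:]] - points[idx[:, :1]]             # (m, K-1, J)
    gram = np.einsum("mij,mkj->mik", edges, edges)              # (m, K-1, K-1)
    best = np.argmax(np.linalg.det(gram))
    return points[idx[best]]


# Hypothetical setup: K = 3 sources, J = 4 pollutants, n = 150 observations.
rng = np.random.default_rng(0)
n, K = 150, 3
H = np.array([[8.0, 1.0, 1.0, 2.0],
              [1.0, 6.0, 2.0, 1.0],
              [2.0, 1.0, 1.0, 7.0]])

# Assumed emission model: Dirichlet weights with small concentration put mass
# near each pure source; a random positive intensity rescales every row.
W = rng.dirichlet(np.full(K, 0.2), size=n) * rng.gamma(2.0, 1.0, size=(n, 1))

Y = W @ H                      # observed concentrations, n x J
Y_star = row_normalize(Y)      # each row is a convex combination of rows of H_star
H_star = row_normalize(H)      # true row-normalized factor rows
H_star_hat = max_volume_vertices(Y_star, K)   # estimate, up to row permutation
```

Row normalization removes the per-observation intensity, so every row of `Y_star` lies in the convex hull of the rows of `H_star`. When some observations are dominated by a single source, the maximum-volume simplex among the sample points has vertices close to those rows.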

Abstract

Source apportionment analysis, which aims to quantify the attribution of observed concentrations of multiple air pollutants to specific sources, can be formulated as a non-negative matrix factorization (NMF) problem. However, NMF is non-unique and typically relies on unverifiable assumptions such as sparsity and uninterpretable scalings. In this manuscript, we establish identifiability of the source attribution percentage matrix under much weaker and more realistic conditions. We introduce the population-level estimand for this matrix, and show that it is scale-invariant and identifiable even when the NMF factors are not. Viewing the data as a point cloud in a conical hull, we show that a geometric estimator of the source attribution percentage matrix is consistent without any sparsity or parametric distributional assumptions, and while accommodating spatio-temporal dependence. Numerical experiments corroborate the theory.
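Scale invariance can be checked numerically in a small sketch. This excerpt does not state the formula for the source attribution percentage matrix, so the code assumes one natural form, $\Phi_{kj} = \mu_k H_{kj} / \sum_{l} \mu_l H_{lj}$ with $\mu_k$ the mean emission of source $k$. That form, the function name, and all numbers below are assumptions for illustration.

```python
import numpy as np


def attribution_percentages(mu, H):
    """Share of pollutant j attributed to source k (assumed form):
    Phi[k, j] = mu[k] * H[k, j] / sum_l mu[l] * H[l, j].
    """
    contrib = mu[:, None] * H                      # mean contribution of each source
    return contrib / contrib.sum(axis=0, keepdims=True)


# Hypothetical factor means and factor matrix (K = 3 sources, J = 4 pollutants).
mu = np.array([2.0, 0.5, 1.0])
H = np.array([[8.0, 1.0, 1.0, 2.0],
              [1.0, 6.0, 2.0, 1.0],
              [2.0, 1.0, 1.0, 7.0]])

# Rescale the factorization: W -> W D and H -> D^{-1} H for a positive
# diagonal D. The product W H is unchanged, the means become mu * d, and
# the rows of H are divided by d.
d = np.array([10.0, 0.1, 3.0])
Phi = attribution_percentages(mu, H)
Phi_rescaled = attribution_percentages(mu * d, H / d[:, None])
```

Under this assumed form, the diagonal scaling cancels in every product $\mu_k H_{kj}$, so `Phi` and `Phi_rescaled` agree even though the two factorizations have different $W$ and $H$.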

Paper Structure

This paper contains 14 sections, 5 theorems, 36 equations, 9 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Suppose each data point $Y_i$ admits two NMF representations $Y_i^\top=W_i^\top H=\widetilde{W}_i^\top \widetilde{H}$, where $H$ and $\widetilde{H}$ are $K\times J$ non-negative matrices, $\{W_i\}_{i\ge1}$ and $\{\widetilde{W}_i\}_{i\ge1}$ are $K$-dimensional processes satisfying Assumptions asm:erg

Figures (9)

  • Figure 1: Geometric representation of multipollutant source apportionment
  • Figure S1: Box plots of NRMSE (left) and NFD (right) for $\widehat{\Phi}$ over 50 replicates as a function of $n$. The boxes display the interquartile range with median (red line) and whiskers to the minimum and maximum after outlier removal.
  • Figure S2: Scatter plots of the $K\times J=24$ elements of true versus estimated $\Phi$ over 50 replicates, with the 45-degree line in red.
  • Figure S3: Heat maps of the true $\Phi$ (left) and the estimate $\widehat{\Phi}$ (right) for a randomly selected replicate for $n=300$ (top), $n=100000$ (middle), and $n=500000$ (bottom).
  • Figure S4: Sample hull of $Y^*$ with the true $H^*$ in red dots and the estimated $\widehat{H}^*$ in blue triangles for a randomly selected replicate for $n=300$.
  • ...and 4 more figures

Theorems & Definitions (11)

  • Definition 1: Conical hull and extremal rays
  • Theorem 1: Statistical identifiability
  • Theorem 2: Hausdorff consistency of the sample convex hull
  • Corollary 1: Consistency of the maximum-volume $K$-vertex estimator
  • Proposition 1
  • Lemma S1
  • proof
  • proof: Proof of Theorem \ref{thm:welldefn}
  • proof: Proof of Theorem \ref{thm:hull}
  • proof: Proof of Corollary \ref{cor:maxvol}
  • ...and 1 more