The Normalized Cross Density Functional: A Framework to Quantify Statistical Dependence for Random Processes

Bo Hu; Jose C. Principe

TL;DR

A new symmetric and self-adjoint cross density kernel is defined through a recursive bidirectional statistical mapping between conditional densities of continuous random processes, which estimates their statistical dependence.

Abstract

This paper presents a novel approach to measuring statistical dependence between two random processes (r.p.) using a positive-definite function called the Normalized Cross Density (NCD). NCD is derived directly from the probability density functions of two r.p. and constructs a data-dependent Hilbert space, the Normalized Cross-Density Hilbert Space (NCD-HS). By Mercer's Theorem, the NCD norm can be decomposed into its eigenspectrum, which we name the Multivariate Statistical Dependence (MSD) measure, and their sum, the Total Dependence Measure (TSD). Hence, the NCD-HS eigenfunctions serve as a novel embedded feature space, suitable for quantifying r.p. statistical dependence. In order to apply NCD directly to r.p. realizations, we introduce an architecture with two multiple-output neural networks, a cost function, and an algorithm named the Functional Maximal Correlation Algorithm (FMCA). With FMCA, the two networks learn concurrently by approximating each other's outputs, extending the Alternating Conditional Expectation (ACE) for multivariate functions. We mathematically prove that FMCA learns the dominant eigenvalues and eigenfunctions of NCD directly from realizations. Preliminary results with synthetic data and medium-sized image datasets corroborate the theory. Different strategies for applying NCD are proposed and discussed, demonstrating the method's versatility and stability beyond supervised learning. Specifically, when the two r.p. are high-dimensional real-world images and a white uniform noise process, FMCA learns factorial codes, i.e., the occurrence of a code guarantees that a specific training set image was present, which is important for feature learning.

Paper Structure (23 sections, 11 theorems, 56 equations, 10 figures, 3 tables, 1 algorithm)

Introduction
Measuring Statistical Dependence with a Novel Variational Approach
Bidirectional recursion, normalized cross density, and dependence measure
Learning framework with r.p. realizations
Solving the variational eigenproblem with functional maximal correlation algorithm
Rényi's maximal correlation is the second largest eigenvalue of NCD
Functional maximal correlation algorithm
FMCA learns CDR's orthonormal decomposition
FMCA applications
Conventional classification and regression
Markov chain aggregation
Learning r.p. encoders of images
Experiments for measuring statistical dependence
Experiments for Markov chain aggregation
Experiments for coding real-world images
...and 8 more sections

Key Result

Lemma 2

(NCD definition) The functions $K(x, x') = \frac{p(x, x')}{p^{\frac{1}{2}}(x) p^{\frac{1}{2}}(x')}$ and $K(u, u') = \frac{p(u, u')}{p^{\frac{1}{2}}(u) p^{\frac{1}{2}}(u')}$ are positive-definite kernel functions, referred to as the Normalized Cross Density (NCD). Each function defines a linear opera
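The kernel definition above can be illustrated with a small discrete analog. The following is a minimal sketch, not the authors' implementation: it assumes a finite joint probability table over $(x, u)$ with strictly positive marginals, and it assumes that $p(x, x')$ is obtained by marginalizing the bidirectional recursion over $u$, i.e. $p(x, x') = \sum_u p(x, u)\, p(x', u) / p(u)$. The function name `ncd_matrix` and the example table are hypothetical.

```python
import numpy as np

def ncd_matrix(joint):
    """Discrete analog of K(x, x') for a joint pmf table over (x, u).

    Assumption: p(x, x') = sum_u p(x, u) p(x', u) / p(u), and both
    marginals are strictly positive.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()          # normalize to a valid pmf
    p_x = joint.sum(axis=1)              # marginal of x (rows)
    p_u = joint.sum(axis=0)              # marginal of u (columns)
    # Q[i, j] = p(x_i, u_j) / sqrt(p(x_i) * p(u_j))
    q = joint / np.sqrt(np.outer(p_x, p_u))
    # (Q Q^T)[i, k] = p(x_i, x_k) / sqrt(p(x_i) * p(x_k))
    return q @ q.T

# Hypothetical 3 x 2 joint table, used only for illustration.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.25],
                  [0.10, 0.20]])

K = ncd_matrix(joint)
# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
eigvals = np.linalg.eigvalsh(K)[::-1]
print(eigvals)
```

In this discrete setting the matrix is symmetric and positive semidefinite by construction, and its leading eigenvalue equals 1 (with eigenvector $\sqrt{p(x)}$). When $x$ and $u$ are independent, all remaining eigenvalues are zero, so the rest of the spectrum reflects dependence.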

Figures (10)

Figure 1: Diagram of the bidirectional recursion process. (a) Given two r.p. $\mathbf{x}$ and $\mathbf{u}$ with a joint density, we treat the two conditional densities as bidirectional iterative functions. By recursively applying the two conditional mappings, two internal variables $\mathbf{x}'$ and $\mathbf{u}'$ are created. (b) The joint densities $p(x, x')$ and $p(u, u')$ obtained by applying marginalization evaluate the level of independence in the space of the internal variables, represented by red and blue regions. (c) Each joint density induces its NCD and Hilbert space (NCD-HS). By applying Mercer's theorem, the eigenspectrum and eigenfunctions are obtained. The spectrum is the statistical dependence measure, MSD.
Figure 2: Diagram describing the architecture and the FMCA cost function used to find the leading eigenvalues of the NCD. The pipeline has three parts: constructing the latent spaces for the input and reference data, optimizing the two neural networks, and obtaining the functional decomposition by applying two normalization schemes. (a) Given two r.p., FMCA is applied to quantify the statistical dependence between them. For the reference r.p., we follow the design rules introduced in Section \ref{reference_process} for both supervised and unsupervised learning. (b) The optimization of the two neural networks follows a special bidirectional optimization scheme, where the output of each network is used as the target for the other in alternation. (c) After the networks are trained, we apply two normalization schemes to obtain the NCD eigenvalues and the CDR decomposition.
Figure 3: Results for joint Gaussian distributions with varying correlation coefficients. Visualization of bases, learning curves, and CDR for standardized Gaussians at correlation $\rho=0.86$. (a) shows the growth of the learned MSD eigenvalues as the correlation between the two distributions increases. (b) compares the approximated basis functions with the theoretical ground truth, given by Hermite polynomials. (c) shows the smooth convergence of the MSD eigenvalues, sequentially from smallest to largest. (d) shows the approximated CDR, which combines the learned bases and spectrum, with a low error of $10^{-4}$. The visualization in (d) is created by multiplying the CDR by the product of the marginals to obtain the joint density.
Figure 4: MSD of FMCA, compared to those produced by KICA/HSIC and the ground truth from the Nyström method. Despite varying TSD definitions, the decompositions of KICA and HSIC are consistent and comparable to MSD. Because they rely on Gaussian kernels in an RKHS, with kernel size as a hyperparameter, their approximations are always biased. FMCA, in contrast, accurately approximates the ground truth. With similar TSD, MSD can vary greatly, which scalar measures like mutual information fail to capture. Eigenvalues of KICA/HSIC are generated by decomposing the centered and normalized Gram matrix for the cross-covariance operator.
Figure 5: Advantages of FMCA over KICA, HSIC, and MINE. (a) For a multivariate standardized Gaussian, all measures capture increasing statistical dependence as correlations increase, with less variance in FMCA. (b) In the SPIRAL dataset with varying noise levels, FMCA outperforms KICA and HSIC in terms of accuracy and stability. (c) MINE can diverge as the Gaussians become nearly a one-to-one correspondence; FMCA avoids this instability issue by truncating small eigenvalues.
...and 5 more figures

Theorems & Definitions (17)

Definition 1
Lemma 2
Corollary 3
Lemma 4
Definition 5
Lemma 6
Remark 7
Lemma 8
Example 1
Definition 9
...and 7 more
