The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making

Shujian Yu; Hongming Li; Sigurd Løkse; Robert Jenssen; José C. Príncipe

TL;DR

The classic CS divergence is extended to quantify the closeness between two conditional distributions and it is shown that the developed conditional CS divergence can be elegantly estimated by a kernel density estimator from given samples.

Abstract

The Cauchy-Schwarz (CS) divergence was developed by Príncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., rigorous faithfulness guarantee, lower computational complexity, higher statistical power, and much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely time series clustering and uncertainty-guided exploration for sequential decision making. The code of conditional CS divergence is available at https://github.com/SJYuCNEL/conditional_CS_divergence.
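As a point of reference for the kernel-based estimation mentioned above, the following is a minimal sketch of a plug-in estimate of the classic (unconditional) CS divergence, $D_{CS}(p;q) = -\log\left( (\int p q)^2 / (\int p^2 \int q^2) \right)$, computed from Gaussian Gram matrices. It is not the paper's conditional estimator; the function names `gaussian_gram` and `cs_divergence` and the kernel width `sigma` are assumptions made for illustration only.

```python
import numpy as np


def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (m, d) and b (n, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def cs_divergence(x_p, x_q, sigma=1.0):
    """Plug-in estimate of the classic (unconditional) CS divergence.

    x_p, x_q: sample arrays of shape (m, d) and (n, d).
    sigma: kernel width (hypothetical default; tune it for real data).
    """
    k_pp = gaussian_gram(x_p, x_p, sigma).mean()  # estimates the integral of p^2
    k_qq = gaussian_gram(x_q, x_q, sigma).mean()  # estimates the integral of q^2
    k_pq = gaussian_gram(x_p, x_q, sigma).mean()  # estimates the integral of p*q
    # Kernel normalization constants cancel in this combination.
    return np.log(k_pp) + np.log(k_qq) - 2.0 * np.log(k_pq)


rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 2))
shifted = samples + 3.0
d_same = cs_divergence(samples, samples)
d_far = cs_divergence(samples, shifted)
```

Under these assumptions, `d_same` is zero up to round-off and `d_far` is strictly positive. The conditional version developed in the paper additionally involves kernels on the conditioning variable $\mathbf{x}$; see the linked repository for the authors' implementation.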

Paper Structure (36 sections, 4 theorems, 92 equations, 10 figures, 9 tables, 4 algorithms)

Introduction
Background Knowledge
Problem Formulation
Existing Measures of $D(p (\mathbf{y}|\mathbf{x}); q(\mathbf{y}|\mathbf{x}))$
The Conditional Cauchy-Schwarz divergence
Extending Cauchy-Schwarz divergence for conditional distributions
Two special cases of conditional CS divergence
$p(\mathbf{y}_1|\mathbf{x})$ with respect to $p(\mathbf{y}_2|\mathbf{x})$
$p(\mathbf{y}|\mathbf{x}_1)$ with respect to $p(\mathbf{y}|\{\mathbf{x}_1,\mathbf{x}_2\})$
Numerical Simulations on Synthetic Data
Simulation I
Simulation II
Applications to Time Series Data and Sequential Decision Making
Time Series Clustering
Uncertainty-Guided Exploration for Sequential Decision Making
...and 21 more sections

Key Result

Proposition 1

The conditional CS divergence defined in Eq. (eq:conditional_CS) is a "faithful" measure of the closeness between $p(\mathbf{y}|\mathbf{x})$ and $q(\mathbf{y}|\mathbf{x})$.

Figures (10)

Figure 1: To evaluate the expected value of cross-distribution similarity for $\mathbf{y}_i^p$, the weight on $L_{ij}^{pq}$ is determined only by $K_{ij}^{pq}$, and is independent of $K_{i1}^{pq}$, $\cdots$, $K_{i,j-1}^{pq}$, $K_{i,j+1}^{pq}$, $\cdots$, $K_{iN}^{pq}$.
Figure 2: The root mean square error (RMSE) of the regression network trained with MSE loss and conditional CS divergence loss on test data in each epoch.
Figure 3: The ground truth (GT) causal graph and the graphs identified by linear Granger causality (LGC), kernel Granger causality (KGC), transfer entropy (TE) with a $k$NN estimator, and our causal score with CS divergence $(\text{CS})^{2}$. The blue solid line represents the detected bivariate causal direction (after the significance test). The orange dashed curve represents the anti-causal direction that could be incorrectly detected (i.e., a false positive). The ratio next to the curve is the frequency of false positives over $10$ independent trials.
Figure 4: The ground truth task structure (first column) and the structures learned by the conditional CS divergence (second column), the conditional KL divergence (third column), the conditional von Neumann divergence (fourth column), and the conditional MMD (fifth column) when the input variable $\mathbf{x}$ is Gaussian distributed (first row) and uniformly distributed (second row), respectively. We connect each task with its $3$ nearest tasks.
Figure 5: Reformulating a time series $\{\mathbf{x}_t\}$ into a Hankel matrix.
...and 5 more figures

Theorems & Definitions (11)

Proposition 1
proof
Proposition 2
Remark 1: Difference between CS and conditional CS
Remark 2: Difference between conditional CS and conditional MMD
Proposition 3
Proposition 4
proof
proof
proof
...and 1 more
