The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making

Shujian Yu, Hongming Li, Sigurd Løkse, Robert Jenssen, José C. Príncipe

TL;DR

The classic CS divergence is extended to quantify the closeness between two conditional distributions, and the resulting conditional CS divergence can be estimated from given samples with a kernel density estimator.

Abstract

The Cauchy-Schwarz (CS) divergence was developed by Príncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the resulting conditional CS divergence can be estimated simply from given samples by a kernel density estimator. We illustrate the advantages of our conditional CS divergence (e.g., a rigorous faithfulness guarantee, lower computational complexity, higher statistical power, and much more flexibility in a wide range of applications) over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of the conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely time series clustering and uncertainty-guided exploration for sequential decision making. The code for the conditional CS divergence is available at https://github.com/SJYuCNEL/conditional_CS_divergence.
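
As background for the abstract, here is a minimal sketch of the classic (unconditional) CS divergence, $D_{CS}(p;q) = -\log\big( (\int pq)^2 / (\int p^2 \int q^2) \big)$, estimated from samples with Gaussian kernels. It is not the paper's conditional estimator. The function names `gaussian_gram` and `cs_divergence` and the kernel width `sigma` are illustrative assumptions, not taken from the authors' repository. The sketch uses the usual plug-in form: the log of the mean within-sample Gram entries for each sample, minus twice the log of the mean cross-sample Gram entries.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Gaussian Gram matrix between the rows of a (n, d) and b (m, d)."""
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2 a.b
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def cs_divergence(xp, xq, sigma=1.0):
    """Plug-in estimate of the classic CS divergence between two samples.

    Kernel normalization constants cancel in the ratio, so they are omitted.
    """
    kpp = gaussian_gram(xp, xp, sigma).mean()  # estimates the integral of p^2
    kqq = gaussian_gram(xq, xq, sigma).mean()  # estimates the integral of q^2
    kpq = gaussian_gram(xp, xq, sigma).mean()  # estimates the integral of p*q
    return np.log(kpp) + np.log(kqq) - 2.0 * np.log(kpq)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    y = rng.normal(loc=3.0, size=(200, 2))
    print("same sample:   ", cs_divergence(x, x))
    print("shifted sample:", cs_divergence(x, y))
```

Because the Gaussian kernel is positive semidefinite, the Cauchy-Schwarz inequality keeps this estimate non-negative up to round-off, and it is zero when both arguments are the same sample. The choice of `sigma` strongly affects the value in practice.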

Paper Structure

This paper contains 36 sections, 4 theorems, 92 equations, 10 figures, 9 tables, 4 algorithms.

Key Result

Proposition 1

The conditional CS divergence defined in Eq. (eq:conditional_CS) is a "faithful" measure of the closeness between $p(\mathbf{y}|\mathbf{x})$ and $q(\mathbf{y}|\mathbf{x})$.

Figures (10)

  • Figure 1: To evaluate the expected value of cross-distribution similarity for $\mathbf{y}_i^p$, the weight on $L_{ij}^{pq}$ is determined only by $K_{ij}^{pq}$, and is independent of $K_{i1}^{pq}$, $\cdots$, $K_{i,j-1}^{pq}$, $K_{i,j+1}^{pq}$, $\cdots$, $K_{iN}^{pq}$.
  • Figure 2: The root mean square error (RMSE) on test data, in each epoch, of the regression network trained with the MSE loss and with the conditional CS divergence loss.
  • Figure 3: The ground truth (GT) causal graph and the graphs identified by linear Granger causality (LGC), kernel Granger causality (KGC), transfer entropy (TE) with a $k$NN estimator, and our causal score with CS divergence $(\text{CS})^{2}$. The blue solid line represents the detected bivariate causal direction (after the significance test). The orange dashed curve represents the anti-causal direction that could be incorrectly detected (i.e., a false positive). The ratio next to the curve is the fraction of $10$ independent trials in which a false positive occurred.
  • Figure 4: The ground truth task structure (first column) and the structures learned by the conditional CS divergence (second column), the conditional KL divergence (third column), the conditional von Neumann divergence (fourth column), and the conditional MMD (fifth column), when the input variable $\mathbf{x}$ is Gaussian distributed (first row) and uniformly distributed (second row). We connect each task with its $3$ nearest tasks.
  • Figure 5: Reformulating a time series $\{\mathbf{x}_t\}$ into a Hankel matrix.
  • ...and 5 more figures
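
The Hankel reformulation named in the Figure 5 caption can be sketched as follows. This is one common construction, in which row $i$ of the matrix holds the window $\mathbf{x}_i, \ldots, \mathbf{x}_{i+w-1}$. The function name `hankel_embed` and the window length `window` are illustrative assumptions, and the paper's exact layout is not given in this listing.

```python
import numpy as np

def hankel_embed(series, window):
    """Stack sliding windows of a time series into a Hankel-style matrix.

    series: array of shape (T,) or (T, d)
    window: number of consecutive time steps per row (hypothetical parameter)
    Returns an array of shape (T - window + 1, window * d).
    """
    series = np.asarray(series)
    if series.ndim == 1:
        series = series[:, None]  # treat a univariate series as (T, 1)
    n_rows = series.shape[0] - window + 1
    if n_rows < 1:
        raise ValueError("window is longer than the series")
    # idx[i, j] = i + j, so entry (i, j) of the result is series[i + j].
    idx = np.arange(n_rows)[:, None] + np.arange(window)[None, :]
    return series[idx].reshape(n_rows, -1)

if __name__ == "__main__":
    print(hankel_embed(np.arange(6), window=3))
```

For a univariate series, each anti-diagonal of the result is constant, which is the defining property of a Hankel matrix.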

Theorems & Definitions (11)

  • Proposition 1
  • proof
  • Proposition 2
  • Remark 1: Difference between CS and conditional CS
  • Remark 2: Difference between conditional CS and conditional MMD
  • Proposition 3
  • Proposition 4
  • proof
  • proof
  • proof
  • ...and 1 more