Cauchy-Schwarz Divergence Information Bottleneck for Regression

Shujian Yu; Xi Yu; Sigurd Løkse; Robert Jenssen; Jose C. Principe

TL;DR

The paper tackles regression within the Information Bottleneck (IB) framework and introduces CS-IB, which uses a Cauchy–Schwarz (CS) divergence formulation to replace variational MI bounds and Gaussian assumptions. It defines a CS-based IB objective that combines a prediction term $D_{CS}(p(y|\mathbf{x}); q_{\theta}(\hat{y}|\mathbf{x}))$ with a CS–QMI compression term $I_{CS}(\mathbf{x};\mathbf{t})$, both of which can be estimated nonparametrically with kernel methods. The theoretical results relate the CS divergence to the KL divergence, connect the compression term to generalization and robustness, and provide an adversarial-perturbation bound. Empirically, CS-IB outperforms several deep IB baselines on six real-world regression tasks and two high-dimensional datasets, while achieving favorable information-plane trade-offs. The work argues that nonparametric CS-based objectives can improve regression generalization and adversarial robustness without distributional assumptions on the decoder, offering a practical and scalable alternative to KL-based IB approaches.
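
For reference, the two CS-based terms named in the summary can be written out using the standard definitions of the CS divergence and of CS-QMI. The combined objective on the last line is only a sketch: the placement of the trade-off weight $\beta$ is an assumption made here for illustration, not a form confirmed by this summary.

```latex
% CS divergence between two densities p and q:
% symmetric, non-negative, and zero if and only if p = q.
D_{\text{CS}}(p;q) = -\log
  \frac{\left(\int p(\mathbf{z})\,q(\mathbf{z})\,d\mathbf{z}\right)^{2}}
       {\int p(\mathbf{z})^{2}\,d\mathbf{z}\;\int q(\mathbf{z})^{2}\,d\mathbf{z}}

% CS-QMI: the CS divergence between the joint density
% and the product of its marginals.
I_{\text{CS}}(\mathbf{x};\mathbf{t}) =
  D_{\text{CS}}\big(p(\mathbf{x},\mathbf{t});\,p(\mathbf{x})\,p(\mathbf{t})\big)

% Assumed form of the combined objective (beta placement is illustrative).
\min_{\theta}\;
  D_{\text{CS}}\big(p(y|\mathbf{x});\,q_{\theta}(\hat{y}|\mathbf{x})\big)
  + \beta\, I_{\text{CS}}(\mathbf{x};\mathbf{t})
```

Non-negativity follows directly from the Cauchy-Schwarz inequality applied to $\int p\,q$, which is where the divergence takes its name.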

Abstract

The information bottleneck (IB) approach is a popular way to improve the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). In the IB, MI is mostly expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on a mean squared error (MSE) loss under a Gaussian assumption and compression approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. By doing so, we move away from MSE-based regression and ease estimation by avoiding variational approximations or distributional assumptions. We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at \url{https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck}.
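
To make concrete what estimation without variational approximations or distributional assumptions can look like, here is a minimal NumPy sketch of the standard kernel (Parzen-window) plug-in estimators for the CS divergence between two sample sets and for CS-QMI. It is an illustration under stated assumptions, not the authors' implementation: the function names `gaussian_gram`, `cs_divergence` and `cs_qmi` are hypothetical, the Gaussian kernel with a fixed bandwidth is an assumption, and the paper's own estimator for the conditional term $D_{\text{CS}}(p(y|\mathbf{x});q_\theta(\hat{y}|\mathbf{x}))$ (its Proposition 2) is not reproduced here.

```python
import numpy as np


def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (M, d) and b (N, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def cs_divergence(x, y, sigma=1.0):
    """Plug-in estimate of D_CS(p; q) from samples x ~ p and y ~ q.

    Each integral in the definition is replaced by the mean of a Gram matrix:
    log(mean Kxx) + log(mean Kyy) - 2 * log(mean Kxy).
    """
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy)


def cs_qmi(x, t, sigma_x=1.0, sigma_t=1.0):
    """Plug-in estimate of CS-QMI from paired samples (x_i, t_i)."""
    K = gaussian_gram(x, x, sigma_x)  # Gram matrix on the inputs
    L = gaussian_gram(t, t, sigma_t)  # Gram matrix on the representations
    joint = (K * L).mean()                           # joint density term
    marginal = K.mean() * L.mean()                   # product-of-marginals term
    cross = (K.mean(axis=1) * L.mean(axis=1)).mean() # cross term
    return np.log(joint) + np.log(marginal) - 2.0 * np.log(cross)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    print("D_CS(x, x)       :", cs_divergence(x, x))
    print("D_CS(x, x + 3)   :", cs_divergence(x, x + 3.0))
    print("I_CS(x; x)       :", cs_qmi(x, x))
    print("I_CS(x; noise)   :", cs_qmi(x, rng.normal(size=(200, 2))))
```

Both estimators cost $O(N^2)$ in the number of samples because they use full Gram matrices, and their values depend on the bandwidth, so in practice `sigma` would need to be chosen with some care (for example by a median-distance heuristic, which is not shown here).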

Paper Structure (43 sections, 17 theorems, 135 equations, 15 figures, 9 tables)

Introduction
Background Knowledge
Problem Formulation and Variants of IB Lagrangian
Approximation to $I(y;\mathbf{t})$
Approximation to $I(\mathbf{x};\mathbf{t})$
Cauchy-Schwarz Divergence and its Induced Measures
Cauchy-Schwarz Quadratic Mutual Information (CS-QMI)
The Cauchy-Schwarz Divergence Information Bottleneck
Estimation of CS Divergence Induced Terms
The Rationality of the Regularization Term $I_{\text{CS}}(\mathbf{x};\mathbf{t})$
Effects of $I_{\text{CS}}(\mathbf{x};\mathbf{t})$ on Generalization
Adversarial Robustness Guarantee
Experiments
Behaviors in the Information Plane
Adversarial Robustness
...and 28 more sections

Key Result

Proposition 1

\cite{rodriguez2019information} With a Gaussian assumption on $q_\theta(\hat{y}|\mathbf{t})$, maximizing $I_\theta(y;\mathbf{t})$ essentially minimizes $D_{\text{KL}}(p(y|\mathbf{x});q_\theta (\hat{y}|\mathbf{x}))$, and both can be approximated by minimizing an MSE loss.
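
The step from the KL term in Proposition 1 to an MSE loss can be sketched as follows. The derivation assumes a Gaussian decoder with mean $f_\theta(\mathbf{x})$ and a fixed variance $\sigma^2$; the symbols $f_\theta$ and $\sigma$ are introduced here for illustration and are not taken from the source.

```latex
% Assumed decoder: q_\theta(\hat{y}|\mathbf{x}) = \mathcal{N}(f_\theta(\mathbf{x}), \sigma^2)

% Split the KL divergence; the first term does not depend on \theta.
D_{\text{KL}}\big(p(y|\mathbf{x});\,q_\theta(\hat{y}|\mathbf{x})\big)
  = \mathbb{E}_{p(y|\mathbf{x})}\big[\log p(y|\mathbf{x})\big]
  - \mathbb{E}_{p(y|\mathbf{x})}\big[\log q_\theta(y|\mathbf{x})\big]

% Negative log-likelihood of the Gaussian decoder.
-\log q_\theta(y|\mathbf{x})
  = \frac{\big(y - f_\theta(\mathbf{x})\big)^{2}}{2\sigma^{2}}
  + \frac{1}{2}\log\big(2\pi\sigma^{2}\big)

% With \sigma fixed, only the squared error depends on \theta.
\arg\min_\theta\; \mathbb{E}_{p(\mathbf{x})}\Big[D_{\text{KL}}\big(p(y|\mathbf{x});\,q_\theta(\hat{y}|\mathbf{x})\big)\Big]
  = \arg\min_\theta\; \mathbb{E}_{p(\mathbf{x},y)}\Big[\big(y - f_\theta(\mathbf{x})\big)^{2}\Big]
```

Replacing the expectation with an average over training pairs gives the usual empirical MSE loss.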

Figures (15)

Figure 1: Information plane diagrams on California Housing and Beijing PM2.5 datasets.
Figure 2: Geometrical interpretation of CS-QMI and HSIC, in which $\mathcal{H}=\mathcal{F}\otimes\mathcal{G}$. When $\|\mu(\mathbb{P}_{XT})\|_{\mathcal{H}} = \| \mu(\mathbb{P}_{X}\otimes \mathbb{P}_{T}) \|_{\mathcal{H}} = \|\mu\|$, CS-QMI and HSIC have a monotonic relationship.
Figure 3: Simulation supporting Lemma \ref{lemma_CS_QMI_HSIC}.
Figure 4: Connection of the CS divergence to MMD and the KL divergence.
Figure 5: KL divergence is infinite when $\text{supp}(p)$ and $\text{supp}(q)$ overlap but neither is a subset of the other. CS divergence has no such support constraint.
...and 10 more figures

Theorems & Definitions (31)

Proposition 1
proof
Remark 1
Remark 2
Proposition 2: Empirical Estimator of $D_{\text{CS}}(p(y|\mathbf{x});q_\theta(\hat{y}|\mathbf{x}))$
Remark 3
Remark 4
Proposition 3: Empirical Estimator of CS-QMI
Remark 5
Theorem 1
...and 21 more
