Towards Self-Supervised Covariance Estimation in Deep Heteroscedastic Regression

Megh Shukla, Aziz Shameem, Mathieu Salzmann, Alexandre Alahi

TL;DR

This work tackles the challenge of estimating input-dependent covariance in deep heteroscedastic regression without direct supervision. It analyzes the KL divergence and the 2-Wasserstein distance, derives an upper bound on the 2-Wasserstein distance for non-commutative covariances that is stable to optimize and avoids costly eigendecompositions, and introduces a neighborhood-based pseudo-labeling strategy for self-supervision. Empirically, the proposed 2-Wasserstein bound combined with pseudo-labels achieves accurate mean and covariance estimates at lower computational cost across synthetic and real datasets, including human pose, with a notable benefit from a hybrid training approach. Altogether, the approach offers a practical, scalable pathway to uncertainty estimation in complex regression tasks.

Abstract

Deep heteroscedastic regression models the mean and covariance of the target distribution through neural networks. The challenge arises from heteroscedasticity, which implies that the covariance is sample-dependent and is often unknown. Consequently, recent methods learn the covariance through unsupervised frameworks, which unfortunately yield a trade-off between computational complexity and accuracy. While this trade-off could be alleviated through supervision, obtaining labels for the covariance is non-trivial. Here, we study self-supervised covariance estimation in deep heteroscedastic regression. We address two questions: (1) How should we supervise the covariance assuming ground truth is available? (2) How can we obtain pseudo-labels in the absence of ground truth? We address (1) by analysing two popular measures: the KL Divergence and the 2-Wasserstein distance. Subsequently, we derive an upper bound on the 2-Wasserstein distance between normal distributions with non-commutative covariances that is stable to optimize. We address (2) through a simple neighborhood-based heuristic algorithm, which results in surprisingly effective pseudo-labels for the covariance. Our experiments over a wide range of synthetic and real datasets demonstrate that the proposed 2-Wasserstein bound coupled with pseudo-label annotations results in computationally cheaper yet accurate deep heteroscedastic regression.
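
For reference, both measures named in question (1) have well-known closed forms between two normal distributions $\mathcal{N}(\mu_p, \Sigma_p)$ and $\mathcal{N}(\mu_q, \Sigma_q)$. The NumPy sketch below evaluates these standard closed forms. It is not the upper bound derived in the paper, and the function names (`sqrtm_psd`, `kl_gaussian`, `w2_squared_gaussian`) are hypothetical. Note that the matrix square root is computed here through an eigendecomposition.

```python
import numpy as np


def sqrtm_psd(mat):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    sym = 0.5 * (mat + mat.T)  # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(sym)
    eigvals = np.clip(eigvals, 0.0, None)  # clamp tiny negative eigenvalues
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ), standard closed form."""
    k = mu_p.shape[0]
    diff = mu_q - mu_p  # the residual between the two means
    cov_q_inv = np.linalg.inv(cov_q)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - k
        + logdet_q
        - logdet_p
    )


def w2_squared_gaussian(mu_p, cov_p, mu_q, cov_q):
    """Squared 2-Wasserstein distance between two Gaussians (exact form)."""
    root_q = sqrtm_psd(cov_q)
    cross = sqrtm_psd(root_q @ cov_p @ root_q)
    mean_term = np.sum((mu_p - mu_q) ** 2)
    cov_term = np.trace(cov_p + cov_q - 2.0 * cross)
    return mean_term + cov_term


if __name__ == "__main__":
    mu_p, cov_p = np.zeros(2), np.diag([1.0, 4.0])
    mu_q, cov_q = np.array([3.0, 4.0]), np.diag([9.0, 16.0])
    print("KL  :", kl_gaussian(mu_p, cov_p, mu_q, cov_q))
    print("W2^2:", w2_squared_gaussian(mu_p, cov_p, mu_q, cov_q))
```

In the KL form the mean residual is weighted by the inverse covariance, whereas in the 2-Wasserstein form the mean and covariance terms are separate sums.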

Paper Structure

This paper contains 18 sections, 3 theorems, 19 equations, 14 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Let $\bf{S} = \{{\bm{x}}, {\bm{y}}_i\}_{i=1}^{N}$ be a set of samples drawn from the unknown target $P(Y | X) = \mathcal{N}(\mu_Y(X), \Sigma_Y(X))$ for a given ${\bm{x}}$. We write each label ${\bm{y}}_i$ as a distribution $\mathcal{N}({\bm{y}}_i, \Sigma_Y^{\text{(prior)}}(X))$. Then, the optimal so

Figures (14)

  • Figure 1: Sub-optimal convergence due to residuals (Section \ref{sec:KL}). In addition to feature granularity \cite{seitzer2022on}, subpar convergence may occur due to the sensitivity of the negative log-likelihood and the KL-Divergence to residuals in the mini-batch. While we show that the KL-Divergence can act as a regularizer over the learnt covariance, the gradients for both methods are dominated by the residual term, slowing down convergence.
  • Figure 2: Visualizing convergence in bivariate regression (Section \ref{sec:2-w}). We observe that the KL-Divergence and the likelihood-based methods (vanilla negative log-likelihood and Faithful \cite{stirn2023faithful}) converge unstably because they are sensitive to the residuals. In comparison, the 2-Wasserstein based methods are more stable and accurate. This observation can also be replicated when the predicted mean is initialized at the same location as the true mean, shown in the appendix, Fig. \ref{fig:bivariate} (b). (Note: the metrics NLL / KL and 2-W are plotted in log-scale.)
  • Figure 3: Pseudo-Label (Section \ref{sec:pseudolabel}). Given ${\bm{x}}_0$, its pseudo-label is the variance in the targets ${\bm{y}}$ corresponding to samples which are the nearest neighbors of ${\bm{x}}_0$. Samples closer to ${\bm{x}}_0$ are given more importance than samples further away.
  • Figure 4: We sample from the ground truth sinusoidal $y = |x| \sin(2 \pi x)$ with $\sigma(x) = |x|$ and train our networks using different objectives. The 2-Wasserstein distance trained using pseudo-labels converges to the accurate mean and variance faster, since it does not depend on the residuals or on the convergence of the mean estimator to learn the variance.
  • Figure 5: (Multivariate: Metrics.) We simulate multivariate data with heteroscedastic covariance of increasing dimensionality (top row: 8, bottom row: 24). We observe that modeling heteroscedasticity is challenging without annotations, with some popular approaches diverging away from the true distribution. Our results highlight the potential of self-supervision for improved convergence.
  • ...and 9 more figures
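
The pseudo-label idea in the Figure 3 caption can be illustrated on the sinusoidal data from the Figure 4 caption. The following is a minimal NumPy sketch under stated assumptions, not the paper's algorithm: the number of neighbors `k`, the Gaussian-style distance weighting, the sampling range $[-3, 3]$, and the function names `sample_sinusoid` and `pseudo_label_covariance` are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_sinusoid(n, rng, low=-3.0, high=3.0):
    """Draw (x, y) with y = |x| sin(2 pi x) + noise, noise std sigma(x) = |x|.

    The sampling range [low, high] is an assumption for this sketch.
    """
    x = rng.uniform(low, high, size=(n, 1))
    noise = np.abs(x) * rng.standard_normal((n, 1))
    y = np.abs(x) * np.sin(2.0 * np.pi * x) + noise
    return x, y


def pseudo_label_covariance(x0, xs, ys, k=50):
    """Weighted covariance of the targets of the k nearest neighbors of x0.

    Closer neighbors get larger weights; the exact weighting used here is
    an illustrative assumption.
    """
    dists = np.linalg.norm(xs - x0, axis=1)
    idx = np.argsort(dists)[:k]  # indices of the k nearest samples
    d = dists[idx]
    scale = d.max() + 1e-12  # distance to the k-th neighbor
    weights = np.exp(-((d / scale) ** 2))
    weights = weights / weights.sum()
    y_nb = ys[idx]
    mean = weights @ y_nb  # weighted mean of neighbor targets
    centered = y_nb - mean
    return (centered * weights[:, None]).T @ centered


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs, ys = sample_sinusoid(20000, rng)
    for x0 in (0.5, 2.5):
        cov = pseudo_label_covariance(np.array([x0]), xs, ys, k=200)
        print(f"x0={x0}: pseudo-label variance={cov[0, 0]:.3f}, "
              f"true variance={x0 ** 2:.3f}")
```

Because the estimate uses only the targets of nearby samples, it does not depend on the network's predicted mean or on its residuals.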

Theorems & Definitions (6)

  • Lemma 1: Calibration
  • Theorem 1: 2-Wasserstein bound for non-commutative covariances
  • proof
  • Proposition 1
  • proof
  • proof