
Many Perception Tasks are Highly Redundant Functions of their Input Data

Rahul Ramesh, Anthony Bisulco, Ronald W. DiTullio, Linran Wei, Vijay Balasubramanian, Kostas Daniilidis, Pratik Chaudhari

TL;DR

The paper addresses why many perception tasks remain highly predictable even when input data are projected into subspaces with reduced variance. It systematically analyzes projections defined by PCA, Fourier, and wavelet bases across diverse tasks (classification, semantic segmentation, optical flow, depth estimation, and vocalization discrimination), using mutual information and partial information decomposition to reveal redundancy and synergy among subspaces. The key finding is that while the principal subspace is most predictive, substantial information about the task is distributed across the entire spectrum, including tail bands and even random subspaces, with deep networks predominantly relying on head information. These results have implications for neuroscience and deep learning theory, suggesting that redundancy in natural signals and tasks underpins robust representations and may inform more efficient learning strategies and architectural choices in practice.

Abstract

We show that many perception tasks, from visual recognition, semantic segmentation, optical flow, and depth estimation to vocalization discrimination, are highly redundant functions of their input data. Images or spectrograms projected into different subspaces, formed by orthogonal bases in the pixel, Fourier, or wavelet domains, can be used to solve these tasks remarkably well, regardless of whether it is the top subspace where data varies the most, some intermediate subspace with moderate variability, or the bottom subspace where data varies the least. This phenomenon occurs because different subspaces have a large degree of redundant information relevant to the task.
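
To make the idea of projecting inputs into a "top", "intermediate", or "bottom" subspace concrete, here is a minimal NumPy sketch of a PCA band projection. It is not the authors' code: the function name `pca_band_projection`, the synthetic data, and the band boundaries are all assumptions chosen for illustration, and the "explained" fraction is computed as the share of total variance carried by the selected eigenvectors.

```python
import numpy as np

def pca_band_projection(X, start, stop):
    """Project rows of X onto PCA eigenvectors with indices in [start, stop).

    Index 0 is the direction of largest variance. Returns the projected data
    (in the original coordinates, with the mean added back) and the fraction
    of total variance carried by the selected band.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, sorted by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    band = Vt[start:stop]                      # (k, d) orthonormal rows
    X_band = Xc @ band.T @ band + mean         # orthogonal projection onto the band
    explained = float((s[start:stop] ** 2).sum() / (s ** 2).sum())
    return X_band, explained

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened images (an assumption, not real data):
# the standard deviation decays with the coordinate index.
scales = 1.0 / np.arange(1, 33)
X = rng.normal(size=(500, 32)) * scales

head, head_var = pca_band_projection(X, 0, 8)     # top subspace
tail, tail_var = pca_band_projection(X, 24, 32)   # bottom subspace
full, full_var = pca_band_projection(X, 0, 32)    # all directions recover X
```

In the setting described above, a separate network would be trained on each projected dataset (for example `head` or `tail`) and the resulting test accuracies compared. The sketch only produces the projected inputs.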

Paper Structure

This paper contains 40 sections, 6 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: (a) Eigenvalues of the pixel-wise covariance matrix for inputs and outputs of different tasks are spread across a large range and decay quickly. (b) Variance or energy decays quickly with an increase in the index for PCA, Fourier and wavelet bases. (c) Index of wavelet or Fourier basis element (y-axis) that has the highest amplitude for images projected onto a PCA eigenvector of a particular index (x-axis). High Fourier and wavelet indices (large radial frequency and large scale, respectively) correspond to PCA eigenvectors with higher indices (or smaller eigenvalues).
  • Figure 2: Schematic of Principal Components Analysis, Fourier and wavelet basis.
  • Figure 3: Panel (a) shows that the image, when projected on a high frequency band (30--45) cannot be recognized by the human eye; and yet a network trained on such images can get more than 65% test accuracy. We show the test accuracy (for CIFAR10 (b) and ImageNet (c)) of networks trained on images projected onto different subspaces. Remarkably, for ImageNet, all frequency bands achieve more than 60% accuracy. Almost all PCA subspaces, radial frequencies and scales are useful for image classification on CIFAR-10 and ImageNet; observe that low pass, band pass and low-index high pass regimes all obtain good test accuracy. However, the head of the spectrum usually contains more discriminative information than the tail. (d) For dense perception tasks such as semantic segmentation, optical flow and depth prediction, the results are consistent with classification, i.e., the information for the task is also present redundantly across the spectrum. Many frequency bands result in remarkably low errors on these tasks. Error barely improves with index for low pass filters, indicating diminishing returns on these tasks as higher frequencies are included in the data.
  • Figure 4: Error on perception tasks is remarkably low even when networks are trained on images projected onto a random linear subspace; see \ref{app:rnd_bands} for the parameters of the randomly chosen center frequencies and widths for the band pass filter. This suggests that information for the task is present throughout the spectrum. The "explained" power is less than 20% for all bands of randomly chosen frequencies, for all three panels.
  • Figure 5: The amplitude of noise in images from the M3ED dataset is roughly constant across the spectrum and smaller than the signal, even in the tail. Signal-to-noise ratio (SNR) is larger in the head than in the tail.
  • ...and 18 more figures
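
The captions above refer to images projected onto radial frequency bands (for example, band 30--45) and to the "explained" power of a band. The following is a minimal sketch of one way such a band-pass projection could be computed with NumPy's FFT. The function name `radial_band_pass`, the half-open band convention, the synthetic image, and the definition of explained power as the band's share of total spectral power are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def radial_band_pass(img, lo, hi):
    """Keep Fourier coefficients whose radial frequency r satisfies lo <= r < hi.

    Returns the filtered image and the fraction of total spectral power
    that falls inside the band.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h) * h      # integer frequency indices along rows
    fx = np.fft.fftfreq(w) * w      # integer frequency indices along columns
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = (radius >= lo) & (radius < hi)
    F = np.fft.fft2(img)
    # The mask is symmetric under frequency negation, so the inverse
    # transform is real up to rounding error.
    filtered = np.fft.ifft2(F * mask).real
    power = np.abs(F) ** 2
    frac = float(power[mask].sum() / power.sum())
    return filtered, frac

rng = np.random.default_rng(0)
n = 32
f = np.fft.fftfreq(n) * n
r = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)
# Synthetic image whose amplitude spectrum decays roughly as 1 / (1 + r);
# this stands in for a natural image and is an assumption.
img = np.fft.ifft2(np.fft.fft2(rng.normal(size=(n, n))) / (1.0 + r)).real

low, low_frac = radial_band_pass(img, 0, 8)
mid, mid_frac = radial_band_pass(img, 8, 16)
high, high_frac = radial_band_pass(img, 16, 23)   # max radius is about 22.6 for n = 32
```

Because the three bands partition the frequency plane, the filtered images sum back to the original and the power fractions sum to one. For a spectrum that decays with frequency, the low band carries most of the power.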