Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks

Bruno Corcuera, Carlos Eiras-Franco, Brais Cancela

TL;DR

This work tackles the degradation of vision latent representations caused by noisy or irrelevant features by introducing unsupervised Dynamic Feature Selection (DFS), realized as the DDS module, which masks input features per sample so that at most $M$ features are kept before downstream processing. DDS uses a differentiable hard-concrete gate to generate a per-sample top-$M$ mask that preserves the 2-D structure of the input, so it can be prepended to existing architectures for unsupervised tasks such as clustering and world-model latent learning. The authors report improved clustering performance across multiple datasets with fewer input features, and better reconstruction fidelity and agent performance in world-model RL settings, with competitive or lower parameter counts. Because the selection mechanism is label-free and architecture-agnostic, DDS improves the robustness and interpretability of latent spaces in vision tasks and may apply more broadly to unsupervised learning and generative modeling.
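
The per-sample top-$M$ masking described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy `image`, and the `log_alpha` scores (standing in for the output of a selector network) are all hypothetical. The gate follows the standard hard-concrete formulation from the $L_0$-regularization literature with its usual default constants; how the paper combines the gate with the top-$M$ selection during training is not specified here, so the two pieces are shown separately.

```python
import math
import random


def hard_concrete_gate(log_alpha, rng, beta=2 / 3, gamma=-0.1, zeta=1.1):
    """Sample one hard-concrete gate value in [0, 1] (standard formulation).

    Assumes a moderate log_alpha; very large magnitudes could overflow math.exp.
    """
    u = min(max(rng.random(), 1e-6), 1 - 1e-6)  # keep u away from 0 and 1
    logit = (math.log(u) - math.log(1 - u) + log_alpha) / beta
    s = 1 / (1 + math.exp(-logit))              # concrete (relaxed Bernoulli) sample
    stretched = s * (zeta - gamma) + gamma      # stretch to (gamma, zeta)
    return min(1.0, max(0.0, stretched))        # clip back to [0, 1]


def top_m_mask(scores, m):
    """Binary mask shaped like `scores` that keeps the m highest-scoring entries."""
    h, w = len(scores), len(scores[0])
    m = max(0, min(m, h * w))                   # never keep more than exists
    ranked = sorted(
        ((scores[r][c], r, c) for r in range(h) for c in range(w)),
        key=lambda t: t[0],
        reverse=True,
    )
    mask = [[0] * w for _ in range(h)]
    for _, r, c in ranked[:m]:
        mask[r][c] = 1
    return mask


def apply_mask(image, mask):
    """Zero out unselected features; the 2-D layout of the input is preserved."""
    return [[px * keep for px, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


# Toy 2x3 "image" and hypothetical per-sample selector scores.
image = [[0.1, 0.9, 0.3],
         [0.7, 0.2, 0.8]]
log_alpha = [[-2.0, 3.0, -1.0],
             [2.0, -3.0, 2.5]]

rng = random.Random(0)
gates = [[hard_concrete_gate(a, rng) for a in row] for row in log_alpha]

mask = top_m_mask(log_alpha, m=2)   # keep at most M = 2 features for this sample
masked = apply_mask(image, mask)
print(mask)    # [[0, 1, 0], [0, 0, 1]]
print(masked)  # [[0.0, 0.9, 0.0], [0.0, 0.0, 0.8]]
```

Because the mask has the same shape as the input, the masked sample can be passed to a downstream model without changing that model's input layer, which is the property the summary refers to as preserving 2-D structure.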

Abstract

Latent representations are critical for the performance and robustness of machine learning models, as they encode the essential features of data in a compact and informative manner. However, in vision tasks, these representations are often affected by noisy or irrelevant features, which can degrade the model's performance and generalization capabilities. This paper presents a novel approach for enhancing latent representations using unsupervised Dynamic Feature Selection (DFS). For each instance, the proposed method identifies and removes misleading or redundant information in images, ensuring that only the most relevant features contribute to the latent space. By leveraging an unsupervised framework, our approach avoids reliance on labeled data, making it broadly applicable across various domains and datasets. Experiments conducted on image datasets demonstrate that models equipped with unsupervised DFS achieve significant improvements in generalization performance across various tasks, including clustering and image generation, while incurring a minimal increase in the computational cost.

Paper Structure

This paper contains 31 sections, 4 theorems, 22 equations, 11 figures, 8 tables.

Key Result

Lemma A.1

In deep networks, there may exist $i \in [n-r, n]$ such that $\| f(x^{\text{rel}} + \epsilon^i) - f(x^{\text{rel}}) \| \gg 0$.

Figures (11)

  • Figure 1: Proposed method. The DDS module is prepended to an existing architecture (an autoencoder in this case), substituting its input with an equally shaped masked version that retains only the most relevant features. DDS is in charge of selecting, for each sample, the most relevant features for the downstream architecture to solve the unsupervised task (in this particular example, data reconstruction).
  • Figure 2: Adaptation of the DDS architecture to the World Model problem. The previously presented DDS architecture (green and blue) is augmented with a Variational Autoencoder (red) that reconstructs the masked inputs (i.e. the relevant features of the input image), yielding a structured latent space. Training is divided into two steps: (1) the DDS is trained without the VAE section (i.e. $\mathbf{h}=\mathbf{\hat{h}}$) to select the relevant features of each image, and then (2) the VAE is trained to compute the reconstruction $\mathbf{\hat{h}}$ of $\mathbf{h}$.
  • Figure 3: Comparison of dream sequence generation by the World Model. a) Original Vision model with a VAE architecture. b) Proposed (VAE+DDS with M=4%) as Vision model.
  • Figure 4: DDS(10%) + ProPos clustering NMI over CIFAR-10, using a ResNet-18 as backbone.
  • Figure 5: Visualization of DDS masks generated at varying selection percentages for the CarRacing-v3 environment. The rows display different input frames. The columns show (from left to right): the original input ($\mathbf{X}$) and the DDS output ($g(\mathbf{X})$) with M = 1%, 2%, 3%, 5%, and 8%. This figure illustrates how the DDS module selects salient features at different sparsity levels.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • Theorem A.4
  • proof