
A Self-Organizing Clustering System for Unsupervised Distribution Shift Detection

Sebastián Basterrech, Line Clemmensen, Gerardo Rubino

TL;DR

This work proposes a continual learning framework for monitoring and detecting distribution changes in a latent space generated by a bio-inspired self-organizing clustering and investigates the projections made by two topology-preserving maps: the Self-Organizing Map and the Scale Invariant Map.

Abstract

Modeling non-stationary data is a challenging problem in continual learning, and shifts in the data distribution can degrade the performance of a machine learning model. Classic learning tools are often vulnerable to perturbations of the input covariates and sensitive to outliers and noise, and some rely on rigid algebraic assumptions. Distribution shifts occur frequently, for example due to changes in raw materials for production, seasonality, a different user base, or adversarial attacks. There is therefore a need for more effective distribution shift detection techniques. In this work, we propose a continual learning framework for monitoring and detecting distribution changes. We explore the problem in a latent space generated by a bio-inspired self-organizing clustering, and we study statistical aspects of that latent space. In particular, we investigate the projections made by two topology-preserving maps: the Self-Organizing Map and the Scale Invariant Map. Our method can be applied in both supervised and unsupervised contexts. We formulate the assessment of changes in the data distribution as a comparison of Gaussian signals, which makes the proposed method fast and robust. We compare it to other unsupervised techniques, specifically Principal Component Analysis (PCA) and Kernel-PCA, in experiments on sequences of images (based on MNIST, with shifts injected through adversarial samples), chemical sensor measurements, and an environmental variable related to ozone levels. The empirical study reveals the potential of the proposed approach.
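To make the idea of monitoring shifts in a self-organizing latent space more concrete, the following is a minimal sketch and not the authors' implementation. It trains a small Self-Organizing Map on a reference window, summarizes each chunk by its per-sample distance to the best-matching unit, and compares the chunk mean against the reference mean with a two-sample z statistic. The function names (`train_som`, `descriptor`, `shift_score`), the choice of the quantization error as descriptor, the Welch-style statistic, and all hyperparameters are assumptions made for illustration; the synthetic Gaussian data stands in for a real stream.

```python
import numpy as np


def train_som(data, grid=(5, 5), epochs=10, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM; returns prototypes of shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of each unit, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize prototypes from randomly chosen training samples.
    weights = data[rng.choice(len(data), rows * cols, replace=True)].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                      # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # neighborhood radius shrinks to 0.5
            # Best-matching unit: the prototype closest to the sample.
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            grid_d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))    # Gaussian neighborhood on the grid
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights


def descriptor(chunk, weights):
    """Per-sample distance to the best-matching unit (an assumed descriptor)."""
    # Distance matrix between samples (rows) and SOM prototypes (columns).
    d = np.linalg.norm(chunk[:, None, :] - weights[None, :, :], axis=2)
    return d.min(axis=1)


def shift_score(ref_q, new_q):
    """Absolute two-sample z statistic on the means of two descriptor samples."""
    se = np.sqrt(ref_q.var(ddof=1) / len(ref_q) + new_q.var(ddof=1) / len(new_q))
    return abs(new_q.mean() - ref_q.mean()) / se


# Synthetic stand-in for a data stream: one reference window and two new chunks.
rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=(500, 8))
chunk_same = rng.normal(0.0, 1.0, size=(200, 8))      # same distribution
chunk_shifted = rng.normal(1.5, 1.0, size=(200, 8))   # mean-shifted distribution

som = train_som(reference)
ref_q = descriptor(reference, som)
score_same = shift_score(ref_q, descriptor(chunk_same, som))
score_shifted = shift_score(ref_q, descriptor(chunk_shifted, som))
print("score (no shift):", score_same)
print("score (shifted): ", score_shifted)
```

In this sketch the reference window stays fixed and every incoming chunk is scored against it; a chunk would be flagged when its score exceeds a threshold chosen by the user. Because the reference descriptor is computed on the SOM's own training data, the unshifted score carries a small optimistic bias, so a held-out reference window may be preferable in practice.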

Paper Structure

This paper contains 17 sections, 5 equations, and 11 figures.

Figures (11)

  • Figure 1: Visualization of how the data descriptor is built. The distance $d^{I}(\cdot)$ measures the similarity between two distributions directly from the raw data, whereas the distance $d^{M}(\cdot)$ measures it in the latent space. The non-linear projection made by a topographic map is denoted $\phi(\cdot)$.
  • Figure 2: Visual comparison of the significance of each of the moments for representing the information in the matrix $D$. The data corresponds to the problem of MNIST with adversarial samples in a CL setting.
  • Figure 3: Visualization of the proposed approach.
  • Figure 4: MNIST problem with adversarial samples: This example illustrates the transition between the sequence of images before and after the injected drift. The second row of images contains the sequence of adversarial samples.
  • Figure 5: Analysis of distribution shifts with fixed reference time windows: Each curve represents the shift monitoring for one generated data stream. We generated 30 data streams from the MNIST dataset and injected shifts using adversarial samples. The left panel was generated from data streams with chunks of 100 samples, and the right panel from data streams with chunks of 200 samples.
  • ...and 6 more figures