Performance of Gaussian Mixture Model Classifiers on Embedded Feature Spaces

Jeremy Chopin, Rozenn Dahyot

TL;DR

On the tested embedded spaces, one Gaussian component per class in the GMMs is often enough to capture each class, and ImageBind often performs better than CLIP for classifying image datasets, even when the embedded spaces are compressed using PCA.

Abstract

Data embeddings with CLIP and ImageBind provide powerful features for the analysis of multimedia and/or multimodal data. We assess their performance for classification using a layer based on Gaussian Mixture Models (GMMs) as an alternative to the standard Softmax layer. GMM-based classifiers have recently been shown to perform well as part of deep learning pipelines trained end-to-end. Our first contribution is to investigate GMM-based classification performance on the embedded spaces of CLIP and ImageBind. Our second contribution is a GMM-based classifier of our own with a lower parameter count than previously proposed ones. We find that, on the tested embedded spaces, one Gaussian component in the GMMs is often enough to capture each class, and we hypothesize that this may be due to the contrastive loss used to train these embedded spaces, which naturally concentrates the features of each class together. We also observe that ImageBind often performs better than CLIP for classifying image datasets, even when the embedded spaces are compressed using PCA.
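
To make the idea of a one-component-per-class classifier concrete, here is a minimal sketch. It is not the authors' classifier. It assumes diagonal covariances, class priors estimated from label frequencies, and synthetic features standing in for CLIP or ImageBind embeddings. All function names (`fit_class_gaussians`, `log_joint`, `predict`, `make_toy_features`, `accuracy`) are hypothetical. Each class is modeled by a single Gaussian, and a sample is assigned to the class with the highest log prior plus log likelihood.

```python
import math
import random


def fit_class_gaussians(features, labels, var_floor=1e-6):
    """Fit one diagonal-covariance Gaussian per class (an assumed simplification).

    Returns {label: (log_prior, mean, var)}.
    """
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    params = {}
    for y, xs in by_class.items():
        n, dim = len(xs), len(xs[0])
        mean = [sum(x[d] for x in xs) / n for d in range(dim)]
        # Maximum-likelihood variance per dimension, floored for stability.
        var = [sum((x[d] - mean[d]) ** 2 for x in xs) / n + var_floor
               for d in range(dim)]
        params[y] = (math.log(n / len(features)), mean, var)
    return params


def log_joint(x, log_prior, mean, var):
    """Log prior plus diagonal-Gaussian log likelihood of x."""
    log_lik = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                  for xi, m, v in zip(x, mean, var))
    return log_prior + log_lik


def predict(params, x):
    """Return the class whose single Gaussian best explains x."""
    return max(params, key=lambda y: log_joint(x, *params[y]))


def make_toy_features(n_per_class=200, dim=8, seed=0):
    """Synthetic, well-separated clusters used in place of real embeddings."""
    rng = random.Random(seed)
    centers = {0: 0.0, 1: 3.0, 2: -3.0}
    features, labels = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            features.append([rng.gauss(center, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


def accuracy(params, features, labels):
    hits = sum(predict(params, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


if __name__ == "__main__":
    train_x, train_y = make_toy_features(seed=0)
    test_x, test_y = make_toy_features(seed=1)
    model = fit_class_gaussians(train_x, train_y)
    print("held-out accuracy:", accuracy(model, test_x, test_y))
```

With real embeddings, the toy features would be replaced by precomputed feature vectors, optionally after PCA compression. A full-covariance or multi-component model would need a linear algebra library and is outside the scope of this sketch.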

Paper Structure

This paper contains 32 sections, 10 equations, 1 figure, 8 tables.

Figures (1)

  • Figure 1: Evolution of the accuracy as a function of the cumulative variance ratio kept (in %) after PCA decomposition of the features provided by the pretrained embedding spaces CLIP (solid lines) and ImageBind (dashed lines). Results are shown for the CIFAR100 dataset ((a) and (b)) and the ESC-50 dataset ((c) and (d)); 20 different percentages (from 5% to 100%) were used to plot the curves (cf. Sec. \ref{sec:PCA}).