Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces

Ben Fauber

TL;DR

This work demonstrates that Gini coefficients can serve as a unified metric for evaluating many-versus-many (all-to-all) similarity in vector spaces, and that selecting exemplary, iconic training samples with higher Gini coefficients leads to significantly better model performance.

Abstract

We demonstrate that Gini coefficients can be used as a unified metric to evaluate many-versus-many (all-to-all) similarity in vector spaces. Our analysis of various image datasets shows that images with the highest Gini coefficients tend to be the most similar to one another, while images with the lowest Gini coefficients are the least similar. We also show that this relationship holds for vectorized text embeddings from various corpora, highlighting the consistency of our method and its broad applicability across different types of data. Additionally, we demonstrate that selecting machine learning training samples that closely match the distribution of the testing dataset is far more important than ensuring data diversity. Selecting exemplary and iconic training samples with higher Gini coefficients leads to significantly better model performance than simply using a diverse training set with lower Gini coefficients. Thus, Gini coefficients can serve as effective criteria for selecting machine learning training samples, with our selection method outperforming random sampling in very sparse information settings.

Paper Structure

This paper contains 17 sections, 15 figures, 1 table.

Figures (15)

  • Figure 1: Illustration of our proposal: the Gini coefficient can serve as a unified, singular metric to assess the many-versus-many similarity of vectors. The calculation involves two areas: the area between the line of equality and the Lorenz curve, $A$ (light blue), and the area under the Lorenz curve, $B$ (dark blue). The Gini coefficient for many vectors is the ratio $A/(A+B)$.
  • Figure 2: Kernel density plot showing the Gini coefficients for each class in the MNIST training dataset, which contains 60,000 instances in total and approximately 6,000 instances per class. The Gini coefficients were calculated using the flattened raw pixel values ($d = 784$) ranging from 0 to 255 for each $28 \times 28$ grayscale image. Each $d$-dimensional image vector was $\ell_2$-normalized before computing the similarity values and Gini coefficients. The Gini coefficients were min-max normalized to $[0,1]$ to allow for comparison across classes.
  • Figure 3: The 24 examples with the lowest Gini coefficients for each class in the MNIST training dataset, which contains 60,000 instances in total and approximately 6,000 instances per class. The Gini coefficients were calculated using the flattened raw pixel values ($d = 784$) ranging from 0 to 255 for each $28 \times 28$ grayscale image. Each $d$-dimensional image vector was $\ell_2$-normalized before computing the similarity values and Gini coefficients. The Gini coefficients were min-max normalized to $[0,1]$ to allow for comparison across classes.
  • Figure 4: The 24 examples with the highest Gini coefficients for each class in the MNIST training dataset, which contains 60,000 instances in total and approximately 6,000 instances per class. The Gini coefficients were calculated using the flattened raw pixel values ($d = 784$) ranging from 0 to 255 for each $28 \times 28$ grayscale image. Each $d$-dimensional image vector was $\ell_2$-normalized before computing the similarity values and Gini coefficients. The Gini coefficients were min-max normalized to $[0,1]$ to allow for comparison across classes.
  • Figure 5: The 24 examples with the lowest Gini coefficients for each class in the Fashion-MNIST training dataset, which contains 60,000 instances in total and approximately 6,000 instances per class. The Gini coefficients were calculated using the flattened raw pixel values ($d = 784$) ranging from 0 to 255 for each $28 \times 28$ grayscale image. Each $d$-dimensional image vector was $\ell_2$-normalized before computing the similarity values and Gini coefficients. The Gini coefficients were min-max normalized to $[0,1]$ to allow for comparison across classes.
  • ...and 10 more figures
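
The pipeline described in the figure captions above (flatten, $\ell_2$-normalize, compute similarities, compute Gini coefficients, min-max normalize) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes that the similarity is cosine similarity, that each sample's Gini coefficient is computed over its similarities to every other sample in the same class, and that all similarities are non-negative (as they are for raw pixel values). The function names `gini` and `per_sample_gini` are hypothetical.

```python
import numpy as np


def gini(values):
    """Gini coefficient of non-negative values, computed as A / (A + B).

    B is the area under the Lorenz curve and A is the area between the
    line of equality and the Lorenz curve, so A + B = 0.5.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    # Lorenz curve: cumulative share of the total, starting at (0, 0).
    lorenz = np.concatenate(([0.0], np.cumsum(x) / total))
    # Trapezoid rule with n equal-width segments of width 1 / n.
    b = np.sum((lorenz[1:] + lorenz[:-1]) / 2.0) / n
    a = 0.5 - b
    return a / (a + b)


def per_sample_gini(vectors):
    """Min-max normalized Gini coefficient for each row of `vectors`.

    Assumption (not confirmed by the source): each sample's Gini
    coefficient is taken over its cosine similarities to all other samples.
    """
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)   # l2-normalize each row
    sims = X @ X.T                             # all-to-all cosine similarity
    n = X.shape[0]
    # Drop the diagonal so self-similarity does not enter the calculation.
    off_diag = sims[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    g = np.array([gini(row) for row in off_diag])
    span = g.max() - g.min()
    return (g - g.min()) / span if span > 0 else np.zeros_like(g)


if __name__ == "__main__":
    # Stand-in for one class of flattened 28 x 28 images with values 0..255.
    rng = np.random.default_rng(0)
    images = rng.integers(0, 256, size=(100, 784))
    scores = per_sample_gini(images)
    print("indices of the 5 highest-Gini samples:", np.argsort(scores)[-5:])
```

Sorting `scores` in ascending or descending order would then give the lowest- and highest-Gini examples for a class, in the manner of Figures 3 to 5. The all-to-all similarity matrix grows quadratically with the number of samples, so a class of roughly 6,000 images is manageable in memory, while much larger sets would need batching.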