
Why not to use Cosine Similarity between Label Representations

Beatrix M. G. Nielsen

Abstract

Cosine similarity is often used to measure the similarity of vectors, for example the representations learned by neural network models. However, the cosine similarity of model representations is not guaranteed to tell us anything about model behaviour. In this paper we show that for a softmax classifier, be it an image classifier or an autoregressive language model, the cosine similarity between label representations (called unembeddings in the paper) gives no information about the probabilities assigned by the model. Specifically, we prove that for any softmax classifier and any two label representations, it is possible to construct another model which gives the same probabilities for all labels and inputs, but where the cosine similarity between those two representations is either 1 or -1. We give specific examples of models with very high or low cosine similarity between representations and show how we can make equivalent models where the cosine similarity is instead -1 or 1. This translation ambiguity can be removed by centering the label representations; however, labels whose representations have low cosine similarity can still have high probability for the same inputs. Fixing the length of the representations still does not guarantee that high (or low) cosine similarity implies high (or low) probability for the corresponding labels on the same inputs. This means that when working with softmax classifiers, cosine similarity values between label representations should not be used to explain model probabilities.

Paper Structure

This paper contains 10 sections, 4 theorems, 15 equations, and 2 figures.

Key Result

Lemma 2.1

Let $\mathbf{v}\in \mathbb{R}^d$. For a softmax classifier as in def:softmax_classifier, adding $\mathbf{v}$ to all unembeddings does not change the probability $p(y\vert \mathbf{x})$.
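The lemma can be checked numerically. The sketch below (not from the paper) assumes the standard linear softmax head, where the logit for label $y$ is the dot product of the input embedding with the unembedding $\mathbf{u}_y$. Choosing the translation $\mathbf{v} = -(\mathbf{u}_0 + \mathbf{u}_1)/2$ makes the shifted unembeddings for labels $0$ and $1$ exact negatives of each other, so their cosine similarity becomes $-1$ while every probability is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Arbitrary unembeddings U (one row per label) and input embeddings H.
U = rng.normal(size=(5, 8))    # 5 labels, 8-dimensional representations
H = rng.normal(size=(100, 8))  # 100 inputs

# Translate all unembeddings by v = -(u_0 + u_1)/2, so that
# u_0 + v = -(u_1 + v), i.e. cosine similarity between them becomes -1.
v = -(U[0] + U[1]) / 2
U_shifted = U + v

p = softmax(H @ U.T)
p_shifted = softmax(H @ U_shifted.T)

print(np.allclose(p, p_shifted))            # True: probabilities unchanged
print(cosine(U[0], U[1]))                   # whatever it was originally
print(cosine(U_shifted[0], U_shifted[1]))   # now -1
```

The probabilities agree because the shift adds the same constant $\mathbf{v}\cdot\mathbf{h}$ to every logit for a given input, and softmax is invariant to a common additive constant.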

Figures (2)

  • Figure 1: Example of three models which give the same probabilities to all labels for all inputs, but where the cosine similarities between the unembeddings differ. a) Embeddings coloured by highest-probability label. These are the same for all three models. b) Unembeddings for model 1. Cosine between labels $0$ and $1$ is about $0.8$. c) Unembeddings for model 2. Cosine between labels $0$ and $1$ is $-1$. d) Unembeddings for model 3. Cosine between labels $0$ and $1$ is $1$.
  • Figure 2: Examples of models with centered unembeddings. In both cases, we see that a high (or low) cosine similarity between unembeddings does not guarantee that the corresponding labels will have high (or low) probability for possible inputs. Left: These unembeddings are centered. Right: These unembeddings are centered and have length $1$.
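The phenomenon in Figure 2 is easy to reproduce. In the sketch below (an illustrative construction, not the paper's example), three centered unembeddings are chosen so that labels $0$ and $1$ point in nearly opposite directions (cosine similarity about $-0.98$), yet for a suitable input both labels receive probability close to $0.5$ each, the highest two labels can jointly attain:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Three unembeddings that sum to zero, i.e. they are centered.
u0 = np.array([ 1.0,  0.1])
u1 = np.array([-1.0,  0.1])
u2 = np.array([ 0.0, -0.2])
U = np.stack([u0, u1, u2])
assert np.allclose(U.mean(axis=0), 0.0)  # centered

print(cosine(u0, u1))  # about -0.98: nearly opposite directions

# For an input along the second axis, labels 0 and 1 nevertheless
# split almost all of the probability mass between them.
h = np.array([0.0, 50.0])
p = softmax(U @ h)
print(p)  # roughly [0.5, 0.5, ~0]
```

This is exactly the failure mode the caption describes: centering removes the translation ambiguity, but low cosine similarity between two unembeddings still does not preclude both labels from being highly probable for the same input.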

Theorems & Definitions

  • Lemma 2.1 (with proof)
  • Lemma 2.2 (with proof)
  • Lemma 2.3 (with proof)
  • Theorem 2.4 (with proof)