A Coding-Theoretic Analysis of Hyperspherical Prototypical Learning Geometry

Martin Lindström, Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund

TL;DR

This work addresses the problem of designing class prototypes on the unit hypersphere to maximize separation in supervised representation learning. It introduces coding-theoretic constructions that map binary linear codes to hyperspherical prototypes, providing provable bounds via Gilbert-Varshamov and Rankin results and achieving near-orthogonality in many regimes, especially when the latent dimension satisfies $n \approx K/2$. It augments this with optimization-based prototype schemes using a convex log-sum-exp relaxation to approximate the nonconvex objective, enabling flexible trade-offs between dimension and separation. Empirical evaluation on CIFAR-100 and MNIST demonstrates that more dispersed prototypes tend to yield higher accuracy, though performance also hinges on the semantic alignment between classes and prototype assignments. Overall, the coding-theoretic approach offers scalable, near-optimal prototype designs across a broad range of dimensions, with clear directions for incorporating semantic information and extending to self-supervised learning contexts.

Abstract

Hyperspherical Prototypical Learning (HPL) is a supervised approach to representation learning that designs class prototypes on the unit hypersphere. The prototypes bias the representations to class separation in a scale invariant and known geometry. Previous approaches to HPL have either of the following shortcomings: (i) they follow an unprincipled optimisation procedure; or (ii) they are theoretically sound, but are constrained to only one possible latent dimension. In this paper, we address both shortcomings. To address (i), we present a principled optimisation procedure whose solution we show is optimal. To address (ii), we construct well-separated prototypes in a wide range of dimensions using linear block codes. Additionally, we give a full characterisation of the optimal prototype placement in terms of achievable and converse bounds, showing that our proposed methods are near-optimal.
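
The TL;DR describes the optimisation-based scheme as a log-sum-exp relaxation of the maximum pairwise cosine similarity. The sketch below is only a generic illustration of that idea, not the paper's formulation or solver, which are not reproduced in this extract. It assumes a smooth surrogate $\frac{1}{t}\log\sum_{i\neq j}\exp(t\,\langle p_i, p_j\rangle)$ minimised by projected gradient descent; the function names, the temperature `t`, the step size `lr`, and the step count are all hypothetical choices.

```python
import numpy as np

def max_pairwise_cosine(P):
    """Largest off-diagonal entry of the Gram matrix of unit-norm rows."""
    G = P @ P.T
    return G[~np.eye(len(P), dtype=bool)].max()

def optimise_prototypes(K, n, t=20.0, lr=0.05, steps=5000, seed=0):
    """Projected gradient descent on (1/t) * log(sum_{i != j} exp(t * <p_i, p_j>)).

    Illustrative assumption: the paper's exact objective and solver may differ.
    """
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((K, n))
    P /= np.linalg.norm(P, axis=1, keepdims=True)      # start on the unit sphere
    for _ in range(steps):
        G = P @ P.T
        np.fill_diagonal(G, -np.inf)                   # exclude the i == j terms
        W = np.exp(t * (G - G.max()))                  # numerically stable softmax weights
        W /= W.sum()
        P = P - lr * (2.0 * W @ P)                     # Euclidean gradient step (W is symmetric)
        P /= np.linalg.norm(P, axis=1, keepdims=True)  # project rows back onto the sphere
    return P

# Small example: K = 4 prototypes in n = 3 dimensions.
rng = np.random.default_rng(0)
P_init = rng.standard_normal((4, 3))
P_init /= np.linalg.norm(P_init, axis=1, keepdims=True)
initial_max_cos = max_pairwise_cosine(P_init)

P_opt = optimise_prototypes(K=4, n=3)
final_max_cos = max_pairwise_cosine(P_opt)
```

A larger `t` makes the surrogate track the true maximum more closely, at the cost of a stiffer problem that needs a smaller step size.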

Paper Structure

This paper contains 27 sections, 6 theorems, 17 equations, 11 figures.

Key Result

Lemma 2.1

There exists an $[n,k]$ code with minimum distance at least $d_\mathrm{min}$, provided that
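
The condition itself is missing from this extract. For orientation only, one standard textbook form of the Gilbert-Varshamov bound for binary linear codes is given here, on the assumption that the lemma concerns binary linear codes; the paper's exact statement may be phrased differently. It says an $[n,k]$ code with minimum distance at least $d_\mathrm{min}$ exists whenever

$$
\sum_{i=0}^{d_\mathrm{min}-2} \binom{n-1}{i} < 2^{\,n-k}.
$$

For example, with $n=7$ and $d_\mathrm{min}=3$ the left-hand side is $\binom{6}{0}+\binom{6}{1}=7$, so the condition holds for $k=4$ (since $7 < 2^{3}$), consistent with the existence of the $[7,4]$ Hamming code.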

Figures (11)

  • Figure 1: Prototypes on $\mathbb{S}^2$. The left image shows the naïve one-hot encoding approach, which has cosine similarity $0$. One can do much better with binary code-based prototypes. In the centre image, by reducing to two prototypes, the optimal distance with cosine similarity $-1$ is achieved. The right image shows how one can fit four prototypes with a better separation, arriving at a cosine similarity of $-1/3$.
  • Figure 2: Maximum pairwise cosine similarity of $K=100$ prototypes in various latent space dimensions (logarithmic scale). Coding-theoretic approaches provide additional flexibility over the mapping of Kasarla et al. (2022). The optimisation-based \ref{eq:lse_problem} prototypes achieve slightly better separation than the \ref{eq:mettes_problem} scheme. All schemes (except for RM codes with $n=32$) fall within the achievable and converse bounds from \ref{thm:sandwich_thm}. For large $n$, the BCH, RM, and \ref{eq:mettes_problem} prototypes are no worse than orthogonal prototypes. The \ref{eq:lse_problem} and Kasarla et al. (2022) prototypes achieve the converse bound and therefore perform slightly better.
  • Figure 3: Cosine similarity histograms for different prototype generation schemes for $K=100$ classes in dimension $n=16$. All schemes have average cosine similarity close to $0$. Note that there are $K/2$ RM code-based prototype pairs with cosine similarity $-1$ which have been omitted for clarity.
  • Figure 4: Top-1 accuracy results for CIFAR-100 for different prototype generation schemes, averaged over $5$ runs, with error bars corresponding to one standard deviation.
  • Figure 5: Comparison of accuracy on CIFAR-100 and maximum cosine similarity between the $K=100$ prototypes. More dissimilar prototypes are correlated with higher accuracy, but there is significant variance within, and between, models corresponding to different prototype generation schemes. Note that the same maximum similarity may correspond to different $n$; see \ref{fig:k100}. Interestingly, some BCH codes in lower dimension and with worse cosine similarity yield better performance than BCH codes in higher dimension.
  • ...and 6 more figures
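
The cosine similarities quoted in the Figure 1 caption can be checked with a short sketch. It assumes the usual antipodal mapping from bits to signs (bit $0 \mapsto +1$, bit $1 \mapsto -1$, scaled by $1/\sqrt{n}$), under which two codewords at Hamming distance $d$ have cosine similarity $1 - 2d/n$. This extract does not state the paper's exact mapping, so the mapping, the function names, and the two example codes (length-3 repetition and even-weight codes) are illustrative assumptions. The helper `simplex_lower_bound` encodes the standard fact that $K$ unit vectors cannot all have pairwise cosine below $-1/(K-1)$.

```python
import numpy as np

def codewords_to_prototypes(codewords):
    """Map binary codewords (rows of 0/1) to unit vectors via 0 -> +1, 1 -> -1."""
    c = np.asarray(codewords, dtype=float)
    n = c.shape[1]
    return (1.0 - 2.0 * c) / np.sqrt(n)

def cosine_from_hamming(d, n):
    """Cosine similarity of two mapped codewords at Hamming distance d."""
    return 1.0 - 2.0 * d / n

def max_pairwise_cosine(prototypes):
    """Largest off-diagonal entry of the Gram matrix."""
    gram = prototypes @ prototypes.T
    return gram[~np.eye(len(prototypes), dtype=bool)].max()

def simplex_lower_bound(K):
    """No K unit vectors can have maximum pairwise cosine below -1/(K-1)."""
    return -1.0 / (K - 1)

# Left panel: one-hot prototypes in three dimensions.
one_hot = np.eye(3)
# Centre panel: length-3 repetition code, two codewords.
repetition = [[0, 0, 0], [1, 1, 1]]
# Right panel: length-3 even-weight code, four codewords at pairwise distance 2.
even_weight = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]

cos_one_hot = max_pairwise_cosine(one_hot)
cos_two = max_pairwise_cosine(codewords_to_prototypes(repetition))
cos_four = max_pairwise_cosine(codewords_to_prototypes(even_weight))
```

Under these assumptions the three values are $0$, $-1$, and $-1/3$, matching the caption, and the last two meet the simplex bound for $K=2$ and $K=4$.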

Theorems & Definitions (12)

  • Lemma 2.1: Gilbert-Varshamov Bound
  • Remark 2.2
  • Proposition 3.1
  • proof
  • Lemma 3.2: Separation Guarantees for RM Codes
  • proof
  • Remark 3.3
  • Lemma 3.4: Rankin Bound
  • Theorem 3.5
  • proof
  • ...and 2 more