Torus embeddings

Dan Stowell

TL;DR

The torus embedding does not outperform hypersphere embeddings in general, but it is comparable, and it opens the possibility of training deep embeddings that have an extremely simple pathway to efficient `TinyML' embedded implementation.

Abstract

Many data representations are vectors of continuous values. In particular, deep learning embeddings are data-driven representations, typically either unconstrained in Euclidean space or constrained to a hypersphere. These may also be translated into integer representations (quantised) for efficient large-scale use. However, the fundamental (and most efficient) numeric representation in the overwhelming majority of existing computers is integers with overflow, and vectors of these integers correspond to neither of these spaces but instead to the topology of a (hyper)torus. This mismatch can lead to wasted representation capacity. Here we show that common deep learning frameworks can be adapted, quite simply, to create representations with inherent toroidal topology. We investigate two alternative strategies, demonstrating that a normalisation-based strategy leads to training with desirable stability and performance properties, comparable to a standard hyperspherical L2 normalisation. We also demonstrate that a torus embedding maintains desirable quantisation properties. The torus embedding does not outperform hypersphere embeddings in general, but it is comparable, and it opens the possibility of training deep embeddings that have an extremely simple pathway to efficient `TinyML' embedded implementation.
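To make the link between integers with overflow and toroidal topology concrete, the following is a minimal sketch, not the paper's implementation. It treats each coordinate as an integer on a ring of size 2^bits and measures distance along the shorter way around that ring. The function names `wrap_diff` and `torus_l1_distance` are hypothetical, and the wraparound L1 metric is an assumption chosen for illustration; the paper may use a different distance.

```python
def wrap_diff(a: int, b: int, bits: int = 8) -> int:
    """Shortest signed difference a - b on a ring of size 2**bits.

    The result lies in [-2**(bits-1), 2**(bits-1) - 1], which is what
    subtraction of fixed-width integers with overflow produces.
    """
    modulus = 1 << bits
    half = modulus >> 1
    return (a - b + half) % modulus - half


def torus_l1_distance(u, v, bits: int = 8) -> int:
    """Sum of per-coordinate wraparound distances (an assumed metric)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(wrap_diff(a, b, bits)) for a, b in zip(u, v))


# 250 and 3 are far apart on a line (|250 - 3| = 247) but close on the ring.
print(wrap_diff(250, 3))                        # -9
print(torus_l1_distance([250, 10], [3, 20]))    # 19
```

Because each coordinate wraps independently, a vector of such integers lives on a product of circles, which is a (hyper)torus rather than a Euclidean space or a hypersphere.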

Paper Structure (10 sections, 7 equations, 9 figures, 3 tables)

Introduction
Topologies of Deep Embedding Spaces
Training Torus Representations
Additional Training Considerations
Inference in Torus Embeddings
Experiments
Discussion
Conclusions
Distances in hyperspheres and hypertori
Spreading vectors: KoLeo regularisation and circular variance

Figures (9)

Figure 1: Extremely large gradient updates have different effects in the Clifford torus, depending on whether they are produced by L2 normalisation or by Clifford projection. This illustration shows a single update in a figurative Clifford torus (upper, circle) and a 'view' of the same as a flat torus (lower, square). Under L2 normalisation (left), a gradient update is applied in the extrinsic space (solid arrow) before normalisation (dashed arrow), with the side-effect that gradient updates of unbounded size have a stable impact: the resulting change is asymptotically limited to a 90 degree angle. In the Clifford-projected space (right), very large gradient updates simply 'wrap around', which means that very small deviations in large steps (e.g. floating point rounding errors) may lead to unstable updates.
Figure 2: The two methods for mapping data onto a hypertorus use very different mappings, as in these 2D illustrations. Left panel: $L_{2p}$ normalisation maps 2D points to their angle with respect to the origin. Centre and right panels: the Clifford projection transforms each dimension separately into a cyclical version of itself.
Figure 3: Classification performance (precision at 1) on the CIFAR10 and CIFAR100 data sets, using nearest-neighbour classification, comparing embedding spaces that are hyperspherical versus hypertoroidal. The x-axis shows variation in the embedding dimensionality, and (nested inside) variation in the strength of the KoLeo regularisation (the number is the $\log_{10}$ of the regularisation strength).
Figure 4: Classification performance as in Figure 3 (CIFAR100 data set), after quantisation by various methods. The 8-bit quantisation has only a mild impact on performance, but other methods degrade the classifier precision to varying extents.
Figure 5: Birdsong audio classification performance (accuracy) in a few-shot setting, comparing embedding spaces that are hyperspherical versus hypertoroidal. Scores shown are for each of six birdsong datasets provided by BIRB (left to right), with the overall average summarised in the rightmost panel. A dark gray line indicates chance performance for each dataset. The x-axis shows variation in the embedding dimensionality. Upper: 1-shot. Lower: 5-shot.
...and 4 more figures
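The captions of Figures 1 and 2 describe two ways of mapping data onto a hypertorus and how each reacts to a very large update. The sketch below illustrates that idea under stated assumptions: coordinates are grouped into consecutive pairs for the normalisation-based mapping, and the Clifford projection is taken to be a plain (cos, sin) map per coordinate with no rescaling. All function names are hypothetical, and the paper's exact definitions may differ.

```python
import math


def pairwise_l2_angles(x):
    """Angle of each consecutive coordinate pair with respect to the origin.

    Normalising a pair to unit length does not change its angle, so the
    angle alone describes the normalised point on its circle.
    """
    if len(x) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    return [math.atan2(x[i + 1], x[i]) for i in range(0, len(x), 2)]


def clifford_projection(x):
    """Map each coordinate t to a point (cos t, sin t) on its own circle."""
    return [(math.cos(t), math.sin(t)) for t in x]


def angle_after_l2_update(step: float) -> float:
    """Degrees moved when starting at (1, 0), stepping along the tangent
    (0, 1) by `step` in the extrinsic space, then renormalising."""
    return math.degrees(math.atan2(step, 1.0))


def wrapped_angle_after_update(angle: float, step: float) -> float:
    """Angle after adding `step` directly to an angular coordinate."""
    return (angle + step) % (2.0 * math.pi)


print(angle_after_l2_update(1.0))    # about 45 degrees
print(angle_after_l2_update(1e9))    # just under 90 degrees
print(wrapped_angle_after_update(0.0, 1e9))  # somewhere in [0, 2*pi)
```

In the first case the movement saturates as the step grows, whereas in the second the final position depends on the step modulo 2π, so a tiny relative error in a huge step can land anywhere on the circle.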
