Deep Networks With Large Output Spaces

Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, Jay Yagnik

TL;DR

This work tackles the bottleneck of large-output-space neural networks by replacing exhaustive final-layer dot-product computations with a Winner-Take-All hashing scheme that retrieves a small top-$K$ set of candidate classes. Using hash tables and $P$ permutations, the method computes exact probabilities only for these candidates, enabling efficient training with downpour SGD and sparse gradient updates. Across Imagenet-21K, Skipgram, and Sports 1M, the approach yields substantial speedups (up to about 10x) with accuracy close to full softmax, demonstrating practical scalability for millions of output classes. The key contribution is a practical, scalable framework that leverages locality-sensitive hashing to decouple output space size from compute, with potential extensions to intermediate layers and larger convolutional filter banks.
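
The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard Winner-Take-All hash (permute the vector, keep the first few entries, record the index of the largest) and uses one hash table per permutation with simple vote counting. The names `wta_codes`, `build_tables`, `retrieve`, and the `window` parameter are hypothetical.

```python
import numpy as np
from collections import defaultdict

def wta_codes(v, perms, window):
    """One WTA code per permutation: argmax over the first `window` permuted entries."""
    # perms has shape (P, d), so v[perms] has shape (P, d)
    return np.argmax(v[perms][:, :window], axis=1)

def build_tables(W, perms, window):
    """One hash table per permutation, mapping a code to the class ids that share it."""
    tables = [defaultdict(list) for _ in range(len(perms))]
    for cls, w in enumerate(W):
        for p, code in enumerate(wta_codes(w, perms, window)):
            tables[p][int(code)].append(cls)
    return tables

def retrieve(x, tables, perms, window, top_k, n_classes):
    """Return the top_k classes whose codes collide most often with those of x."""
    votes = np.zeros(n_classes, dtype=np.int64)
    for p, code in enumerate(wta_codes(x, perms, window)):
        bucket = np.asarray(tables[p].get(int(code), []), dtype=np.int64)
        votes[bucket] += 1  # each class appears at most once per table
    return np.argsort(-votes, kind="stable")[:top_k]

# Assumed toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_classes, d, P, window = 50, 16, 32, 4
W = rng.standard_normal((n_classes, d))                    # one weight row per class
perms = np.stack([rng.permutation(d) for _ in range(P)])   # P random permutations
tables = build_tables(W, perms, window)
candidates = retrieve(W[7], tables, perms, window, top_k=5, n_classes=n_classes)
```

Because each code depends only on the ordering of the entries, not their magnitudes, vectors with similar rank structure tend to collide in many tables, which is what lets collision counts stand in for dot-product ranking in this sketch.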

Abstract

Deep neural networks have been extremely successful at various image, speech, and video recognition tasks because of their ability to model deep structures within the data. However, they are still prohibitively expensive to train and apply for problems containing millions of classes in the output layer. Based on the observation that the key computation common to most neural network layers is a vector/matrix product, we propose a fast locality-sensitive hashing technique to approximate the actual dot product, enabling us to scale up training and inference to millions of output classes. We evaluate our technique on three diverse large-scale recognition tasks and show that our approach can train large-scale models at a faster rate (in terms of steps/total time) compared to baseline methods.

Paper Structure

This paper contains 12 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: A diagram comparing a typical classification network trained with softmax with the proposed WTA softmax. The left column shows the operations of softmax, $f(WX + b)$. $X$ are the input network activations, $f(\cdot)$ is the softmax activation function, $b$ are biases for each of $N$ classes and $W$ is an $N \times d$ matrix where each row is the weight vector associated with an individual class. The matrix product $WX$ is the most expensive operation for the entire network when the number of classes $N$ is extremely large. The right column diagrams the WTA softmax operation. The hashing operation identifies the $K \ll N$ most likely labels for a given $X$. The remainder of the WTA softmax operations are largely identical, although they only operate on the $K \ll N$ likely labels.
  • Figure 2: A schematic describing the use of the hash table during inference and training. The learned parameter vectors $W$ are stored in hash tables using WTA hashing and the hash codes of the input vector $x$ are used to retrieve the top $K$ classes with the largest dot products with the input vector. The actual dot product and the corresponding probabilities are computed for only these retrieved classes. Similarly, during the backward pass the gradients are computed based on the top $K$ retrieved nodes and the parameter vectors are updated.
  • Figure 3: Time taken by the WTA softmax layer and regular softmax layer alone for various values of batch size and top $K$ for a prediction space of 21K classes during inference. WTA provides significant speed-ups over softmax for small batch sizes and small values of $K$. Note that due to the sublinear nature of hash retrieval, the speed-ups will be larger for bigger problems.
  • Figure 4: Accuracies obtained by the WTA model as the number of retrieved classes, $K$, is varied from 30 to 3000. Even with as few as 30 classes the WTA model is able to reach $83\%$ of the accuracy of the baseline model and almost reaches the baseline accuracy for $K = 3000$. Note that this result uses WTA to just approximate an already trained network and hence the ceiling is the accuracy of the base network.
  • Figure 5: Trade-off between the speed-up achieved over the baseline softmax model at a fixed batch size and the percentage of the baseline accuracy reached. WTA achieves a speed-up of 10x at $90\%$ of the baseline accuracy.
  • ...and 4 more figures
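
The forward and backward passes described in Figures 1 and 2 can be sketched as below. This is a rough illustration under stated assumptions rather than the paper's code: the candidate set is taken as given (from any retrieval step), the target class is assumed to be among the candidates, and the loss is assumed to be cross-entropy. The function names `candidate_softmax` and `sparse_grad` are hypothetical.

```python
import numpy as np

def candidate_softmax(W, b, x, candidates):
    """Softmax restricted to `candidates`; all other classes are treated as probability 0."""
    logits = W[candidates] @ x + b[candidates]   # K dot products instead of N
    logits -= logits.max()                       # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sparse_grad(W, b, x, candidates, target):
    """Cross-entropy gradient for the candidate rows of W only (assumes target in candidates)."""
    p = candidate_softmax(W, b, x, candidates)
    one_hot = (candidates == target).astype(float)
    return np.outer(p - one_hot, x)              # shape (K, d), one row per candidate

# Assumed toy sizes and a hand-picked candidate set, for illustration only.
rng = np.random.default_rng(0)
N, d = 1000, 8
W = rng.standard_normal((N, d))
b = rng.standard_normal(N)
x = rng.standard_normal(d)
candidates = np.array([3, 17, 256, 999])
probs = candidate_softmax(W, b, x, candidates)
grad = sparse_grad(W, b, x, candidates, target=17)
```

Only the $K$ rows of $W$ indexed by `candidates` are read in the forward pass and receive a gradient in the backward pass, so the per-example cost scales with $K$ rather than $N$.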