Deep Networks With Large Output Spaces
Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, Jay Yagnik
TL;DR
This work addresses the cost of the final layer in neural networks with very large output spaces. Instead of computing a dot product against every class vector, it uses Winner-Take-All (WTA) hashing to retrieve a small set of top-$K$ candidate classes. Candidates are retrieved from hash tables built with $P$ permutations, and exact probabilities are computed only for those candidates, which permits efficient training with downpour SGD and sparse gradient updates. On ImageNet-21K, Skipgram, and Sports 1M, the approach yields substantial speedups (up to about 10x) with accuracy close to that of the full softmax, showing that it scales in practice to millions of output classes. The key contribution is a practical framework that uses locality-sensitive hashing to decouple compute cost from output-space size, with potential extensions to intermediate layers and larger convolutional filter banks.
Abstract
Deep neural networks have been extremely successful at various image, speech, and video recognition tasks because of their ability to model deep structures within the data. However, they are still prohibitively expensive to train and apply for problems containing millions of classes in the output layer. Based on the observation that the key computation common to most neural network layers is a vector/matrix product, we propose a fast locality-sensitive hashing technique to approximate the actual dot product, enabling us to scale up training and inference to millions of output classes. We evaluate our technique on three diverse large-scale recognition tasks and show that our approach can train large-scale models at a faster rate (in terms of steps/total time) compared to baseline methods.
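To make the idea concrete, the following is a minimal sketch of candidate retrieval with WTA hashing followed by a softmax restricted to the retrieved classes. It is an illustration under stated assumptions, not the authors' implementation: the function names, the banding of the hash code into several tables, the vote-counting rule, and all parameter values (`num_hashes`, `window`, `band`, `k`) are hypothetical choices made for this example.

```python
import math
import random
from collections import defaultdict


def make_permutations(dim, num_hashes, window, seed=0):
    """Each 'permutation' is kept as its first `window` indices only."""
    rng = random.Random(seed)
    return [rng.sample(range(dim), window) for _ in range(num_hashes)]


def wta_hash(vec, perms):
    """WTA code: per permutation, the window position holding the largest value."""
    return tuple(max(range(len(p)), key=lambda j: vec[p[j]]) for p in perms)


def build_tables(weights, perms, band):
    """Index every class vector; each band of `band` code entries keys one table.

    Assumption: banding the code this way is an illustrative choice.
    """
    tables = defaultdict(list)
    for cls, w in enumerate(weights):
        code = wta_hash(w, perms)
        for b in range(0, len(code), band):
            tables[(b, code[b:b + band])].append(cls)
    return tables


def top_k_candidates(x, tables, perms, band, k):
    """Return up to k classes, ranked by how many bands match the query's code."""
    code = wta_hash(x, perms)
    votes = defaultdict(int)
    for b in range(0, len(code), band):
        for cls in tables.get((b, code[b:b + band]), ()):
            votes[cls] += 1
    return sorted(votes, key=lambda c: (-votes[c], c))[:k]


def candidate_softmax(x, weights, candidates):
    """Exact dot products and a softmax, but only over the retrieved candidates."""
    if not candidates:
        return {}
    logits = [sum(a * b for a, b in zip(x, weights[c])) for c in candidates]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return {c: e / z for c, e in zip(candidates, exps)}


# Toy usage: 1000 random class vectors; the query is a copy of class 123.
rng = random.Random(1)
dim, num_classes = 32, 1000
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]
perms = make_permutations(dim, num_hashes=16, window=4)
tables = build_tables(weights, perms, band=2)

x = list(weights[123])
candidates = top_k_candidates(x, tables, perms, band=2, k=10)
probs = candidate_softmax(x, weights, candidates)
```

In this sketch the exact dot products are evaluated for at most `k` classes per query rather than for all classes, which is the sense in which hashing decouples compute from the size of the output layer. The retrieval itself depends only on the rank order of vector entries, not on their magnitudes.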
