Neural Speed Reading via Skim-RNN
Minjoon Seo, Sewon Min, Ali Farhadi, Hannaneh Hajishirzi
TL;DR
Skim-RNN adds a per-token skim/read decision: for tokens it judges unimportant, a small RNN updates only a portion of the hidden state, while a full RNN processes important tokens. The model is trained with a Gumbel-softmax reparameterization, which keeps the discrete decision differentiable, and an auxiliary skim loss. It reduces FLOPs substantially (up to 3x in classification and over 1.4x in QA) while maintaining or improving accuracy. Because it preserves the standard RNN interface, it can replace RNNs in existing models, and when run on a CPU it shows latency advantages over GPU baselines in several settings. The speed/accuracy trade-off is tunable at inference time, and the approach shows promise for efficient sequence modeling on both classification and question-answering tasks.
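To make the skim/read mechanism concrete, here is a minimal NumPy sketch of a single step with a hard decision, as it might look at inference time. It is an illustration under stated assumptions, not the authors' implementation: it uses plain tanh cells rather than a gated RNN, the small cell reads only the first `d_small` hidden units, and all names (`init_params`, `rnn_cell`, `skim_rnn_step`, `W_dec`, and so on) are hypothetical.

```python
import numpy as np

def init_params(d_in, d, d_small, seed=0):
    """Random parameters for a full cell, a small cell, and a decision layer."""
    rng = np.random.default_rng(seed)

    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))

    return {
        # Full RNN: updates all d hidden units.
        "W_f": mat(d, d_in), "U_f": mat(d, d), "b_f": np.zeros(d),
        # Small RNN: updates only the first d_small hidden units.
        "W_s": mat(d_small, d_in), "U_s": mat(d_small, d_small),
        "b_s": np.zeros(d_small),
        # Decision layer: two logits, index 0 = read, index 1 = skim.
        "W_dec": mat(2, d_in + d), "b_dec": np.zeros(2),
    }

def rnn_cell(x, h, W, U, b):
    """Plain tanh RNN cell (a simplification chosen for this sketch)."""
    return np.tanh(W @ x + U @ h + b)

def skim_rnn_step(x, h, p, d_small):
    """One step: pick read or skim, then update the hidden state accordingly."""
    logits = p["W_dec"] @ np.concatenate([x, h]) + p["b_dec"]
    skim = int(np.argmax(logits)) == 1
    if skim:
        # Skim: rewrite the first d_small units, copy the rest unchanged.
        h_new = h.copy()
        h_new[:d_small] = rnn_cell(x, h[:d_small], p["W_s"], p["U_s"], p["b_s"])
        return h_new, True
    # Read: the full cell rewrites the entire hidden state.
    return rnn_cell(x, h, p["W_f"], p["U_f"], p["b_f"]), False
```

In this sketch the skim branch multiplies by `d_small x d_small` and `d_small x d_in` matrices instead of `d x d` and `d x d_in`, which is where the FLOP savings would come from when `d_small` is much smaller than `d`. The returned state always has size `d`, so downstream layers see the same interface as with a standard RNN.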
Abstract
Inspired by the principles of speed reading, we introduce Skim-RNN, a recurrent neural network (RNN) that dynamically decides to update only a small fraction of the hidden state for relatively unimportant input tokens. Skim-RNN gives a computational advantage over an RNN that always updates the entire hidden state. Skim-RNN uses the same input and output interfaces as a standard RNN and can easily be used in place of RNNs in existing models. In our experiments, we show that Skim-RNN can achieve significantly reduced computational cost without losing accuracy compared to standard RNNs across five different natural language tasks. In addition, we demonstrate that the trade-off between accuracy and speed of Skim-RNN can be dynamically controlled at inference time in a stable manner. Our analysis also shows that Skim-RNN running on a single CPU offers lower latency compared to standard RNNs on GPUs.
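One plausible way to control the accuracy/speed trade-off at inference time is to compare the skim probability against an adjustable threshold instead of taking a plain argmax. The abstract does not specify the mechanism, so the following is only a sketch under that assumption; `decide_skim`, `skim_rate`, and the `threshold` parameter are hypothetical names, and the logits are made-up values for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def decide_skim(logits, threshold=0.5):
    """Skim when the skim probability (index 1) exceeds `threshold`."""
    p_skim = softmax(np.asarray(logits, dtype=float))[1]
    return bool(p_skim > threshold)

def skim_rate(all_logits, threshold):
    """Fraction of tokens that would be skimmed at a given threshold."""
    return float(np.mean([decide_skim(l, threshold) for l in all_logits]))

# Illustrative [read, skim] logits for three tokens (not real model outputs).
example_logits = [[0.0, 2.0], [0.0, 0.5], [0.0, -1.0]]
rates = {t: skim_rate(example_logits, t) for t in (0.2, 0.5, 0.7)}
```

Under this assumption, lowering the threshold skims more tokens (less computation, potentially lower accuracy), and raising it reads more tokens, without retraining the model.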
