VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models
Hanling Zhang, Yayu Zhou, Tongcheng Fang, Zhihang Yuan, Guohao Dai, Wanli Ouyang, Yu Wang
TL;DR
VocabTailor addresses memory bottlenecks in small language models by decoupling vocabulary components and employing a hybrid static-dynamic strategy guided by lexical locality and computation asymmetry. It offloads embeddings to CPU or disk-backed storage and maintains a compact, task-specific static LM head while dynamically loading input-relevant head tokens on demand, reducing memory usage substantially with minimal degradation in downstream tasks. Across five tasks, it achieves up to 99% reduction in vocabulary-related memory and demonstrates robust performance, particularly in information extraction and code-related generation, where input-token overlap is high. The approach offers practical benefits for on-device inference and edge deployments, enabling more memory-efficient yet flexible SLMs with broad applicability and tunable accuracy-memory trade-offs.
Abstract
Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of an SLM's memory footprint stems from vocabulary-related components, particularly the embedding and the language modeling (LM) head, due to large vocabulary sizes. Existing static vocabulary pruning reduces memory usage but relies on rigid, one-size-fits-all designs that lose information starting at the prefill stage and lack flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between the vocabulary-related components of SLMs. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints by offloading the embedding and applying a hybrid static-dynamic vocabulary selection strategy to the LM head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor reduces the memory usage of vocabulary-related components by up to 99% with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
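
To make the hybrid static-dynamic idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the dynamic part of the LM head is simply the set of token ids that appear in the input, and that the static part is a precomputed list of task-specific token ids. The function names (`lookup_embeddings`, `build_active_head`, `next_token`) and all array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def lookup_embeddings(embedding_table, input_ids):
    """Gather embedding rows for the input tokens.

    Assumption: embedding_table lives in host memory (or is an np.memmap
    backed by disk), so only the rows for the current input are copied.
    """
    return np.asarray(embedding_table[np.asarray(input_ids)])


def build_active_head(full_head, static_ids, input_ids):
    """Select the LM-head rows needed for one inference request.

    static_ids: task-specific token ids kept resident (hypothetical list).
    input_ids:  token ids of the current input, added on demand.
    Returns the sorted active token ids and their head weight rows.
    """
    # Union of the static set and the tokens present in the input.
    active_ids = np.union1d(static_ids, input_ids)
    # Only these rows are loaded; the rest of the head stays offloaded.
    active_weights = np.asarray(full_head[active_ids])
    return active_ids, active_weights


def next_token(hidden_state, active_ids, active_weights):
    """Greedy decoding step restricted to the active vocabulary."""
    logits = active_weights @ hidden_state
    # Map the position in the reduced head back to the original token id.
    return int(active_ids[np.argmax(logits)])
```

In this sketch the reduced head holds `len(active_ids)` rows instead of the full vocabulary, so its memory cost scales with the static set plus the distinct input tokens. Any token outside that union cannot be generated, which is why the size and choice of the static set control the accuracy-memory trade-off.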
