VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

Hanling Zhang, Yayu Zhou, Tongcheng Fang, Zhihang Yuan, Guohao Dai, Wanli Ouyang, Yu Wang

TL;DR

VocabTailor addresses memory bottlenecks in small language models by decoupling vocabulary components and employing a hybrid static-dynamic strategy guided by lexical locality and computation asymmetry. It offloads embeddings to CPU or disk-backed storage and maintains a compact, task-specific static LM head while dynamically loading input-relevant head tokens on demand, reducing memory usage substantially with minimal degradation in downstream tasks. Across five tasks, it achieves up to 99% reduction in vocabulary-related memory and demonstrates robust performance, particularly in information extraction and code-related generation, where input-token overlap is high. The approach offers practical benefits for on-device inference and edge deployments, enabling more memory-efficient yet flexible SLMs with broad applicability and tunable accuracy-memory trade-offs.

Abstract

Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of an SLM's memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning reduces memory usage but suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between the vocabulary-related components of SLMs. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints by offloading the embedding and applying a hybrid static-dynamic vocabulary selection strategy to the LM head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
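
The hybrid static-dynamic selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`select_vocab`, `reduced_logits`, `greedy_token`), the dictionary-based head, and the tiny numbers are all assumptions made for illustration. The idea shown is only that the LM head scores the union of a fixed task-specific token set and the tokens that appear in the current input, then maps the winner back to its original token id.

```python
from typing import Dict, List, Sequence


def select_vocab(static_ids: Sequence[int], input_ids: Sequence[int]) -> List[int]:
    """Active vocabulary = static task tokens plus tokens seen in the input."""
    return sorted(set(static_ids) | set(input_ids))


def reduced_logits(
    hidden: Sequence[float],
    head_rows: Dict[int, Sequence[float]],
    active_ids: Sequence[int],
) -> List[float]:
    """Dot the hidden state with the LM head rows of the active tokens only."""
    return [sum(h * w for h, w in zip(hidden, head_rows[t])) for t in active_ids]


def greedy_token(
    hidden: Sequence[float],
    head_rows: Dict[int, Sequence[float]],
    active_ids: Sequence[int],
) -> int:
    """Pick the best-scoring active token and return its ORIGINAL token id."""
    logits = reduced_logits(hidden, head_rows, active_ids)
    best = max(range(len(logits)), key=logits.__getitem__)
    return active_ids[best]


# Hypothetical 6-token vocabulary with a 2-dimensional hidden state.
full_head = {
    0: [0.1, 0.0], 1: [0.2, 0.0], 2: [0.9, 0.0],
    3: [0.3, 0.0], 4: [0.5, 0.0], 5: [0.0, 1.0],
}
active = select_vocab(static_ids=[0, 1], input_ids=[4, 4, 1])
choice = greedy_token([1.0, 0.0], full_head, active)
```

In this toy setup only three of the six head rows are scored. Token 2 would win under the full head but is not in the active set, which is the kind of accuracy-memory trade-off the static set and the input-driven set are meant to balance.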

Paper Structure

This paper contains 40 sections, 2 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of VocabTailor
  • Figure 2: Left: Input-output lexical overlap ratio. Right: Example of lexical overlap in a summarization task.
  • Figure 3: Comparison of dynamic LM head construction workflows. Left panel (Vanilla): The naive re-allocation approach where weights are concatenated and a new Linear module is initialized per request, incurring high movement and initialization costs. Middle panel (SplitLinear): A decoupled architecture where static and dynamic weights form independent Linear modules ($M_1, M_2$), allowing the static part to be pre-initialized on the GPU. Right panel (PreAlloc): Our optimized strategy using a pre-warmed GPU buffer. Static weights remain stationary while dynamic weights are copied into a zero-initialized buffer, eliminating module re-initialization and minimizing memory movement overhead.
  • Figure 4: Latency decomposition across different dynamic LM head construction approaches. PreAlloc (bottom) successfully reduces the total prefill time from 12.43s (Vanilla) back to 0.37s, effectively matching the latency profile of the original model.
  • Figure 5: Overview of the VocabTailor framework with disk-backed embedding offloading. The embedding layer is replaced with a custom embedding layer that retrieves the embeddings of the requested tokens from an LMDB store when its forward pass is called.
  • ...and 2 more figures
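
The PreAlloc idea from the Figure 3 caption (static weights stay in place, dynamic weights are copied into a pre-allocated zero-initialized buffer) can be sketched in plain Python. This is a hedged illustration, not the paper's code: the class name `PreAllocHead`, its methods, and the list-of-lists buffer standing in for a GPU tensor are all hypothetical.

```python
from typing import Dict, List, Sequence


class PreAllocHead:
    """Toy LM head: static rows stay put, dynamic rows are copied into spare slots."""

    def __init__(self, static_rows: Dict[int, Sequence[float]],
                 max_dynamic: int, hidden_dim: int) -> None:
        self.static_ids: List[int] = list(static_rows)
        self.n_static = len(self.static_ids)
        self.max_dynamic = max_dynamic
        # Allocated once: static rows first, then zero-initialized dynamic slots.
        self.buffer: List[List[float]] = [list(static_rows[t]) for t in self.static_ids]
        self.buffer += [[0.0] * hidden_dim for _ in range(max_dynamic)]
        self.dynamic_ids: List[int] = []

    def load_dynamic(self, rows: Dict[int, Sequence[float]]) -> None:
        """Overwrite dynamic slots in place for a new request (no re-allocation)."""
        new_ids = [t for t in rows if t not in self.static_ids]
        if len(new_ids) > self.max_dynamic:
            raise ValueError("more dynamic tokens than pre-allocated slots")
        for slot, tok in enumerate(new_ids):
            # Slice assignment reuses the existing row object in the buffer.
            self.buffer[self.n_static + slot][:] = rows[tok]
        self.dynamic_ids = new_ids

    def logits(self, hidden: Sequence[float]) -> Dict[int, float]:
        """Score static tokens and currently loaded dynamic tokens only."""
        ids = self.static_ids + self.dynamic_ids
        return {
            tok: sum(h * w for h, w in zip(hidden, self.buffer[i]))
            for i, tok in enumerate(ids)
        }


head = PreAllocHead({0: [1.0, 0.0], 1: [0.0, 1.0]}, max_dynamic=2, hidden_dim=2)
head.load_dynamic({7: [2.0, 2.0], 0: [9.0, 9.0]})  # token 0 is already static
scores = head.logits([1.0, 1.0])
```

The buffer is created once in the constructor and only its dynamic rows are rewritten per request, which mirrors the caption's point about avoiding module re-initialization. Slots left over from an earlier request are not scored because only the currently loaded ids are iterated.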