The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics

Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch

TL;DR

This work investigates vocabulary trimming (VT) for large language models: restricting the model's vocabulary to a target language using two heuristics, Unicode-based script filtering and corpus-based selection. In experiments on BLOOM and LLaMA-7B across Bulgarian, Chinese, English, and Spanish, the authors measure memory savings of up to ~50% for small models and speedups with an upper bound of ~25%, and they report inconsistent results across languages and diminishing returns for larger models. Non-Latin languages and code-mixed scenarios are limiting cases, and gains are generally smaller on larger models and on GPUs. Overall, VT is an efficiency technique that is orthogonal to other methods and can be combined with them. It is most useful for deployments restricted to specific languages, and broader language coverage needs further research.
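A back-of-the-envelope sketch of why the memory savings shrink with model size: the embedding matrix is a large share of a small model's parameters but a small share of a large one's. The vocabulary sizes, hidden sizes, and parameter counts below are round illustrative assumptions, not figures taken from the paper, and the helper names are hypothetical.

```python
def embedding_share(vocab_size: int, hidden_size: int, total_params: int) -> float:
    """Fraction of all parameters held by one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size / total_params


def params_after_trimming(vocab_size: int, kept_vocab: int,
                          hidden_size: int, total_params: int) -> int:
    """Parameter count once the embedding rows of removed tokens are dropped."""
    return total_params - (vocab_size - kept_vocab) * hidden_size


# Illustrative (assumed) configurations: a small and a large model
# sharing the same 250k-entry multilingual vocabulary.
small = dict(vocab_size=250_000, hidden_size=1024, total_params=560_000_000)
large = dict(vocab_size=250_000, hidden_size=4096, total_params=7_000_000_000)

small_share = embedding_share(**small)
large_share = embedding_share(**large)

# Assume trimming keeps 50k of the 250k tokens in the small model.
small_trimmed = params_after_trimming(kept_vocab=50_000, **small)
small_saving = 1 - small_trimmed / small["total_params"]

print(f"embedding share, small model: {small_share:.1%}")
print(f"embedding share, large model: {large_share:.1%}")
print(f"parameters saved in small model: {small_saving:.1%}")
```

Under these assumptions the embedding matrix is a far larger fraction of the small model than of the large one, so removing the same number of rows frees proportionally much less memory as the model grows.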

Abstract

Deploying large language models (LLMs) is challenging because of their intensive computational and memory requirements. Our research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest, to improve time and memory efficiency. While such modifications have proven effective in tasks like machine translation, tailoring them to LLMs demands specific adjustments given the diverse nature of LLM applications. We apply two language heuristics to trim the full vocabulary, Unicode-based script filtering and corpus-based selection, to different LLM families and sizes. The methods are straightforward, interpretable, and easy to implement. We find that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed. Yet we also reveal the limitations of these methods: they do not perform consistently well across languages, and returns diminish in larger models.
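The two heuristics named above can be sketched on a toy vocabulary. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy vocabulary, the whitespace tokenizer, and the rule that non-letter characters are always allowed are all hypothetical choices made here. Real subword tokens (for example byte-level BPE pieces) would first need decoding to text before a script check is meaningful.

```python
import unicodedata
from collections import Counter


def char_in_scripts(ch: str, script_prefixes: tuple) -> bool:
    """True if ch is a non-letter, or a letter whose Unicode name starts with an allowed script."""
    if not ch.isalpha():
        return True  # assumption: digits, punctuation and markers are script-neutral
    return unicodedata.name(ch, "").startswith(script_prefixes)


def unicode_filter(vocab: dict, script_prefixes: tuple = ("LATIN",)) -> dict:
    """Unicode-based script filtering: keep tokens made only of allowed-script characters."""
    return {tok: idx for tok, idx in vocab.items()
            if all(char_in_scripts(ch, script_prefixes) for ch in tok)}


def corpus_filter(vocab: dict, tokenize, corpus, always_keep=()) -> dict:
    """Corpus-based selection: keep tokens observed when tokenizing a target-language corpus."""
    seen = Counter(tok for line in corpus for tok in tokenize(line))
    keep = set(seen) | set(always_keep)
    return {tok: idx for tok, idx in vocab.items() if tok in keep}


# Toy vocabulary mapping token -> id (hypothetical).
vocab = {"<s>": 0, "hello": 1, "world": 2, "здравей": 3, "中文": 4, "!": 5, "hola": 6}

latin_vocab = unicode_filter(vocab, ("LATIN",))
cyrillic_vocab = unicode_filter(vocab, ("CYRILLIC",))
english_vocab = corpus_filter(vocab, str.split, ["hello world !"], always_keep=("<s>",))

print(sorted(latin_vocab))
print(sorted(cyrillic_vocab))
print(sorted(english_vocab))
```

In this sketch the script filter cannot tell English from Spanish (both `hello` and `hola` survive the Latin filter), whereas the corpus filter keeps only what the sample corpus actually used. In a real model, the surviving ids would also have to be remapped to a contiguous range and the matching rows of the embedding matrix selected.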

Paper Structure (20 sections, 4 tables)