Vocabulary-level Memory Efficiency for Language Model Fine-tuning

Miles Williams, Nikolaos Aletras

TL;DR

This work tackles the large memory footprint of fine-tuning language models by showing that most downstream tasks use only a small portion of the vocabulary. It introduces Partial Embedding Matrix Adaptation (PEMA), which excludes the embeddings that are not used during fine-tuning and merges them back afterward to preserve the complete vocabulary. Empirical results across GLUE, XNLI, and a range of monolingual and multilingual models show substantial reductions in embedding memory, with larger vocabularies and models yielding bigger savings, while maintaining task performance. The approach is orthogonal to existing memory-saving techniques and enables more efficient use of computational resources. It shows particular promise for multilingual settings, and the memory of the output embeddings is a direction for future exploration.

Abstract

The extensive memory footprint of language model (LM) fine-tuning poses a challenge for both researchers and practitioners. LMs use an embedding matrix to represent extensive vocabularies, forming a substantial proportion of the model parameters. While previous work towards memory-efficient fine-tuning has focused on minimizing the number of trainable parameters, reducing the memory footprint of the embedding matrix has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused during fine-tuning. We then propose a simple yet effective approach that leverages this finding to minimize memory usage. We show that our approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach does not impact downstream task performance, while allowing more efficient use of computational resources.
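
The idea described above (keep only the embedding rows for tokens that actually occur in the fine-tuning data, then merge the updated rows back into the full matrix afterward) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_partial_embeddings`, `remap`, and `merge_back` are hypothetical, the embedding matrix is a plain list of lists, and the "fine-tuning" step is a stand-in that merely perturbs the reduced rows.

```python
# Illustrative sketch only: all names here are assumptions, not the paper's code.

def build_partial_embeddings(embeddings, dataset_token_ids):
    """Keep only the embedding rows for token ids that occur in the dataset."""
    used_ids = sorted({tid for seq in dataset_token_ids for tid in seq})
    # Map each original token id to its row index in the reduced matrix.
    old_to_new = {old: new for new, old in enumerate(used_ids)}
    partial = [list(embeddings[old]) for old in used_ids]
    return partial, old_to_new


def remap(dataset_token_ids, old_to_new):
    """Rewrite token ids so they index into the reduced matrix."""
    return [[old_to_new[tid] for tid in seq] for seq in dataset_token_ids]


def merge_back(embeddings, partial, old_to_new):
    """Copy the (possibly updated) reduced rows back into a full-size matrix."""
    merged = [list(row) for row in embeddings]
    for old, new in old_to_new.items():
        merged[old] = list(partial[new])
    return merged


# Toy setup: a vocabulary of 8 tokens with 2-dimensional embeddings.
vocab_size = 8
embeddings = [[float(i), float(i) + 0.5] for i in range(vocab_size)]
dataset = [[1, 3, 3], [5, 1]]  # only token ids 1, 3, and 5 are ever used

partial, old_to_new = build_partial_embeddings(embeddings, dataset)
remapped = remap(dataset, old_to_new)

# Stand-in for fine-tuning: only the reduced matrix is held and updated.
for row in partial:
    row[0] += 1.0

merged = merge_back(embeddings, partial, old_to_new)
```

In this toy case only 3 of the 8 rows are held during the stand-in training step, and the merged matrix keeps the original values for the 5 unused tokens, so the full vocabulary remains available afterward.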

Paper Structure (33 sections, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Memory-efficient language model fine-tuning with Partial Embedding Matrix Adaptation (PEMA).
  • Figure 2: The trend in vocabulary use for the datasets in GLUE when using the vocabulary from GPT-2.