Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs

Chengyuan Liu; Shihang Wang; Lizhi Qing; Kun Kuang; Yangyang Kang; Changlong Sun; Fei Wu

TL;DR

VEGAD is an adaptive method that automatically identifies valuable words from a given domain vocabulary. Experiments on three Chinese datasets validate its effectiveness, and expanding with the selected subset improves performance on both domain-specific and general tasks.

Abstract

While Large Language Models (LLMs) demonstrate impressive generation abilities, they frequently struggle in specialized domains because of their limited domain-specific knowledge. Studies on domain-specific LLMs often expand the vocabulary before fine-tuning on a domain-specific corpus, aiming to decrease the sequence length and improve decoding efficiency, but they do not thoroughly investigate how vocabulary expansion affects LLMs across different domains. Our pilot study reveals that expanding with only a subset of the entire vocabulary may lead to superior performance. Guided by this finding, this paper explores how to identify a vocabulary subset that achieves the best results. We introduce VEGAD, an adaptive method that automatically identifies valuable words from a given domain vocabulary. Our method has been validated through experiments on three Chinese datasets, demonstrating its effectiveness. Additionally, we have undertaken comprehensive analyses of the method. Selecting an optimal subset for expansion is shown to enhance performance on both domain-specific tasks and general tasks, showcasing the potential of VEGAD.

Paper Structure (32 sections, 9 equations, 7 figures, 14 tables, 3 algorithms)

Introduction
Related Work
Method
Build Trie
Gradient Calculation
Vocabulary Selection
Experiments
Baselines
General LLM
SFT
DV
SPM
ATT_EG and PATT_EG
Jieba
Main Results
...and 17 more sections

Figures (7)

Figure 1: Pilot study: relative improvement over direct supervised fine-tuning (SFT) when adding vocabularies of different sizes.
Figure 2: Framework of VEGAD.
Figure 3: Gradient calculation for each candidate word. Given the Trie built from the candidate vocabulary, a pointer walks the Trie to check whether a sub-sequence of the input and output lies on a path from the root of the Trie to a leaf node. The trace of the pointer is illustrated by $V_i$ and the "pseudo-leaf node". Finally, the top $K$ words with the largest gradients are selected to construct the new vocabulary, which is used to resize the embedding layer and the language modeling head layer.
Figure 4: Relative improvement of VEGAD over direct SFT when adding vocabularies of different sizes.
Figure 5: Results comparison with 2-gram.
...and 2 more figures
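
The procedure summarized in the Figure 3 caption (build a Trie from the candidate vocabulary, match sub-sequences of the text against it, then keep the top $K$ words by gradient) can be illustrated with a minimal sketch. This is not the authors' implementation. The per-token scores below are hypothetical stand-ins for gradient magnitudes, summing them per matched word is an assumed aggregation rule, and the names `build_trie`, `score_candidates`, and `select_top_k` are illustrative only. The sketch also treats each character as one base token, which is a simplification.

```python
from collections import defaultdict

_END = object()  # sentinel key marking the end of a complete candidate word


def build_trie(words):
    """Build a nested-dict trie; each word is a sequence of base tokens."""
    root = {}
    for word in words:
        node = root
        for tok in word:
            node = node.setdefault(tok, {})
        node[_END] = word
    return root


def score_candidates(tokens, token_scores, trie):
    """Accumulate a score for every occurrence of every candidate word.

    token_scores[i] is a hypothetical per-token gradient magnitude;
    summing over the matched span is an assumption of this sketch.
    """
    totals = defaultdict(float)
    for start in range(len(tokens)):
        node = trie          # the pointer restarts at the root
        running = 0.0
        for pos in range(start, len(tokens)):
            node = node.get(tokens[pos])
            if node is None:  # no candidate continues along this path
                break
            running += token_scores[pos]
            if _END in node:  # a complete candidate word ends here
                totals[node[_END]] += running
    return totals


def select_top_k(totals, k):
    """Return the k words with the largest scores (ties broken by word)."""
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Toy candidate vocabulary: "高血压" (hypertension), its prefix "高血",
# and "糖尿病" (diabetes). Scores are made up for illustration.
candidates = ["高血压", "高血", "糖尿病"]
tokens = list("患者有高血压和糖尿病")
scores = [0.1, 0.1, 0.1, 0.5, 0.4, 0.3, 0.1, 0.2, 0.2, 0.2]

totals = score_candidates(tokens, scores, build_trie(candidates))
selected = select_top_k(totals, k=2)
```

In a real pipeline the selected words would then be added to the tokenizer, and the embedding layer and language modeling head would be resized to match, as the caption describes. That step depends on the model library and is omitted here.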
