IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining

Dawei Feng, Yihai Zhang, Zhixuan Xu

TL;DR

IGOT addresses domain adaptation for LLMs by tailoring tokenization to domain-specific vocabulary, enabling more efficient domain-adaptive pretraining. It defines token information gain and uses a learned heuristic to curate a domain-specific tokenizer (IGOT and IGOT$_{\tau}$), integrated into a domain-adaptive pretraining objective. The approach yields substantial practical benefits, including reduced token usage and training time across models such as LLaMA2-7B and T5, while maintaining or enhancing domain knowledge capture. Overall, IGOT demonstrates that customized tokenization can significantly improve the efficiency and effectiveness of deploying general generative AI in specialized domains.

Abstract

Pretrained Large Language Models (LLMs) such as ChatGPT and Claude have demonstrated strong capabilities across many areas of natural language generation. However, many problems remain when applying LLMs to specialized, domain-specific fields. When using generative AI for downstream tasks, a common approach is to add new knowledge (e.g., private domain knowledge, cutting-edge information) to a pretrained model through continued training or fine-tuning. Whether a universal paradigm for domain adaptation training exists is still an open question. In this article, we propose the Information Gain Optimized Tokenizer (IGOT), which analyzes the special token set of downstream tasks, constructs a new subset using a heuristic function $\phi$ over each special token and its information gain to build a new domain-specific tokenizer, and continues pretraining on the downstream task data. We explore the positive effects of this customized tokenizer on domain-adaptive pretraining and verify that the method can perform better than the ordinary approach of just collecting data and fine-tuning. In our experiments, continued pretraining of LLaMA-7B with IGOT achieved 11.9\% token savings, 12.2\% training time savings, and 5.8\% savings in maximum GPU VRAM usage; combined with the T5 model, training time savings reach 31.5\%, making porting general generative AI to specific domains more effective than before. In domain-specific tasks, supervised $IGOT_\tau$ performs well at reducing both the convergence radius and the convergence point during continued pretraining.

Paper Structure (10 sections, 8 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: An example of IGOT compared to the LLaMA2 tokenizer. Given the input "Introduce Openlane, an EDA tool", the original LLaMA2 tokenizer splits it into 13 tokens, including meaningless fragments such as "rodu" or "ane", as shown in Fig. 1(a). In contrast, IGOT splits it into 8 meaningful parts, keeps proprietary words such as OpenLane intact, and achieves a 38.46% reduction in token count.
  • Figure 2: Panel (a) shows the training loss with the original LLaMA2-7B tokenizer: even in the third epoch, the loss still oscillates severely. Panel (b) shows that IGOT significantly reduces this oscillation, facilitating better model convergence.
  • Figure 3: Panel (a) depicts the distribution of information gain captured by IGOT on the complete dataset. Although many tokens with information gain greater than 4 are captured, their proportion is not high. Panel (b) shows that the supervised $IGOT_\tau$ has a broader distribution over high-gain tokens, which may be one reason why $IGOT_\tau$ is somewhat more effective than IGOT.
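
The 38.46% figure quoted in the Figure 1 caption follows directly from the two token counts given there (13 and 8). The short sketch below reproduces that arithmetic. The function name `token_reduction` is a hypothetical helper introduced only for illustration, not something taken from the paper.

```python
def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Fractional reduction in token count relative to a baseline tokenizer.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - new_tokens) / baseline_tokens


# Token counts quoted in the Figure 1 caption:
# 13 tokens with the LLaMA2 tokenizer, 8 tokens with IGOT.
reduction = token_reduction(13, 8)
print(f"{reduction:.2%}")  # prints 38.46%
```

The same ratio, computed over a whole corpus rather than a single sentence, is one plausible way to read the token-saving percentages reported in the abstract, though the paper's exact measurement procedure is not stated in this section.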