
Empowering Character-level Text Infilling by Eliminating Sub-Tokens

Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Hongsheng Li

TL;DR

Decoder-only models struggle with character-level infilling because predicting sub-tokens at span boundaries incurs high perplexity and produces inconsistent training signals. The authors propose FIM-SE, which recasts infilling at the line level and uses start and end constraints (L-Prefix and F-Suffix), so the model avoids predicting sub-tokens during inference while retaining character-level capability. On HumanEval-based infilling benchmarks and code-generation tasks, the method shows gains across models (e.g., Code Llama 13B: +11.5% single-line, +10.7% multi-line, plus random-span improvements across models) with minimal impact on code generation. These results are supported by a theoretical account of inconsistent labels and a post-check mechanism that guides reliable completions. Overall, FIM-SE advances practical, line-aware infilling for both natural language and code, supporting better editing and completion workflows and transfer across infilling granularities.

Abstract

In infilling tasks, sub-tokens, which arise when a complete token is split into two parts, often emerge at the boundaries of prefixes, middles, and suffixes. Traditional methods trained models at the token level, leading to sub-optimal performance on character-level infilling tasks at inference time. Alternatively, some approaches considered character-level infilling but relied on predicting sub-tokens during inference, a strategy that diminished their ability on character-level infilling tasks because of the model's large perplexity on sub-tokens. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling tasks by using a line-level format that avoids predicting any sub-token during inference. In addition, we incorporate two special tokens to signify the rest of the incomplete lines, thereby enhancing generation guidance. Extensive experiments demonstrate that our proposed approach surpasses previous methods by a significant margin. Code is available at https://github.com/SenseLLM/FIM-SE.
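
The line-level reformulation described in the abstract can be illustrated with a small sketch. It is not taken from the FIM-SE repository: the names `LineLevelRequest`, `to_line_level`, and `post_check` are hypothetical, and the choice to leave the newline that ends the first suffix line in the remaining suffix rather than in the F-Suffix is an assumption. The sketch only shows the idea that the incomplete last line of the prefix (L-Prefix) and the incomplete first line of the suffix (F-Suffix) are separated out, that the model is then expected to generate whole lines, and that a post-check keeps a completion only if it starts with the L-Prefix and ends with the F-Suffix.

```python
from typing import NamedTuple, Optional


class LineLevelRequest(NamedTuple):
    """Hypothetical container for a character-level request recast at line level."""
    line_prefix: str   # complete lines before the cursor
    l_prefix: str      # incomplete last line of the prefix (L-Prefix)
    f_suffix: str      # incomplete first line of the suffix (F-Suffix)
    line_suffix: str   # rest of the suffix, starting at the first newline


def to_line_level(prefix: str, suffix: str) -> LineLevelRequest:
    """Split prefix and suffix at line boundaries (assumed convention)."""
    cut = prefix.rfind("\n") + 1      # 0 when the prefix has no newline
    end = suffix.find("\n")
    if end == -1:                      # suffix is a single unfinished line
        end = len(suffix)
    return LineLevelRequest(prefix[:cut], prefix[cut:], suffix[:end], suffix[end:])


def post_check(req: LineLevelRequest, generated: str) -> Optional[str]:
    """Return the character-level middle, or None if the constraints fail.

    `generated` is the line-level text a model produced; it must begin with
    the L-Prefix and end with the F-Suffix without the two overlapping.
    """
    if len(generated) < len(req.l_prefix) + len(req.f_suffix):
        return None
    if not generated.startswith(req.l_prefix):
        return None
    if not generated.endswith(req.f_suffix):
        return None
    return generated[len(req.l_prefix): len(generated) - len(req.f_suffix)]


# Usage: the cursor sits in the middle of the word "return".
prefix = "def add(a, b):\n    ret"
suffix = "a + b\nprint(add(1, 2))\n"
req = to_line_level(prefix, suffix)
middle = post_check(req, "    return a + b")   # a plausible line-level output
```

In this example the recovered `middle` is the character-level span that the original request asked for, while the model itself only had to produce a complete line. A completion that violates either constraint is rejected by returning `None`, which a caller could use to resample.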

Paper Structure

This paper contains 22 sections, 14 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: The probabilities of prediction when inconsistent labels appear in the training data.
  • Figure 2: An overview of the difference between FIM and the proposed FIM-SE. Here, the green background indicates vanilla FIM and the blue background indicates our FIM-SE.
  • Figure 3: Performance on the HumanEval random-span infilling task at different temperatures. The line denotes the difference between FIM-SE and FIM. Note that when the temperature exceeds 1.4, both models output noisy text and show very low performance.
  • Figure 4: Statistics of length of L-Prefix and F-Suffix.