Empowering Character-level Text Infilling by Eliminating Sub-Tokens
Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Hongsheng Li
TL;DR
Decoder-only models struggle with character-level infilling because predicting sub-tokens at span boundaries incurs high perplexity and gives the model inconsistent training signals. The authors propose FIM-SE, which reframes training so that infilling is performed at the line level under Start/End constraints (L-Prefix and F-Suffix). The model therefore never has to predict a sub-token at inference time, yet it retains character-level capability. On HumanEval-based infilling benchmarks, FIM-SE yields consistent gains across models (e.g., +11.5% single-line and +10.7% multi-line for Code Llama 13B, with random-span improvements across models) with minimal impact on code-generation performance. The results are supported by a theoretical account of inconsistent labels and by a post-check mechanism that filters for reliable completions. Overall, FIM-SE advances practical, line-aware infilling for both natural language and code, improving editing and completion workflows and transferring across infilling granularities.
Abstract
In infilling tasks, sub-tokens, which arise when a complete token is split into two parts, often emerge at the boundaries between the prefix, middle, and suffix. Traditional methods train models at the token level, which leads to sub-optimal performance on character-level infilling at inference time. Alternatively, some approaches address character-level infilling directly, but they rely on predicting sub-tokens at inference, and this degrades performance because the model has high perplexity on sub-tokens. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling by using a line-level format, so that the model never has to predict a sub-token at inference. In addition, we incorporate two special tokens to signify the remainders of the incomplete lines, thereby strengthening generation guidance. Extensive experiments demonstrate that our approach surpasses previous methods by a significant margin. Code is available at https://github.com/SenseLLM/FIM-SE.
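The line-level reformulation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `split_at_line_boundaries` and `recover_middle` are hypothetical, and the exact prompt layout and special-token strings used by FIM-SE are not reproduced here. The sketch only shows the idea of moving the cut points to line boundaries, keeping the two partial-line fragments (called L-Prefix and F-Suffix in the TL;DR) as constraints, and then checking a generated line-level completion against them to recover the character-level middle.

```python
from typing import NamedTuple, Optional


class LineLevelSplit(NamedTuple):
    prefix_lines: str  # complete lines before the line containing the gap start
    l_prefix: str      # partial last line of the prefix (start constraint)
    f_suffix: str      # partial first line of the suffix (end constraint)
    suffix_lines: str  # rest of the suffix, beginning at its first newline


def split_at_line_boundaries(prefix: str, suffix: str) -> LineLevelSplit:
    """Move character-level cut points outward to the nearest line boundaries."""
    cut = prefix.rfind("\n") + 1  # 0 when the prefix contains no newline
    newline = suffix.find("\n")
    if newline == -1:             # the suffix is a single partial line
        newline = len(suffix)
    return LineLevelSplit(prefix[:cut], prefix[cut:], suffix[:newline], suffix[newline:])


def recover_middle(generated: str, l_prefix: str, f_suffix: str) -> Optional[str]:
    """Assumed post-check: accept a completion only if it starts with l_prefix
    and ends with f_suffix, then strip both to get the character-level middle."""
    if len(generated) < len(l_prefix) + len(f_suffix):
        return None
    if not (generated.startswith(l_prefix) and generated.endswith(f_suffix)):
        return None
    return generated[len(l_prefix): len(generated) - len(f_suffix)]


# A character-level gap inside the identifier "return".
prefix = "def add(a, b):\n    ret"
suffix = "a + b\n\nprint(add(1, 2))\n"
parts = split_at_line_boundaries(prefix, suffix)

# Stand-in for a model output: the whole missing line, not a sub-token fragment.
generated = "    return a + b"
middle = recover_middle(generated, parts.l_prefix, parts.f_suffix)
```

In this toy case the model would be asked to produce the full line `    return a + b` rather than the fragment `urn `, so generation starts and ends on whole tokens. A completion that violates either constraint is rejected by `recover_middle`, and a caller could then resample.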
