Empowering Character-level Text Infilling by Eliminating Sub-Tokens

Houxing Ren; Mingjie Zhan; Zhongyuan Wu; Hongsheng Li

TL;DR

Decoder-only models struggle with character-level infilling: when a prefix, middle, or suffix boundary falls inside a token, the model must predict sub-tokens, which induce high perplexity and inconsistent training signals. The authors propose FIM-SE, which recasts character-level infilling as line-level infilling under start and end constraints (L-Prefix and F-Suffix), so that no sub-token has to be predicted at inference while character-level capability is retained. On HumanEval-based infilling benchmarks and code-generation tasks, the method yields consistent gains across models (e.g., Code Llama 13B: +11.5% single-line, +10.7% multi-line, plus random-span improvements across models) with minimal impact on code generation. The results are supported by a theoretical account of inconsistent labels and a post-check mechanism that guides reliable completions. Overall, FIM-SE offers practical, line-aware infilling for both natural language and code, supporting editing and completion workflows as well as transfer across infilling granularities.

Abstract

In infilling tasks, sub-tokens, representing instances where a complete token is segmented into two parts, often emerge at the boundaries of prefixes, middles, and suffixes. Traditional methods focused on training models at the token level, leading to sub-optimal performance in character-level infilling tasks at inference time. Alternatively, some approaches considered character-level infilling but relied on predicting sub-tokens at inference time; this strategy weakened character-level infilling ability because of the model's high perplexity on sub-tokens. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling tasks by using a line-level format that avoids predicting any sub-token at inference time. In addition, we incorporate two special tokens to signify the rest of the incomplete lines, thereby enhancing generation guidance. Extensive experiments demonstrate that our proposed approach surpasses previous methods, offering a significant advantage. Code is available at https://github.com/SenseLLM/FIM-SE.
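
The line-level reformulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_for_line_level` and `post_check` are hypothetical, and the choice to leave the newline on the side of the remaining suffix is an assumption made only for this example. The idea shown is that the incomplete last line of the prefix (L-Prefix) and the incomplete first line of the suffix (F-Suffix) become constraints on a whole-line generation, and the character-level middle is recovered by stripping them afterwards.

```python
from typing import Optional, Tuple


def split_for_line_level(prefix: str, suffix: str) -> Tuple[str, str, str, str]:
    """Split a character-level (prefix, suffix) pair at line boundaries.

    Returns (full_prefix, l_prefix, f_suffix, full_suffix), where
    l_prefix is the incomplete last line of the prefix and
    f_suffix is the incomplete first line of the suffix.
    """
    # Everything up to and including the last newline is complete lines.
    cut = prefix.rfind("\n") + 1
    full_prefix, l_prefix = prefix[:cut], prefix[cut:]

    # Everything from the first newline onward is complete lines.
    # Assumption: the newline itself stays with the remaining suffix.
    nl = suffix.find("\n")
    if nl == -1:
        f_suffix, full_suffix = suffix, ""
    else:
        f_suffix, full_suffix = suffix[:nl], suffix[nl:]
    return full_prefix, l_prefix, f_suffix, full_suffix


def post_check(generated: str, l_prefix: str, f_suffix: str) -> Optional[str]:
    """Recover the character-level middle from a line-level generation.

    Returns None when the generation violates the start or end constraint.
    """
    if len(generated) < len(l_prefix) + len(f_suffix):
        return None
    if not generated.startswith(l_prefix) or not generated.endswith(f_suffix):
        return None
    return generated[len(l_prefix): len(generated) - len(f_suffix)]


# Toy example: the cursor sits inside the word "return".
prefix = "def add(a, b):\n    ret"
suffix = "n a + b\nprint(add(1, 2))\n"
full_prefix, l_prefix, f_suffix, full_suffix = split_for_line_level(prefix, suffix)

# A stand-in for what a model might generate at the line level.
generated_lines = "    return a + b"
middle = post_check(generated_lines, l_prefix, f_suffix)
```

In this toy case the recovered middle is `"ur"`, so `prefix + middle + suffix` reproduces the complete program. A generation that does not begin with the L-Prefix or end with the F-Suffix is rejected by returning `None`.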

Paper Structure (22 sections, 14 equations, 4 figures, 7 tables)

Introduction
Related Work
Large Language Models for Infilling
Text Infilling Models
Preliminaries
Fill-In-the-Middle (FIM)
Impact of Inconsistent Labels
Methodology
FIM-SE Training
FIM-SE Inference
Learning and Discussion
Experiments
Experimental Setup
Results
Detail Analysis
...and 7 more sections

Figures (4)

Figure 1: The probabilities of prediction when inconsistent labels appear in the training data.
Figure 2: An overview of the difference between FIM and the proposed FIM-SE. Here, the green background indicates vanilla FIM and the blue background indicates our FIM-SE.
Figure 3: Performance on the HumanEval random-span infilling task at different temperatures. The line denotes the difference between FIM-SE and FIM. Note that when the temperature exceeds 1.4, both models output noisy text and show very low performance.
Figure 4: Statistics of length of L-Prefix and F-Suffix.
