Improving FIM Code Completions via Context & Curriculum Based Learning

Hitesh Sagtani, Rishabh Mehrotra, Beyang Liu

TL;DR

This work tackles latency-constrained code completion by enhancing Fill-in-the-Middle (FIM) models through curriculum-aware training and context-aware fine-tuning. By extracting hard-to-complete patterns for a curriculum dataset and enriching inputs with repository context using ASTs, symbol graphs, and compiler-based symbol definitions, the authors fine-tune StarCoder and DeepSeek variants with a Curriculum-Context Merged Fine-Tuning (CMFT) approach. Offline benchmarks (Single-Line FIM, CrossCodeEval, and a new Multi-Line Infilling dataset) and online A/B tests show that CMFT yields consistent gains, especially for smaller models, while maintaining low latency. Completion Acceptance Rate (CAR) and Completion Persistence Rate (CPR) both improve in real-world settings, which supports the practicality of the approach for real-time coding assistance and points to future work on interpretability and metric development.

Abstract

Fill-in-the-Middle (FIM) models play a vital role in code completion tasks, leveraging both prefix and suffix context to provide more accurate and contextually relevant suggestions. This paper presents approaches to improve FIM code completion while addressing the challenge of maintaining low latency for real-time coding assistance. We enhance FIM code completion by incorporating context and curriculum examples in the training process. We identify patterns where completion suggestions fail more frequently, revealing complexities that smaller language models struggle with. To address these challenges, we develop a curriculum dataset by extracting hard-to-complete patterns from code repositories and generate context examples using semantic and static analysis tools (e.g., the TSC compiler). We fine-tune models of various sizes, including StarCoder and DeepSeek, on this enhanced dataset. Our evaluation encompasses three benchmarks: the SantaCoder FIM task, the Amazon CCEval benchmark, and a new Multi-Line Infilling evaluation benchmark derived from SWE-bench. Comprehensive ablation studies across multiple model sizes reveal two findings: all fine-tuned models improve, with more pronounced gains for smaller models, and incorporating difficult-to-complete examples as part of curriculum learning improves code completion performance. The gains for smaller models are particularly significant given the latency constraints of code completion tasks. Larger models like GPT and Claude perform well in multi-line completions but are impractical for this setting because of their high latency, whereas our fine-tuned models achieve a balance between performance and latency. Finally, we validate our approach through online A/B testing, demonstrating tangible improvements in Completion Acceptance Rate (CAR) and Completion Persistence Rate (CPR), with zero latency impact.
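To make the prefix-and-suffix idea concrete, the following minimal sketch shows how a FIM prompt can be assembled from a source file and a cursor position. It assumes the prefix-suffix-middle ordering and the sentinel tokens used by StarCoder-family tokenizers; other models use different tokens, and the helper name `build_fim_prompt` is hypothetical rather than taken from the paper.

```python
# Sentinel tokens as used by StarCoder-family tokenizers (assumption: other
# models, including DeepSeek variants, use different sentinel strings).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Split `source` at `cursor` and arrange it in prefix-suffix-middle order.

    The model is expected to generate the missing middle after FIM_MIDDLE.
    """
    prefix, suffix = source[:cursor], source[cursor:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Hypothetical example: the cursor sits right after "return ".
source = "def area(r):\n    return \n"
cursor = source.index("return ") + len("return ")
prompt = build_fim_prompt(source, cursor)
```

The model sees the code on both sides of the cursor before it starts generating, which is what distinguishes FIM from left-to-right completion.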

Paper Structure

This paper contains 29 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustrative examples of various AST node types. The cursor position highlights the position where completions are triggered, with the node type indicated at the bottom right of each example.
  • Figure 2: % Relative CAR at AST nodes: Negative CAR values for node types like Call Expression and Parameters suggest that these nodes are harder to complete than average.
  • Figure 3: Motivating example for Curriculum learning: Model fails to predict completion at Call Expression node, with nested symbol countryCode as an argument.
  • Figure 4: Motivating example for Context learning: Small models often fail to predict the completion even with relevant context in the prompt.
  • Figure 5: Data generation pipeline illustrating the extraction of curriculum and context examples from source files for Call Expression node type, using tree-sitter for node extraction and TSC Compiler API for precise symbol definitions.
  • ...and 2 more figures
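The curriculum extraction step described for Figure 5 can be illustrated with a small sketch. The paper's pipeline uses tree-sitter for node extraction and the TSC Compiler API for symbol definitions; the code below swaps in Python's standard `ast` module as a stand-in so that it runs without extra dependencies, and it covers only the node-extraction half. The function names and the prefix/middle/suffix record layout are assumptions made for illustration.

```python
import ast


def line_starts(source: str) -> list[int]:
    """Character offset at which each line of `source` begins."""
    starts = [0]
    for line in source.splitlines(keepends=True):
        starts.append(starts[-1] + len(line))
    return starts


def call_expression_examples(source: str) -> list[dict]:
    """Turn every call expression into a prefix/middle/suffix training record.

    Note: `ast` reports column offsets in UTF-8 bytes, so this sketch assumes
    ASCII source, where byte and character offsets coincide.
    """
    starts = line_starts(source)
    examples = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            begin = starts[node.lineno - 1] + node.col_offset
            end = starts[node.end_lineno - 1] + node.end_col_offset
            examples.append({
                "prefix": source[:begin],
                "middle": source[begin:end],  # the span the model must fill in
                "suffix": source[end:],
            })
    return examples


# Hypothetical input containing a nested call expression.
sample = "code = lookup(country)\nprint(format_phone(number, code))\n"
examples = call_expression_examples(sample)
```

Each record masks one call expression, including nested ones, so the harder node types are represented as separate fill-in targets.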