AI-assisted Code Authoring at Scale: Fine-tuning, deploying, and mixed methods evaluation

Vijayaraghavan Murali; Chandra Maddila; Imad Ahmad; Michael Bolin; Daniel Cheng; Negar Ghorbani; Renuka Fernandez; Nachiappan Nagappan; Peter C. Rigby

TL;DR

CodeCompose is a scalable, AI-assisted code authoring system deployed at Meta, built on InCoder with a novel Language Causal Masking objective that enables bidirectional code suggestions. Through fine-tuning on first-party data and a careful production rollout, the authors report substantial improvements in exact-match and BLEU metrics (EM 40–58%, BLEU 56–73%), meaningful production adoption (22% acceptance rate, 8% of code written by the system), and strongly positive developer feedback (91.5% favorable) across 9 languages. The work details the system design, including a GPU inference cluster, a Rust LSP, and telemetry, and evaluates effectiveness via mixed methods (backtests, online production data, and thematic analysis). The findings underscore the practical viability of enterprise-scale AI-assisted coding and offer actionable lessons on trust, UX integration, and evaluation for future deployments and broader lifecycle support in software development.

Abstract

Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code while a developer is writing. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed internally at Meta. CodeCompose is based on the InCoder LLM, which merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces. We present our experience in making design decisions about the model and system architecture for CodeCompose that address the challenges of deploying at this scale. To release an LLM at this scale, we first needed to ensure that it is sufficiently accurate. In a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of between 1.4x and 4.1x over a model trained only on public data. We gradually rolled CodeCompose out to developers. At the time of this writing, 16K developers have used it, with 8% of their code coming directly from CodeCompose. To triangulate our numerical findings, we conduct a thematic analysis of the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.
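
The hidden-line measurement mentioned in the abstract can be illustrated with a small sketch. This is not the authors' evaluation harness: the function name `exact_match_rate`, the `(code_before, hidden_target, code_after)` example layout, the `predict` callback, and the decision to ignore leading and trailing whitespace are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical example layout: (code_before, hidden_target, code_after).
Example = Tuple[str, str, str]


def exact_match_rate(
    examples: Sequence[Example],
    predict: Callable[[str, str], str],
) -> float:
    """Return the fraction of hidden targets the predictor reproduces exactly.

    `predict(code_before, code_after)` stands in for a model call.
    Leading and trailing whitespace is ignored (an assumption of this sketch).
    """
    if not examples:
        return 0.0
    hits = sum(
        predict(before, after).strip() == target.strip()
        for before, target, after in examples
    )
    return hits / len(examples)
```

Under these assumptions, a backtest would hide a line in each sampled file, call the model with the surrounding code, and report this rate per language.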

Paper Structure (30 sections, 2 equations, 3 figures, 3 tables)

Introduction
Model Development and Evaluation Methodology
Model Architecture and Training Objective
Training data
Model Evaluation Method and Measures
System Design
Server
Language Server Protocol
Clients
Telemetry
Evaluation Methodology for CodeCompose in Production
Results
RQ1. Model Evaluation Results
RQ2. Adoption Results
RQ3. Developer Feedback Results
...and 15 more sections

Figures (3)

Figure 1: CodeCompose (a) offers inline code suggestions in VS Code as grey text after the cursor while the user is typing code (Tab to accept), (b) changes its suggestion to adapt to a natural language comment, (c) suggests code or documentation based on code below the current position.
Figure 2: Steps to construct an input to the model in LCM: (i) the code is tokenized at trigger characters where we expect the model to offer suggestions in production, (ii) a random subsequence of tokens is selected to be masked as the "target" to predict given the code before and the code after it, (iii) any additional metadata such as the filename is added to the front, (iv) all four strings (metadata, code before, target, code after) are encoded into tokens, (v) since the model's input length is limited, a 70-30 split is applied to the code before and code after if needed, (vi) all tokens are concatenated together into a single list of tokens with special tokens added to denote the masked target portion. The code on the left shows an example with the randomly selected target portion highlighted.
Figure 3: System architecture for CodeCompose. A pre-trained model checkpoint (InCoder) is trained by the trainer on first-party company data to create the fine-tuned model. This model is served by a cluster of inference service machines with GPUs. When clients (VS Code, IDEs, web surfaces) require a code suggestion, they send a JSON-RPC request to the language server (LSP), which in turn communicates with the inference service through language-agnostic Thrift calls and sends the generated suggestion back. The LSP also logs telemetry, which allows us to compute usage metrics, run experiments, and monitor for regressions. Clients, e.g., VS Code, then display the suggestion through their respective editors.
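
The input-construction steps listed in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the trigger characters, the sentinel names `<MASK>` and `<EOM>`, the metadata format, the choice to keep the tokens nearest the target when truncating, and the final token ordering are all hypothetical. The `encode` argument stands in for a real tokenizer.

```python
import random
from typing import Callable, List, Tuple

MASK, EOM = "<MASK>", "<EOM>"  # hypothetical sentinel tokens
TRIGGERS = "(.=[ \n"           # hypothetical trigger characters


def split_at_triggers(code: str) -> List[str]:
    """Step (i): cut the code after every trigger character."""
    pieces, current = [], ""
    for ch in code:
        current += ch
        if ch in TRIGGERS:
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces


def build_lcm_example(
    code: str,
    filename: str,
    encode: Callable[[str], List[str]],
    max_len: int,
    rng: random.Random,
) -> Tuple[List[str], str]:
    """Return (model input tokens, target string) for one training example."""
    pieces = split_at_triggers(code)
    if not pieces:
        raise ValueError("code must not be empty")

    # Step (ii): pick a random contiguous run of pieces as the masked target.
    start = rng.randrange(len(pieces))
    end = rng.randrange(start + 1, len(pieces) + 1)
    before = "".join(pieces[:start])
    target = "".join(pieces[start:end])
    after = "".join(pieces[end:])

    # Step (iii): prepend metadata (format assumed here).
    metadata = f"# filename: {filename}\n"

    # Step (iv): encode all four strings.
    meta_t, before_t, target_t, after_t = (
        encode(s) for s in (metadata, before, target, after)
    )

    # Step (v): if the context is too long, give 70% of the remaining
    # budget to the code before and 30% to the code after, keeping the
    # tokens closest to the target (an assumption of this sketch).
    budget = max(0, max_len - len(meta_t) - len(target_t) - 3)
    if len(before_t) + len(after_t) > budget:
        before_budget = int(budget * 0.7)
        after_budget = budget - before_budget
        before_t = before_t[max(0, len(before_t) - before_budget):]
        after_t = after_t[:after_budget]

    # Step (vi): concatenate with sentinels marking the masked span
    # (ordering assumed, following the usual causal-masking layout).
    tokens = meta_t + before_t + [MASK] + after_t + [MASK] + target_t + [EOM]
    return tokens, target


# Toy usage with a character-level stand-in for the tokenizer.
example_tokens, example_target = build_lcm_example(
    "def add(a, b):\n    return a + b\n", "math_utils.py", list, 64, random.Random(0)
)
```

Placing the target last lets a left-to-right model learn to generate the masked span after seeing both the code before and the code after it, which is what makes suggestions sensitive to code below the cursor.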
