Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading

Hanzhen Lu, Lishui Fan, Jiachi Chen, Qiuyuan Chen, Zhao Wei, Zhongxin Liu

TL;DR

This work proposes MCCom (Model-Cascading-based code Completion), a framework that cascades a local SLM with a cloud-based LLM, significantly reducing cloud computation costs and improving the LLM's exact match rate through effective collaboration.

Abstract

Line-level code completion requires a critical balance between high accuracy and low latency. Existing methods suffer from a trade-off: large language models (LLMs) provide high-quality suggestions but incur high latency, while small language models (SLMs) are fast but often suboptimal. We propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local SLM with a cloud-based LLM. To achieve effective cascading, MCCom leverages user actions as a novel signal to trigger the LLM only when the SLM fails, significantly reducing cloud computation costs. Furthermore, we introduce a two-stage speculative decoding strategy and an iterative retrieval mechanism to enhance collaboration between the models. We also train a 121M-parameter lightweight model, which achieves 73.8% of the performance of a 7B state-of-the-art model. Evaluated on RepoEval and a new real-world benchmark StmtEval, MCCom reduces inference latency by up to 47.9% and LLM usage by 46.3%, while improving the LLM's exact match rate by 8.9% through effective collaboration.
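The cascading idea in the abstract can be illustrated with a minimal sketch: the local SLM answers first, and the cloud LLM is queried only when a user action signals that the SLM's suggestion failed. Everything named here (`cascade_complete`, `CascadeResult`, the `user_accepts` predicate, and the toy models) is hypothetical and chosen for illustration. The abstract does not specify which user actions MCCom observes, so the sketch assumes a simple accept/reject signal, and it leaves out the two-stage speculative decoding and iterative retrieval components.

```python
from dataclasses import dataclass
from typing import Callable

# A completer maps the code before the cursor to a suggested line.
Completer = Callable[[str], str]


@dataclass
class CascadeResult:
    suggestion: str   # the completion finally shown to the user
    source: str       # "slm" or "llm"
    llm_called: bool  # whether the cloud model was queried


def cascade_complete(
    prefix: str,
    slm: Completer,
    llm: Completer,
    user_accepts: Callable[[str], bool],
) -> CascadeResult:
    """Try the local SLM first; fall back to the cloud LLM on rejection.

    `user_accepts` is an assumed stand-in for the user-action signal:
    it returns True when the user keeps the SLM's suggestion.
    """
    draft = slm(prefix)
    if user_accepts(draft):
        # The SLM succeeded, so no cloud request is made.
        return CascadeResult(draft, "slm", False)
    # The SLM failed according to the user signal; escalate to the LLM.
    return CascadeResult(llm(prefix), "llm", True)


# Toy stand-ins for the two models (assumptions, not real model calls).
def toy_slm(prefix: str) -> str:
    return "return a + b" if "def add" in prefix else "pass"


def toy_llm(prefix: str) -> str:
    return "return a * b" if "def mul" in prefix else "return a + b"


easy = cascade_complete(
    "def add(a, b):\n    ", toy_slm, toy_llm, lambda s: s == "return a + b"
)
hard = cascade_complete(
    "def mul(a, b):\n    ", toy_slm, toy_llm, lambda s: s == "return a * b"
)
```

In this toy run, `easy` is served entirely by the local model, while `hard` escalates to the cloud model after the SLM's draft is rejected. Counting how often `llm_called` is true over a workload gives a rough analogue of the LLM usage figure the abstract reports.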

Paper Structure (34 sections, 2 equations, 2 figures, 11 tables)

Figures (2)

  • Figure 1: The overview of MCCom
  • Figure 2: Statistical Information about Efficient Results