Grounded AI for Code Review: Resource-Efficient Large-Model Serving in Enterprise Pipelines

Sayan Mandal, Hua Jiang

TL;DR

The paper tackles the challenge of scalable, trustworthy code review in enterprise pipelines by coupling deterministic static-analysis grounding with a resource-efficient, on-prem LLM serving stack. It introduces a grounding-first workflow that extracts AST-guided context within a fixed token budget, enabling concise, evidence-based explanations and remediation guidance posted directly into PRs. A two-service architecture (PR-oriented Orchestrator and on-prem LLM Backend) uses quantization, multi-tier caching, and on-demand GPU loading to achieve sub-minute median (p50) first-feedback times while maintaining competitive rule-violation reduction against larger baselines. The findings suggest grounded, open-weight deployments can reach enterprise-grade performance with strong auditability and governance, while remaining extensible to broader standards and languages. The work provides a practical blueprint for reproducible, auditable, and cost-efficient grounded AI-assisted code review at scale, with clear paths toward automated patching and richer telemetry.

Abstract

Automated code review adoption lags in compliance-heavy settings, where static analyzers produce high-volume, low-rationale outputs and naive LLM use risks hallucination and cost overhead. We present a production system for grounded, PR-native review that pairs static-analysis findings with AST-guided context extraction and a single-GPU, on-demand serving stack (quantized open-weight model, multi-tier caching) to deliver concise explanations and remediation guidance. Evaluated on safety-oriented C/C++ standards, the approach achieves sub-minute median first-feedback (offline p50 build+LLM 59.8 s) while maintaining competitive violation reduction and lower violation rates than larger proprietary models. The architecture is decoupled: teams can adopt the grounding/prompting layer or the serving layer independently. A small internal survey (n=8) provides directional signals of reduced triage effort and moderate perceived grounding, with participants reporting fewer human review iterations. We outline operational lessons and limitations, emphasizing reproducibility, auditability, and pathways to broader standards and assisted patching.

Paper Structure

This paper contains 35 sections, 2 figures, 1 table, 2 algorithms.

Figures (2)

  • Figure 1: Complete end-to-end framework of AutoCodeReview: The system consists of the Code-Review Orchestrator, which extracts and analyzes code and generates code review prompts (Static Analyzer + Prompt Generator), and the LLM Serving Backend, which provides access to the LLM Service.
  • Figure 2: Benchmark summary: severity-level reductions and introductions (top), per-rule outcome decomposition (bottom-left), and latency characteristics (bottom-right). Reduction = (pre-post)/pre over violations with pre>0; new-only violations (pre=0, post>0) contribute 1.0 to introductions.
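
The reduction and introduction definitions in the Figure 2 caption can be illustrated with a minimal Python sketch. The function name `rule_outcome` and the per-rule framing are assumptions made here for illustration; the caption does not say how the paper aggregates values across rules or severity levels, nor how a rule whose count grows (post > pre > 0) is treated, so this sketch simply leaves that case as a negative reduction.

```python
from typing import Optional, Tuple


def rule_outcome(pre: int, post: int) -> Tuple[Optional[float], float]:
    """Return (reduction, introduction) for one rule (hypothetical helper).

    pre  -- violation count before the suggested fix
    post -- violation count after the suggested fix
    """
    if pre < 0 or post < 0:
        raise ValueError("violation counts must be non-negative")
    if pre > 0:
        # Reduction = (pre - post) / pre, defined only when pre > 0.
        # Assumption: a negative value is kept as-is when post > pre.
        return (pre - post) / pre, 0.0
    if post > 0:
        # New-only violation (pre = 0, post > 0): contributes 1.0
        # to introductions and has no defined reduction.
        return None, 1.0
    # Rule absent both before and after: nothing to report.
    return None, 0.0


if __name__ == "__main__":
    print(rule_outcome(8, 2))  # (0.75, 0.0)
    print(rule_outcome(0, 3))  # (None, 1.0)
```

Returning `None` rather than 0.0 for rules with pre = 0 keeps them out of any later average over reductions, which matches the caption's restriction to violations with pre > 0.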