Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization

Huanting Wang; Patrick Lenihan; Zheng Wang

TL;DR

Prom addresses deployment-time data drift in ML models for code analysis and optimization by using conformal prediction to quantify the credibility and confidence of each prediction. It is a model-agnostic Python toolkit that combines adaptive calibration, an ensemble of nonconformity measures, and an incremental-learning feedback loop that retrains the model on drifting samples. Across 13 models and 5 tasks, Prom detects mispredicted (drifting) inputs with about 96% average recall, and relabeling up to 5% of the samples it flags restores performance to a level close to that seen at design time, which keeps labeling overhead low. The approach improves robustness without changing model architectures, making it practical to deploy in code optimization and analysis workflows.

Abstract

Supervised machine learning techniques have shown promising results in code analysis and optimization problems. However, a learning-based solution can be brittle because minor changes in hardware or application workloads -- such as facing a new CPU architecture or code pattern -- may jeopardize decision accuracy, ultimately undermining model robustness. We introduce Prom, an open-source library to enhance the robustness and performance of predictive models against such changes during deployment. Prom achieves this by using statistical assessments to identify test samples prone to mispredictions and using feedback on these samples to improve a deployed model. We showcase Prom by applying it to 13 representative machine learning models across 5 code analysis and optimization tasks. Our extensive evaluation demonstrates that Prom can successfully identify an average of 96% (up to 100%) of mispredictions. By relabeling up to 5% of the Prom-identified samples through incremental learning, Prom can help a deployed model achieve a performance comparable to that attained during its model training phase.
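
The "statistical assessments" mentioned above are based on conformal prediction. The sketch below is illustrative only and is not Prom's API: the function names, the choice of nonconformity score (1 minus the predicted probability of a label), and the textbook definitions of credibility (largest p-value) and confidence (1 minus the second-largest p-value) are all assumptions made for this example.

```python
import numpy as np

def calibration_scores(cal_probs, cal_labels):
    """Nonconformity score of each calibration sample: 1 - P(true label).

    Hypothetical helper; the score function is an assumption of this sketch.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    return 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

def credibility_and_confidence(cal_scores, test_probs):
    """Return (credibility, confidence) for one test input.

    For every candidate label, the p-value is the fraction of calibration
    scores that are at least as nonconforming as the test score, with the
    usual +1 smoothing of conformal prediction.
    """
    test_probs = np.asarray(test_probs, dtype=float)
    n = len(cal_scores)
    p_values = np.array(
        [(np.sum(cal_scores >= 1.0 - p) + 1) / (n + 1) for p in test_probs]
    )
    ranked = np.sort(p_values)[::-1]  # largest p-value first
    return float(ranked[0]), float(1.0 - ranked[1])
```

A low credibility means the input looks unlike anything in the calibration set, which is the signal a drift detector can use to flag a prediction for relabeling.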

Paper Structure (34 sections, 3 equations, 13 figures, 4 tables)

INTRODUCTION
MOTIVATION
BACKGROUND
The Need of Credibility Evaluation
Statistical Assessment
Conformal Prediction
OVERVIEW OF PROM
Implementation
Model design phase
Model deployment phase
METHODOLOGY
Nonconformity Measures
Nonconformity functions
Computing p-value
Initialization Assessment
...and 19 more sections

Figures (13)

Figure 1: Motivation example: impact of data drift on ML models for code vulnerability detection.
Figure 2: Workflow of Prom during deployment.
Figure 3: At design time, Prom splits the training data into training and calibration sets. During deployment, it calculates credibility and confidence scores, using majority voting to detect drifting samples. These samples can then be labeled for model updates via offline incremental training.
Figure 4: Simplified code template of Prom.
Figure 5: Prom integrates multiple nonconformity functions that vote to reject or approve the ML prediction.
...and 8 more figures
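
The captions of Figures 3 and 5 describe several nonconformity functions voting on whether to accept a prediction, with a majority vote deciding whether a sample is drifting. A minimal sketch of that idea follows. The three nonconformity functions, the `significance` threshold, and all names are hypothetical choices for illustration, not the functions or interface that Prom ships.

```python
import numpy as np

def lac(probs, y):
    """1 - probability assigned to label y."""
    return 1.0 - probs[y]

def margin(probs, y):
    """Gap between the strongest competing label and label y."""
    return np.delete(probs, y).max() - probs[y]

def ratio(probs, y):
    """Ratio of the strongest competing label to label y."""
    return np.delete(probs, y).max() / (probs[y] + 1e-12)

# Hypothetical ensemble of nonconformity functions (assumed for this sketch).
MEASURES = (lac, margin, ratio)

def calibrate(cal_probs, cal_labels):
    """Score every calibration sample under every nonconformity function."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    return [
        np.array([f(p, y) for p, y in zip(cal_probs, cal_labels)])
        for f in MEASURES
    ]

def is_drifting(cal_scores, test_probs, significance=0.1):
    """Flag a test input as drifting if most measures reject its prediction.

    Each measure rejects when the conformal p-value of the predicted label
    falls below `significance`.
    """
    test_probs = np.asarray(test_probs, dtype=float)
    predicted = int(np.argmax(test_probs))
    reject_votes = 0
    for f, scores in zip(MEASURES, cal_scores):
        p_value = (np.sum(scores >= f(test_probs, predicted)) + 1) / (len(scores) + 1)
        if p_value < significance:
            reject_votes += 1
    return reject_votes > len(MEASURES) / 2
```

Note that with n calibration samples the smallest attainable p-value is 1/(n+1), so the significance level only becomes meaningful once the calibration set is large enough. Samples flagged this way would be the candidates for relabeling and offline incremental training.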
