Regression Language Models for Code

Yash Akhauri, Xingyou Song, Arissa Wongpanich, Bryan Lewandowski, Mohamed S. Abdelfattah

TL;DR

This work proposes Regression Language Models (RLMs) that treat code-to-metric regression as a text-to-text task, so that a single unified model can predict memory, latency, and accuracy across multiple programming languages and representations. An encoder–decoder RLM with 300M parameters, initialized from T5Gemma and equipped with a custom digit-by-digit numeric tokenizer, achieves Spearman $\rho > 0.9$ on APPS and an average Spearman $\rho > 0.5$ across 17 languages on CodeNet. It also attains an average Kendall $\tau$ of $0.46$ on five NAS design spaces while supporting multi-objective latency prediction across hardware platforms. The approach avoids heavy feature engineering, leverages pretraining (language and regression) for faster convergence, and uses a unified ONNX-based representation for NAS, which makes it applicable across domains and competitive with graph-based methods. Ablations show the benefits of learned tokenization, longer context, and decoder-based regression, suggesting that code-based regression can be cast as a single, scalable text-to-text task in line with the modern LLM paradigm.
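
To make the idea of a digit-by-digit numeric tokenizer concrete, here is a minimal sketch of one way a target value could be spelled out as a short token sequence and decoded back. The token layout (a sign token, a fixed number of mantissa digits, and an exponent token), the function names `encode_number` and `decode_number`, and the default of 4 mantissa digits are all assumptions made for illustration; this section does not specify the paper's actual vocabulary.

```python
def encode_number(y: float, mantissa_digits: int = 4) -> list[str]:
    """Spell a float as [sign, digit, ..., digit, exponent] tokens (hypothetical layout)."""
    sign = "<->" if y < 0 else "<+>"
    # Scientific notation with a fixed number of significant digits, e.g. '5.488e+03'.
    sci = f"{abs(y):.{mantissa_digits - 1}e}"
    mantissa, exponent = sci.split("e")
    digits = mantissa.replace(".", "")
    return [sign] + [f"<{d}>" for d in digits] + [f"<E{int(exponent)}>"]


def decode_number(tokens: list[str]) -> float:
    """Invert encode_number: rebuild the float from sign, digits, and exponent tokens."""
    sign = -1.0 if tokens[0] == "<->" else 1.0
    digits = "".join(t[1:-1] for t in tokens[1:-1])  # strip the angle brackets
    exponent = int(tokens[-1][2:-1])                 # '<E3>' -> 3, '<E-2>' -> -2
    return sign * float(f"{digits[0]}.{digits[1:]}e{exponent}")


tokens = encode_number(5488.0)
print(tokens)                 # ['<+>', '<5>', '<4>', '<8>', '<8>', '<E3>']
print(decode_number(tokens))  # 5488.0
```

With a fixed digit budget the encoding is lossy for values that need more significant digits: under these assumptions, 10489.5 round-trips to 10490.0.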

Abstract

We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) can simultaneously predict, directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M-parameter RLM initialized from T5Gemma obtains a Spearman rank correlation above 0.9 on competitive programming submissions from APPS, and a single unified model achieves an average Spearman rank correlation above 0.5 across 17 separate languages from CodeNet. Furthermore, the RLM obtains the highest average Kendall-Tau, 0.46, on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predicts architecture latencies on numerous hardware platforms.
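
The results above are reported as rank correlations, which measure whether predictions order the examples correctly rather than whether they match the exact values. The following sketch computes Spearman $\rho$ and Kendall $\tau$ in plain Python. The helper names and the four prediction/ground-truth pairs are made up for illustration and are not taken from the paper; the Kendall variant shown is tau-a, which may differ from the variant the authors use when ties are present.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(pred, truth):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(pred), average_ranks(truth))


def kendall_tau(pred, truth):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        s = (pred[i] - pred[j]) * (truth[i] - truth[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values: the model swaps the order of the two largest items.
truth = [10, 30, 20, 40]
pred = [11, 45, 22, 35]
print(spearman_rho(pred, truth))  # 0.8
print(kendall_tau(pred, truth))   # 0.666... (5 concordant pairs, 1 discordant)
```

Both metrics are unchanged by any monotone rescaling of the predictions, which is why they suit tasks where the goal is to pick the better of several candidates.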

Paper Structure

This paper contains 35 sections, 15 figures, 14 tables.

Figures (15)

  • Figure 1: A Regression Language Model (RLM) is able to simultaneously read code from many different languages and compilation levels, and predict metrics such as accuracy, memory, and latency.
  • Figure 2: Diagonal fit ($\diagup$) is better. Scatterplot of RLM's pointwise $y$-prediction vs. ground truth value over varying tasks from CodeNet (C++ and Python), Triton Kernels, and APPS. For better visualization, axes are scaled by percentile (probits), and $y$-value ticks are shown at 10 and 90%.
  • Figure 3: We identify problems with $>8$ candidate solutions in our test set of 15000, and investigate whether the RLM is able to rank potential solutions. (Left) Distribution of problems and their in-problem Spearman $\rho$ rankings using the RLM. (Right) RLM vs. random selection for choosing the top-1 lowest-memory solution to a question, organized by solution count.
  • Figure 4: Side-by-side solutions from the APPS dataset. Left minimizes memory ($O(1)$ extra space, $O(nm)$ time). Right is often faster due to hash lookups but uses more memory via Counter, set, and per-iteration intersection. RLM predicted 5488 (left) and 10489.5 (right) bytes; ground truth: 5464 and 9672.
  • Figure 5: Single RLM trained on five consecutive objectives on NASBench-201, i.e. first validation accuracy and then hardware-specific latencies over four devices (Pixel3 (Mobile), Eyeriss (ASIC), Intel CPU and Nvidia GPU). Spearman $\rho$ refers to predicted latency. Density estimates (blue) are plotted for predicted Pareto-optimal points $x^{*}$.
  • ...and 10 more figures
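
Figure 5 refers to predicted Pareto-optimal points. As a minimal sketch of what that means for two objectives, the function below keeps the candidates that no other candidate beats on both accuracy (higher is better) and latency (lower is better). The function name `pareto_front` and the five (accuracy, latency) pairs are hypothetical and are not values from the paper.

```python
def pareto_front(points):
    """Return indices of non-dominated (accuracy, latency) pairs.

    A point is dominated if some other point is at least as good on both
    objectives and strictly better on at least one of them.
    """
    front = []
    for i, (acc_i, lat_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and lat_j <= lat_i and (acc_j > acc_i or lat_j < lat_i)
            for j, (acc_j, lat_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical predicted (accuracy, latency in ms) for five architectures.
predicted = [(0.90, 5.0), (0.92, 8.0), (0.88, 6.0), (0.94, 12.0), (0.91, 9.0)]
print(pareto_front(predicted))  # [0, 1, 3]
```

This pairwise check is quadratic in the number of candidates, which is acceptable for a sketch; sorting by one objective first would be the usual choice for large design spaces.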