Risk Assessment Framework for Code LLMs via Leveraging Internal States

Yuheng Huang, Lei Ma, Keizaburo Nishikino, Takumi Akazaki

TL;DR

PtTrust tackles trustworthiness gaps in code LLMs by introducing a two-stage pre-training framework that learns internal-state representations through unsupervised learning and grounds them to risk signals via a semantic-binding stage. The approach enables fine-grained line-level risk detection and cross-language generalization, achieving strong performance across multiple datasets with limited labeled data. It emphasizes interpretability by producing sparse, monosemantic latent features that highlight erroneous lines, aiding developer trust and workflow integration. Overall, PtTrust offers a scalable, industry-ready mechanism for risk-aware interaction with code-generation LLMs and lays groundwork for broader, pre-training–driven risk assurance in AI systems.

Abstract

The pre-training paradigm plays a key role in the success of Large Language Models (LLMs), which are widely recognized as one of the most significant recent advances in AI. Building on these breakthroughs, code LLMs with advanced coding capabilities are having a major impact on software engineering and are becoming an essential part of developers' daily routines. However, current code LLMs still face serious trustworthiness challenges, as they can generate incorrect, insecure, or unreliable code. Recent exploratory studies suggest that such risky outputs can be detected by analyzing LLMs' internal states, akin to how the human brain unconsciously recognizes its own mistakes. Yet most of these approaches are limited to narrow sub-domains of LLM operations and fall short of industry-level scalability and practicality. To address these challenges, we propose PtTrust, a two-stage risk assessment framework for code LLMs based on internal state pre-training, designed to integrate seamlessly with the existing infrastructure of software companies. The core idea is that the risk assessment framework can itself undergo a pre-training process similar to that of LLMs. Specifically, PtTrust first performs unsupervised pre-training on large-scale unlabeled source code to learn general representations of LLM states. It then uses a small labeled dataset to train a risk predictor. We demonstrate the effectiveness of PtTrust through fine-grained, line-level risk assessment of code and show that it generalizes across tasks and programming languages. Further experiments reveal that PtTrust provides highly intuitive and interpretable features, fostering greater user trust. We believe PtTrust is a promising step toward scalable and trustworthy assurance for code LLMs.
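
The two-stage idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the hidden states are synthetic Gaussian vectors rather than real LLM activations, the representation learner is assumed to be a small tied-weight sparse autoencoder (ReLU latents with an L1 penalty), the risk predictor is assumed to be logistic regression, and all names (`make_state`, `pretrain`, `train_predictor`, `risk_score`) and hyperparameters are hypothetical.

```python
import math
import random

random.seed(0)
DIM, LATENT = 4, 6  # toy sizes (assumption); real hidden states are far wider


def make_state(risky):
    """Synthetic stand-in for one code line's pooled LLM hidden state."""
    shift = 1.5 if risky else -1.5
    return [random.gauss(shift if i == 0 else 0.0, 1.0) for i in range(DIM)]


def encode(W, b, x):
    """Non-negative latent features: ReLU(W x + b)."""
    return [max(0.0, sum(W[j][i] * x[i] for i in range(DIM)) + b[j])
            for j in range(LATENT)]


def decode(W, z):
    """Tied-weight reconstruction: W^T z."""
    return [sum(W[j][i] * z[j] for j in range(LATENT)) for i in range(DIM)]


def recon_loss(W, b, data):
    """Mean squared reconstruction error over a list of states."""
    total = 0.0
    for x in data:
        x_hat = decode(W, encode(W, b, x))
        total += sum((h - v) ** 2 for h, v in zip(x_hat, x))
    return total / len(data)


def pretrain(data, epochs=30, lr=0.005, l1=0.01):
    """Stage 1 (unsupervised): fit the autoencoder on unlabeled states."""
    W = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(LATENT)]
    b = [0.0] * LATENT
    before = recon_loss(W, b, data)
    for _ in range(epochs):
        for x in data:
            z = encode(W, b, x)
            err = [h - v for h, v in zip(decode(W, z), x)]
            for j in range(LATENT):
                if z[j] <= 0.0:
                    continue  # ReLU gate: an inactive unit gets no gradient
                # Gradient w.r.t. the pre-activation: L1 term + reconstruction term.
                g = l1 + 2.0 * sum(err[i] * W[j][i] for i in range(DIM))
                for i in range(DIM):
                    # Encoder path (g * x) plus decoder path (2 * err * z).
                    W[j][i] -= lr * (g * x[i] + 2.0 * err[i] * z[j])
                b[j] -= lr * g
    return W, b, before, recon_loss(W, b, data)


def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))


def train_predictor(feats, labels, epochs=200, lr=0.1):
    """Stage 2 (supervised): logistic regression on frozen latent features."""
    w, c = [0.0] * LATENT, 0.0
    for _ in range(epochs):
        for z, y in zip(feats, labels):
            g = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + c) - y
            w = [wi - lr * g * zi for wi, zi in zip(w, z)]
            c -= lr * g
    return w, c


def risk_score(W, b, w, c, x):
    """Risk estimate in [0, 1] for one state."""
    z = encode(W, b, x)
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + c)


# Stage 1: a larger pool of unlabeled states.
unlabeled = [make_state(random.random() < 0.5) for _ in range(200)]
W, b, loss_before, loss_after = pretrain(unlabeled)

# Stage 2: a small labeled set (1 = risky line, 0 = safe line).
labels = [i % 2 for i in range(40)]
labeled = [make_state(bool(y)) for y in labels]
feats = [encode(W, b, x) for x in labeled]
w, c = train_predictor(feats, labels)

score = risk_score(W, b, w, c, make_state(True))
print(f"reconstruction loss: {loss_before:.3f} -> {loss_after:.3f}")
print(f"risk score for a synthetic risky line: {score:.3f}")
```

In this sketch the encoder is frozen after stage 1, so the labeled data only has to fit the few parameters of the predictor. That is one plausible reading of why a small labeled dataset could suffice, not a claim about how PtTrust is implemented.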

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The Summarized Workflow of PtTrust.
  • Figure 3: Activation difference between buggy and correct code lines on Code Llama. Latent 121 is highlighted.
  • Figure 4: Activation difference between buggy and correct code lines on StarCoder2. Latent 44 is highlighted.
  • Figure 5: Activation difference between buggy and correct code lines on Qwen2.5-Coder. Latent 63 is highlighted.
  • Figure : Code Llama
  • ...and 3 more figures