Understanding How CodeLLMs (Mis)Predict Types with Activation Steering

Francesca Lucchetti; Arjun Guha

TL;DR

The paper asks whether CodeLLMs truly reason about code semantics by examining type prediction in gradually typed languages and exposing brittleness through semantics-preserving edits. It introduces adversarial, semantics-preserving prompts and an activation steering technique that uses per-layer steering vectors $\mathbf{v}^{\ell}$ to activate a latent type-prediction mechanism, yielding robust predictions across Python and TypeScript and across multiple model families. The results show that steering can recover correct type predictions on adversarial inputs, transfers across languages, and outperforms in-context prompting and random baselines, though it does not consistently improve type precision. This work points to a shared, language-agnostic semantic representation for type prediction within CodeLLMs and presents activation steering as a principled method to align model behavior with code semantics. Overall, the study provides a framework for probing and activating latent semantic capabilities in LLMs, with implications for cross-language code understanding and robust program analysis.

Abstract

Large Language Models (LLMs) are widely used by software engineers for programming tasks. However, research shows that LLMs often lack a deep understanding of program semantics. Even minor changes to syntax, such as renaming variables, can significantly degrade performance across various tasks. In this work, we examine the task of type prediction: given a partially typed program, can a model predict a missing type annotation such that the resulting program is more typed? We construct a dataset of adversarial examples where models initially predict the correct types, but begin to fail after semantically irrelevant edits. This is problematic, as models should ideally generalize across different syntactic forms of semantically equivalent code. This lack of robustness suggests that models may have a shallow understanding of code semantics. Despite this, we provide evidence that LLMs do, in fact, learn robust mechanisms for type prediction, though these mechanisms often fail to activate in adversarial scenarios. By using activation steering, a method that manipulates a model's internal activations to guide it toward using latent knowledge, we restore accurate predictions on adversarial inputs. We show that steering successfully activates a type prediction mechanism that is shared by both Python and TypeScript, and is more effective than prompting with in-context examples. Across five different models, our comprehensive evaluation demonstrates that LLMs can learn generalizable representations of code semantics that transfer across programming languages.
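To make the idea of activation steering concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a steering vector for one layer is the mean difference between activations recorded on prompts the model predicts correctly and activations recorded on their adversarial, edited counterparts, and that steering adds this vector to the activation at inference time. The function names, the `scale` parameter, and the toy two-dimensional activations are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(clean_acts: List[Vector], adversarial_acts: List[Vector]) -> Vector:
    """Assumed construction: mean(clean activations) - mean(adversarial activations)."""
    mu_clean = mean_vector(clean_acts)
    mu_adv = mean_vector(adversarial_acts)
    return [c - a for c, a in zip(mu_clean, mu_adv)]


def steer(activation: Vector, v: Vector, scale: float = 1.0) -> Vector:
    """Assumed application: add the (scaled) steering vector to one layer's activation."""
    return [h + scale * d for h, d in zip(activation, v)]


# Toy activations for a single layer (hypothetical values, not model outputs).
clean = [[1.0, 2.0], [3.0, 4.0]]
adversarial = [[0.0, 2.0], [2.0, 2.0]]

v = steering_vector(clean, adversarial)
steered = steer([0.5, 0.5], v)
```

In a real model the same addition would be applied to the hidden state of each steered layer, with one vector per layer.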

Paper Structure (29 sections, 1 equation, 65 figures)

Introduction
Background and Related Work
Classical type prediction and type inference
Neural type prediction
Mutation testing and program transformations
Activation Steering
Methodology
Adversarial Type Prediction Tasks
Type Prediction Prompt Format
Semantics-preserving Code Edits
Test sets and class balance
Finding the Type Prediction Mechanism
Constructing Steering Vectors
Results
Steering Improves Type Prediction on Out-of-Distribution Tasks
...and 14 more sections

Figures (65)

Figure 1: An example type prediction task, formulated for each type of model.
Figure 2: Examples of three semantics-preserving edits. The type prediction site is float. We ensure that each edit is internally consistent. E.g., in the variable-renaming example, when we rename the binding x to tmp, we also rename all references to the binding.
Figure 3: A fragment of a Python steering pair. The original code is 70 lines of text. The expected prediction is dict, but renaming config to __tmp0 makes the model mispredict Repository, a hallucinated type.
Figure 4: Steering accuracy for all models on the TypeScript test set, with steering on five consecutive layers. The models have a varying number of layers, so the $x$-axis is normalized: for a model with $n$ layers, $x=0$ indicates steering on the first five layers, and $x=1$ indicates steering on the last five layers.
Figure 5: Steering accuracy for StarCoderBase 7B on Python. The plots show steering on one, three, and five consecutive layers, respectively.
...and 60 more figures
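The normalized $x$-axis described in the Figure 4 caption can be sketched as a mapping from $x$ to a window of layer indices. The caption only fixes the endpoints ($x=0$ is the first five layers, $x=1$ the last five); the linear interpolation and rounding in between, the zero-based indexing, and the function name `steered_layers` are assumptions made for illustration.

```python
from typing import List


def steered_layers(x: float, n_layers: int, window: int = 5) -> List[int]:
    """Map a normalized position x in [0, 1] to `window` consecutive layer indices.

    Assumption: positions between the endpoints are interpolated linearly
    and rounded to the nearest valid starting layer.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if n_layers < window:
        raise ValueError("model has fewer layers than the steering window")
    start = round(x * (n_layers - window))
    return list(range(start, start + window))


first = steered_layers(0.0, 40)  # first five layers of a 40-layer model
last = steered_layers(1.0, 40)   # last five layers of a 40-layer model
```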

Theorems & Definitions (1)

Definition 1: Type Prediction
