Understanding How CodeLLMs (Mis)Predict Types with Activation Steering
Francesca Lucchetti, Arjun Guha
TL;DR
The paper asks whether CodeLLMs truly reason about code semantics by examining type prediction in gradually typed languages and exposing brittleness under semantics-preserving edits. It constructs adversarial prompts from such edits and introduces an activation steering technique that uses per-layer steering vectors $\mathbf{v}^{\ell}$ to activate a latent type-prediction mechanism, restoring robust predictions in Python and TypeScript across multiple model families. The results show that steering can recover correct type predictions on adversarial inputs, transfers across languages, and outperforms both in-context prompting and random baselines, though it does not consistently improve type precision. The work provides evidence of a shared, language-agnostic semantic representation for type prediction within CodeLLMs and presents activation steering as a principled method for aligning model behavior with code semantics. Overall, the study offers a framework for probing and activating latent semantic capabilities in LLMs, with implications for cross-language code understanding and robust program analysis.
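To make the idea of a per-layer steering vector concrete, the following is a minimal sketch of one common construction from the activation steering literature: take the difference between the mean activation on prompts the model handles correctly and the mean activation on adversarial prompts, then add that difference to the hidden state at inference time. This summary does not specify how $\mathbf{v}^{\ell}$ is computed, so the mean-difference recipe, the scaling factor `alpha`, the function names, and the toy numbers are all assumptions for illustration only.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(clean_acts: List[Vector], adversarial_acts: List[Vector]) -> Vector:
    """Assumed construction: mean(clean) - mean(adversarial) at one layer."""
    mu_clean = mean(clean_acts)
    mu_adv = mean(adversarial_acts)
    return [c - a for c, a in zip(mu_clean, mu_adv)]


def steer(hidden: Vector, v: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * x for h, x in zip(hidden, v)]


# Hypothetical 3-dimensional activations recorded at a single layer.
clean = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
adversarial = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(clean, adversarial)
steered = steer([0.0, 2.0, 2.0], v)
```

In a real model the same computation would be repeated for each steered layer, with activations read from and written back to the network (for example through forward hooks), rather than applied to hand-written lists.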
Abstract
Large Language Models (LLMs) are widely used by software engineers for programming tasks. However, research shows that LLMs often lack a deep understanding of program semantics. Even minor changes to syntax, such as renaming variables, can significantly degrade performance across various tasks. In this work, we examine the task of type prediction: given a partially typed program, can a model predict a missing type annotation such that the resulting program is more typed? We construct a dataset of adversarial examples where models initially predict the correct types, but begin to fail after semantically irrelevant edits. This is problematic, as models should ideally generalize across different syntactic forms of semantically equivalent code. This lack of robustness suggests that models may have a shallow understanding of code semantics. Despite this, we provide evidence that LLMs do, in fact, learn robust mechanisms for type prediction, though these mechanisms often fail to activate in adversarial scenarios. By using activation steering, a method that manipulates a model's internal activations to guide it toward using latent knowledge, we restore accurate predictions on adversarial inputs. We show that steering successfully activates a type prediction mechanism that is shared by both Python and TypeScript, and is more effective than prompting with in-context examples. Across five different models, our comprehensive evaluation demonstrates that LLMs can learn generalizable representations of code semantics that transfer across programming languages.
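As an illustration of a semantically irrelevant edit of the kind mentioned above, the sketch below renames a variable with Python's standard `ast` module and checks that the program's behavior is unchanged. The example function `total_price`, the replacement name `tmp`, and the helpers `rename_variable` and `run` are hypothetical and not taken from the paper's dataset. The renamer is deliberately naive: it ignores scoping and keyword arguments at call sites, so it is suitable only for small, self-contained snippets.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every use of one identifier (naive: no scope analysis)."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Variable reads and writes inside expressions and statements.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Function parameters; visit children so annotations are handled too.
        self.generic_visit(node)
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_variable(source: str, old: str, new: str) -> str:
    """Return the source with `old` renamed to `new`."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def run(source: str, *args):
    """Execute the snippet and call its `total_price` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["total_price"](*args)


ORIGINAL = "def total_price(prices, tax):\n    return sum(prices) * (1 + tax)\n"
EDITED = rename_variable(ORIGINAL, "prices", "tmp")

# Both versions compute the same result, so the correct type for the
# renamed parameter is unchanged even though its name is less informative.
same_behavior = run(ORIGINAL, [1.0, 2.0], 0.5) == run(EDITED, [1.0, 2.0], 0.5)
```

A model that relies on the name `prices` as a cue for a list type has less to go on after the rename, even though the body still constrains the parameter in exactly the same way.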
