Inside Out: Uncovering How Comment Internalization Steers LLMs for Better or Worse
Aaron Imani, Mohammad Moshirpour, Iftekhar Ahmed
TL;DR
This paper presents the first concept-level interpretability study of LLMs in software engineering (SE). It shows that LLMs internalize code comments as distinct latent concepts and encode different comment types (Javadoc, inline, multiline) with varying robustness. Using Concept Activation Vectors (CAVs), the authors demonstrate that manipulating these concepts in the embedding space causally affects SE tasks such as code translation, completion, and refinement, with effects that depend strongly on the task and the model. Activation patterns also differ by task: code summarization triggers the strongest activation of the comment concept and code completion the weakest. The findings hold across three distinct LLM variants. The work suggests new avenues for steerable SE tools that operate on internal concept representations rather than solely on surface inputs, with implications for prompt design, data curation, and targeted interventions.
Abstract
While comments are non-functional elements of source code, Large Language Models (LLMs) frequently rely on them to perform Software Engineering (SE) tasks. Yet, where in the model this reliance resides, and how it affects performance, remains poorly understood. We present the first concept-level interpretability study of LLMs in SE, analyzing three tasks (code completion, translation, and refinement) through the lens of internal comment representation. Using Concept Activation Vectors (CAVs), we show that LLMs not only internalize comments as distinct latent concepts but also differentiate between subtypes such as Javadoc, inline, and multiline comments. By systematically activating and deactivating these concepts in the LLMs' embedding space, we observed significant, model-specific, and task-dependent shifts in performance ranging from -90% to +67%. Finally, we conducted a controlled experiment using the same set of code inputs, prompting LLMs to perform 10 distinct SE tasks while measuring the activation of the comment concept within their latent representations. We found that code summarization consistently triggered the strongest activation of comment concepts, whereas code completion elicited the weakest sensitivity. These results open a new direction for building SE tools and models that reason about and manipulate internal concept representations rather than relying solely on surface-level input.
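To make the idea of activating and deactivating a concept more concrete, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes a difference-of-means concept direction (a simplified stand-in for the linear classifier that CAVs are usually derived from) and uses synthetic vectors in place of real LLM activations. The names `compute_cav`, `concept_score`, and `steer`, the dimensionality, and the steering strength `alpha` are all hypothetical.

```python
import numpy as np


def compute_cav(concept_acts, other_acts):
    """Unit vector pointing from the non-concept mean to the concept mean.

    Assumption: a difference-of-means direction, used here as a simple
    substitute for the normal vector of a trained linear classifier.
    """
    direction = concept_acts.mean(axis=0) - other_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def concept_score(acts, cav):
    """Projection of each activation vector onto the concept direction."""
    return acts @ cav


def steer(acts, cav, alpha):
    """Shift activations along the concept direction.

    alpha > 0 pushes toward the concept (activation);
    alpha < 0 pushes away from it (deactivation).
    """
    return acts + alpha * cav


# Synthetic stand-ins for hidden states (hypothetical data, not model output):
# inputs "with comments" are offset along the first axis.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 3.0
with_comments = rng.normal(size=(200, dim)) + offset
without_comments = rng.normal(size=(200, dim))

cav = compute_cav(with_comments, without_comments)

base = concept_score(without_comments, cav).mean()
boosted = concept_score(steer(without_comments, cav, 2.0), cav).mean()
suppressed = concept_score(steer(with_comments, cav, -2.0), cav).mean()
```

Because `cav` has unit length, steering by `alpha` changes the mean concept score by exactly `alpha`. With a real model, the activations would come from a chosen layer, and the steered vectors would be written back into the forward pass before measuring task performance.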
