Analyzing Latent Concepts in Code Language Models
Arushi Sharma, Vedant Pungliya, Christopher J. Quinn, Ali Jannesari
TL;DR
Code Concept Analysis (CoCoA) addresses the interpretability gap in code language models by globally discovering latent lexical, syntactic, and semantic concepts through layer-wise clustering of token representations. It introduces CodeConceptNet (CoCoNet) and an LLM-assisted annotation pipeline to label concept clusters, and demonstrates how concept-grounded explanations can improve local attributions when combined with Integrated Gradients. In CodeNet-based experiments on CodeBERT, UniXCoder, and DeepSeekCoder, CoCoA reveals robust, progressively abstract concepts that evolve with fine-tuning and generalize across models, and it provides a practical framework for explanation and bias detection. In a user study, concept-augmented explanations improved human-centric explainability by 37 percentage points, and the work shows that prompt engineering can markedly improve the quality and consistency of generated labels. Overall, CoCoA offers a scalable, model-centric route to interpretability for code language models, with concrete annotations and explanations.
Abstract
Interpreting the internal behavior of large language models trained on code remains a critical challenge, particularly for applications demanding trust, transparency, and semantic robustness. We propose Code Concept Analysis (CoCoA): a global post-hoc interpretability framework that uncovers emergent lexical, syntactic, and semantic structures in a code language model's representation space by clustering contextualized token embeddings into human-interpretable concept groups. To label these groups, we introduce a hybrid annotation pipeline that combines syntactic alignment based on static analysis tools with prompt-engineered large language models (LLMs), enabling scalable labeling of latent concepts across abstraction levels. We analyze the distribution of concepts across layers and across three fine-tuning tasks. Emergent concept clusters can help surface unexpected latent interactions and reveal trends and biases within the model's learned representations. We further integrate CoCoA with local attribution methods to produce concept-grounded explanations, improving the coherence and interpretability of token-level saliency. Empirical evaluations across multiple models and tasks show that CoCoA discovers concepts that remain stable under semantic-preserving perturbations (average Cluster Sensitivity Index, CSI = 0.288) and evolve predictably with fine-tuning. In a user study on the programming-language classification task, concept-augmented explanations disambiguated token roles and improved human-centric explainability by 37 percentage points compared with token-level attributions using Integrated Gradients.
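
To make the core idea of clustering token representations into concept groups concrete, the following is a minimal sketch rather than the authors' implementation. It assumes average-linkage agglomerative clustering with Euclidean distance (the text above does not specify the algorithm) and uses hand-made 2-D vectors as hypothetical stand-ins for one layer's contextualized token embeddings. The names `agglomerative_clusters`, `tokens`, and `embeddings` are illustrative only.

```python
from itertools import combinations
from math import dist


def agglomerative_clusters(points, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of clusters a and b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters (x < y by construction).
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: tokens and made-up embeddings for a single layer.
# Real embeddings would come from a code language model and be high-dimensional.
tokens = ["for", "while", "if", "x", "y", "count"]
embeddings = [(0.9, 0.1), (1.0, 0.0), (0.8, 0.2),
              (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]

clusters = agglomerative_clusters(embeddings, n_clusters=2)
concepts = [[tokens[i] for i in cluster] for cluster in clusters]
print(concepts)
```

On this toy input the two resulting groups separate control-flow keywords from identifiers, which is the kind of cluster an annotation step could then label. In practice the number of clusters, the distance metric, and the layer to analyze are all choices that would need to be set per model.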
