When Names Disappear: Revealing What LLMs Actually Understand About Code

Cuong Chi Le, Minh V. T. Pham, Cuong Duc Van, Hoang N. Phan, Huy N. Phan, Tien N. Nguyen

TL;DR

This work argues that code understanding in large language models emerges from two channels: structural semantics and human-interpretable naming. By applying a suite of semantics-preserving obfuscations that disrupt names while preserving behavior, the authors systematically probe whether LLMs rely on naming cues for intent and execution tasks. They show that code summarization on real-world, naming-rich data degrades markedly under obfuscation, while algorithmic code remains comparatively robust, and that execution prediction can also falter, revealing memorization shortcuts tied to identifiers. The release of ClassEval-Obf provides a more reliable benchmark for evaluating true semantic reasoning in code understanding, aiming to reduce inflated performance from naming leakage and promote robust generalization in LLMs.

Abstract

Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs' code understanding and generalization.
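To make "semantics-preserving obfuscation" concrete, the sketch below renames identifiers in a small Python function while leaving its structure and behavior untouched. It is only an illustration of the general idea, not the authors' tooling: the class `RenameIdentifiers`, the helper `obfuscate`, the placeholder scheme `v0, v1, ...`, and the sample function are all hypothetical. It assumes Python 3.9+ (for `ast.unparse`) and self-contained code with no imports, attribute access, or keyword arguments at call sites, which a real obfuscator would have to handle.

```python
import ast
import builtins


class RenameIdentifiers(ast.NodeTransformer):
    """Replace user-defined names with opaque placeholders (v0, v1, ...)."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name):
        # Builtins such as sum and len keep their names so behavior is unchanged.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)  # continue into parameters and body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def obfuscate(source):
    """Return source code with identifiers renamed and structure preserved."""
    tree = RenameIdentifiers().visit(ast.parse(source))
    return ast.unparse(tree)


# Hypothetical sample: the names carry the intent, the structure carries the behavior.
ORIGINAL = """
def average_price(prices, discount):
    total = sum(prices)
    return (total - discount) / len(prices)
"""

print(obfuscate(ORIGINAL))
```

The obfuscated function computes the same values as the original, but nothing in its text hints at prices or discounts any longer. That is the condition under which the abstract reports summaries collapsing into line-by-line narration.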

Paper Structure

This paper contains 19 sections, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Names as semantic anchors for summarization. LLM produces an intent-level summary with the original identifiers (left), but collapses to line-by-line narration after name-only obfuscation (right), despite identical structure and behavior.
  • Figure 2: Distribution of variable name lengths in ClassEval and LiveCodeBench.
  • Figure 3: Illustration of the four obfuscation strategies applied in our study.
  • Figure 4: Qualitative example: GPT-4o’s step-by-step reasoning on the original (left) and an Ambiguous identifiers obfuscation (right) of the same program yields the same final correct result (7.0).
  • Figure 5: Execution prediction performance on original vs. obfuscated ClassEval (high-complexity subset).
  • ...and 7 more figures