The Code Barrier: What LLMs Actually Understand?

Serge Lionel Nikiema, Jordan Samhi, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé

TL;DR

The study investigates whether large language models truly understand code semantics by applying graduated obfuscation to Java code from CodeNet and evaluating 13 models on three tasks: description generation, description of obfuscated code, and deobfuscation. It finds that description quality degrades as obfuscation increases, that lexical cues (e.g., variable names) are particularly influential, and that general-purpose models sometimes outperform code-specialized ones. Deobfuscation achieves higher functional success than description generation, but semantic preservation is inconsistent, which points to gaps in how current models represent code. The work introduces Obscura as a benchmark for semantic code understanding and reverse-engineering tasks, offering empirical baselines to guide the development of robust, obfuscation-resistant code analysis tools.

Abstract

Understanding code is a core ability needed for automating software development tasks. While foundation models such as LLMs show impressive results across many software engineering challenges, the extent of their true semantic understanding beyond simple token recognition remains unclear. This research uses code obfuscation as a structured testing framework to evaluate LLMs' semantic understanding capabilities. We methodically apply controlled obfuscation changes to source code and measure comprehension through two complementary tasks: generating accurate descriptions of obfuscated code and performing deobfuscation, a skill with important implications for reverse engineering applications. Our testing approach includes 13 cutting-edge models, covering both code-specialized (e.g., StarCoder2) and general-purpose (e.g., GPT-4o) architectures, evaluated on a benchmark created from CodeNet and consisting of 250 filtered Java programming problems and their solutions. Findings show a statistically significant performance decline as obfuscation complexity increases, with unexpected resilience shown by general-purpose models compared to their code-focused counterparts. While some models successfully identify obfuscation techniques, their ability to reconstruct the underlying program logic remains constrained, suggesting limitations in their semantic representation mechanisms. This research introduces a new evaluation approach for assessing code comprehension in language models and establishes empirical baselines for advancing research in security-critical code analysis applications such as reverse engineering and adversarial code analysis.
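To make the role of lexical cues concrete, the following is a minimal Java sketch of one obfuscation step, identifier renaming. It is an illustration only: the class `RenameSketch`, the helper `renameIdentifiers`, and the sample method are hypothetical and are not taken from the paper, whose actual obfuscation tooling and levels are not described in this summary. The sketch shows that renaming leaves behavior unchanged while removing the names a reader (or a model) might rely on.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of identifier renaming; not the paper's tooling.
class RenameSketch {

    // Replaces whole-word identifiers according to a fixed mapping.
    // Assumption: plain text substitution is enough for this toy input;
    // a real obfuscator would operate on a parsed syntax tree.
    static String renameIdentifiers(String source, Map<String, String> mapping) {
        String result = source;
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(entry.getKey()) + "\\b");
            result = wholeWord.matcher(result)
                              .replaceAll(Matcher.quoteReplacement(entry.getValue()));
        }
        return result;
    }

    // Readable version: the names reveal the intent.
    static int sumOfEvens(int[] values) {
        int total = 0;
        for (int value : values) {
            if (value % 2 == 0) {
                total += value;
            }
        }
        return total;
    }

    // Same logic after renaming: the semantics are identical, the lexical cues are gone.
    static int a(int[] b) {
        int c = 0;
        for (int d : b) {
            if (d % 2 == 0) {
                c += d;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("sumOfEvens", "a");
        mapping.put("values", "b");
        mapping.put("total", "c");
        mapping.put("value", "d");

        String original = "int sumOfEvens(int[] values) { int total = 0; "
                + "for (int value : values) if (value % 2 == 0) total += value; "
                + "return total; }";
        System.out.println(renameIdentifiers(original, mapping));

        // Both versions compute the same result on the same input.
        int[] sample = {1, 2, 3, 4};
        System.out.println(sumOfEvens(sample) == a(sample));
    }
}
```

A model that describes `a` as accurately as `sumOfEvens` is reasoning about the loop and the parity check rather than about the names, which is the distinction the obfuscation-based evaluation is designed to probe.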

Paper Structure

This paper contains 37 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Description generation process and comparison against ground-truth
  • Figure 2: Deobfuscation process
  • Figure 3: LLMs' effectiveness in code description (with or without comments)
  • Figure 4: Effectiveness of LLMs at generating descriptions for problems with comments (Japanese vs. English translations)
  • Figure 5: Comment-to-code ratio: English vs. Japanese (translated into English)
  • ...and 5 more figures