Code vs Serialized AST Inputs for LLM-Based Code Summarization: An Empirical Study

Shijia Dong, Haoruo Zhao, Paul Harvey

TL;DR

The paper investigates whether fully serialized ASTs can match or exceed code-based inputs for LLM-based code summarization. It introduces AST(NIT), an augmentation and serialization pipeline that preserves lexical details and encodes tree structure into a compact linear sequence using Node-Index Traversal. In experiments on the CodeXGLUE Python subset with LoRA fine-tuning of LLaMA-3.1-8B, serialized ASTs achieve summarization quality comparable to raw code while offering substantial efficiency gains over the Structure-Based Traversal baseline. The findings suggest that the emergent capabilities of large pre-trained LLMs may reduce the need for explicit AST encodings. AST(NIT) nevertheless remains a practical option in data-limited scenarios and may extend to other languages and tasks.

Abstract

Summarizing source code into natural language descriptions (code summarization) helps developers better understand program functionality and reduce the burden of software maintenance. Abstract Syntax Trees (ASTs), as opposed to source code, have been shown to improve summarization quality in traditional encoder-decoder-based code summarization models. However, most large language model (LLM)-based code summarization methods rely on raw code or only incorporate partial AST signals, meaning that the potential of complete AST representation has not been fully explored for LLMs. This paper presents AST(NIT), an AST augmentation and serialization method that preserves lexical details and encodes structural information into LLM-compatible sequences. Experiments with the LLaMA-3.1-8B model on the CodeXGLUE Python dataset show that the proposed serialized ASTs reduce the length of LLM inputs, require shorter training times, and achieve summarization quality comparable to existing approaches.
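
To make the idea of a serialized AST concrete, the following is a minimal sketch of flattening a Python AST into a single linear sequence. It is an illustration only and is not the paper's AST(NIT) format: the `index:parent:label` token layout, the `=` lexeme separator, and the function name `serialize_ast` are all assumptions introduced here. It relies only on Python's standard `ast` module (Python 3.8 or later).

```python
import ast


def serialize_ast(source: str) -> str:
    """Flatten a Python AST into space-separated `index:parent:label` tokens.

    Hypothetical format for illustration; NOT the paper's AST(NIT) encoding.
    """
    tree = ast.parse(source)
    tokens = []
    # Explicit stack for a pre-order walk; the root has parent index -1.
    stack = [(tree, -1)]
    while stack:
        node, parent = stack.pop()
        index = len(tokens)  # indices follow pre-order visiting order
        label = type(node).__name__

        # Keep lexical detail (identifiers, literals) next to the node type.
        lexeme = None
        if isinstance(node, ast.Name):
            lexeme = node.id
        elif isinstance(node, ast.arg):
            lexeme = node.arg
        elif isinstance(node, ast.FunctionDef):
            lexeme = node.name
        elif isinstance(node, ast.Attribute):
            lexeme = node.attr
        elif isinstance(node, ast.Constant):
            # Caveat: repr() of a string literal may contain spaces, which a
            # real pipeline would need to escape.
            lexeme = repr(node.value)
        if lexeme is not None:
            label = f"{label}={lexeme}"

        tokens.append(f"{index}:{parent}:{label}")

        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(list(ast.iter_child_nodes(node))):
            stack.append((child, index))
    return " ".join(tokens)


if __name__ == "__main__":
    print(serialize_ast("def add(a, b):\n    return a + b"))
```

Because each token carries its own index and its parent's index, the tree shape can be recovered from the flat sequence without brackets or nesting markers. Whether AST(NIT) uses a comparable scheme is not stated in this summary.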

Paper Structure (22 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Side-by-side comparison of (a) the raw AST and (b) the augmented AST for Listing 1.
  • Figure 2: The overall workflow of the proposed method AST(NIT) for code summarization with serialized AST inputs.