Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?

Bonan Kou; Shengmai Chen; Zhijie Wang; Lei Ma; Tianyi Zhang

TL;DR

This study examines whether six large language models attend to the same parts of a task description as human programmers do when generating code. The authors construct a programmer-attention dataset covering 1,138 Python tasks and evaluate twelve attention-calculation methods spanning perturbation-based, gradient-based, and self-attention-based approaches. They find a consistent misalignment between model and human attention, although only a minority of generation errors are attributable to that misalignment. A perturbation-based method (BERT masking) aligns best with human attention, and a user study shows that programmers prefer its explanations, though their trust remains limited. The work offers practical guidance for choosing attention-calculation techniques, suggests directions for human-aligned LLMs and attention-aware training, and provides a public dataset to foster interpretability in code generation.

Abstract

Large Language Models (LLMs) have recently been widely used for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. We made the first attempt to bridge this knowledge gap by investigating whether LLMs attend to the same parts of a task description as human programmers during code generation. An analysis of six LLMs, including GPT-4, on two popular code generation benchmarks revealed a consistent misalignment between LLMs' and programmers' attention. We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors. Finally, a user study showed that model attention computed by a perturbation-based method is often favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.

Paper Structure (37 sections, 5 figures, 3 tables)

INTRODUCTION
MOTIVATION AND PRELIMINARIES
  Motivation
  Code Generation Benchmarks and Metrics
  Model Attention
    Self-attention-based Methods.
    Gradient-based Methods.
    Perturbation-based Methods.
THE CONSTRUCTION OF THE PROGRAMMER ATTENTION DATASET
METHODOLOGY
  Code Generation Models
  Model Attention Calculation
    Self-attention-based Methods.
    Gradient-based Methods.
    Perturbation-based Methods.
...and 22 more sections

Figures (5)

Figure 1: A Python function generated by CodeGen-2.7B (Nijkamp et al., 2022). The generated code is highlighted in green.
Figure 2: Attention matrix of the first attention head of the transformer layer in CodeGen-2.7B
Figure 3: Two examples of labeled prompts from our dataset.
Figure 4: Mapping NL words to LLM sub-tokens
Figure 5: Participants' choices over different attention calculation methods in three dimensions.
