Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration

Matteo Paltenghi, Rahul Pandita, Austin Z. Henley, Albert Ziegler

TL;DR

This work probes whether transformer attention signals from code-focused LLMs align with developers' code exploration during sensemaking tasks. It introduces an eye-tracking dataset (92 sessions, 25 developers) and a zero-shot prompting protocol to compare three open models (CodeGen, InCoder, GPT-J) against human behavior, using novel extraction methods such as follow-up attention to derive token- and line-level interaction cues. The key finding is that follow-up attention achieves the strongest alignment with human attention and can predict the next line a developer will inspect with $47\%$ accuracy, outperforming a $42.3\%$ history-based baseline. The results suggest that attention signals from pre-trained models can be leveraged to support code exploration and IDE tooling, with implications for context prioritization and interactive developer tools.

Abstract

Recent neural models of code, such as OpenAI Codex and AlphaCode, have demonstrated remarkable proficiency at code generation due to the underlying attention mechanism. However, it often remains unclear how the models actually process code, and to what extent their reasoning and the way their attention mechanism scans the code matches the patterns of developers. A poor understanding of the model reasoning process limits the way in which current neural models are leveraged today, so far mostly for their raw prediction. To fill this gap, this work studies how the processed attention signal of three open large language models - CodeGen, InCoder and GPT-J - agrees with how developers look at and explore code when each answers the same sensemaking questions about code. Furthermore, we contribute an open-source eye-tracking dataset comprising 92 manually-labeled sessions from 25 developers engaged in sensemaking tasks. We empirically evaluate five heuristics that do not use the attention and ten attention-based post-processing approaches of the attention signal of CodeGen against our ground truth of developers exploring code, including the novel concept of follow-up attention which exhibits the highest agreement between model and human attention. Our follow-up attention method can predict the next line a developer will look at with 47% accuracy. This outperforms the baseline prediction accuracy of 42.3%, which uses the session history of other developers to recommend the next line. These results demonstrate the potential of leveraging the attention signal of pre-trained models for effective code exploration.
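
To make the next-line prediction task concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical line-by-line interaction matrix in which entry `[i][j]` scores how strongly line `i` points to line `j`; the matrix values, the `transitions` list, and both function names are illustrative assumptions. The predicted next line is the highest-scoring line other than the current one, and accuracy is the fraction of observed transitions that match the prediction.

```python
def predict_next_line(interaction, current_line):
    """Return the index of the highest-scoring line other than the current one.

    `interaction` is a hypothetical square matrix (list of lists) where
    interaction[i][j] scores the link from line i to line j.
    """
    candidates = [j for j in range(len(interaction)) if j != current_line]
    return max(candidates, key=lambda j: interaction[current_line][j])


def top1_accuracy(interaction, transitions):
    """Fraction of observed (source, destination) moves predicted correctly."""
    hits = sum(
        1 for src, dst in transitions
        if predict_next_line(interaction, src) == dst
    )
    return hits / len(transitions)


# Toy data for four lines of code (made up for illustration only).
interaction = [
    [0.0, 0.6, 0.3, 0.1],
    [0.2, 0.0, 0.7, 0.1],
    [0.5, 0.1, 0.0, 0.4],
    [0.1, 0.2, 0.7, 0.0],
]
# Hypothetical observed moves of a developer: (line left, line visited next).
transitions = [(0, 1), (1, 2), (2, 3), (3, 2)]

accuracy = top1_accuracy(interaction, transitions)
print(f"top-1 next-line accuracy: {accuracy:.2f}")
```

In this toy setup the same scoring function could be fed by a model-derived matrix or by a matrix built from other developers' session histories, which is the kind of comparison the abstract reports.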

Paper Structure (24 sections, 2 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 24 sections, 2 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Example of a sensemaking task, with the code and the question to be answered in the bottom comment. Completely empty lines have been removed for space reasons.
  • Figure 2: Overview of the three extraction functions for the visual attention vector and the interaction matrix, both follow-up and mean. Note that $a$ and $b$ represent specific aggregation functions as explained in the text (e.g., mean, max or sum). The darker the red color, the more attention is paid by the token in row $i$ to the token in column $j$.
  • Figure 3: Example of two events where the yellow area corresponds to their contribution to the connection strength from token $i$ to token $j$.
  • Figure 4: The strength of the connection ${S}_{i,j}$ depends significantly on the difference $i - j$. Both cases $i>j$ and $j>i$ can be well modelled using a Weibull distribution.
  • Figure 5: Percentage of correct, wrong, and partially correct answers for developers and model.
  • ...and 4 more figures