Enhancing Code LLM Training with Programmer Attention

Yifan Zhang, Chen Huang, Zachary Karas, Dung Thuy Nguyen, Kevin Leach, Yu Huang

TL;DR

The paper addresses how to leverage programmer attention signals from eye-tracking to improve code LLM training. It introduces a three-stage HumanLLM pipeline (data collection, human-centric augmentation, and reward-based fine-tuning of CodeT5) that grounds model learning in real reading behavior. Empirical results on CodeXGlue Java summarization show substantial gains in CodeBLEU, Syntax, and Dataflow, though transfer to completion and translation is uneven, highlighting task-dependent benefits. The work demonstrates the potential of integrating cognitive signals into AI4SE and points to future directions for broader application across software engineering tasks.

Abstract

Human attention provides valuable yet underexploited signals for code LLM training, offering a perspective beyond purely machine-driven attention. Eye-tracking data is complex and costly to collect, and there has been limited progress in systematically using these signals for code LLM training. To address both issues, we propose a cohesive pipeline spanning augmentation and reward-based fine-tuning. Specifically, we introduce (1) an eye-tracking path augmentation method to expand programmer attention datasets, (2) a pattern abstraction step that refines raw fixations into learnable attention motifs, and (3) a reward-guided strategy for integrating these insights directly into a CodeT5 supervised fine-tuning process. Our experiments yield +7.16 in CodeBLEU on the CodeXGlue benchmark for code summarization, underscoring how uniting human and machine attention can boost code intelligence. We hope this work encourages broader exploration of human-centric methods in next-generation AI4SE.

Paper Structure

This paper contains 27 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our pipeline (HumanLLM): ① collects eye-tracking data (red) to capture real programmer attention; ② augments these fixations with AST-based adjacency (blue) and k-gram patterns; and ③ uses these human signals to guide a reward-based CodeT5 fine-tuning.
  • Figure 2: Impact of adjacency expansions (0 to 3 lines) on semantic (left) and positional (right) labels. Wider windows generally improve Precision, Recall, and F1, exceeding a baseline Transformer (dashed lines) in five of six metrics.
  • Figure 3: Learning curves for semantic (left) and positional (right) labels over one epoch, shown via the batch-size ratio. Dashed lines represent the baseline Transformer. Larger ratios correlate with better test-set performance.
  • Figure 4: Overview of function types (left) and token-length distribution (right) for the ground-truth snippets in our dataset. The histogram on the left shows how frequently each function type appears, while the box plot on the right illustrates variability in token counts across different types.
  • Figure 5: Syntax (left) and Data Flow (right) scores by function type for Baseline vs. HumanLLM. HumanLLM generally outperforms the Baseline by better capturing control structures and variable interactions, aligning more closely with real developer fixations.
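
The captions for Figures 1 and 2 mention two augmentation ideas: expanding fixations to adjacent lines (windows of 0 to 3 lines) and abstracting scan paths into k-gram patterns. The following is a minimal sketch of what such steps might look like, not the authors' implementation. It assumes a plain line-distance window rather than the AST-based adjacency the paper describes, and the function names `expand_fixated_lines` and `kgram_counts`, as well as the example labels, are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def expand_fixated_lines(fixated: Iterable[int], window: int, num_lines: int) -> Set[int]:
    """Return the fixated lines plus every line within `window` lines of one.

    Lines are 1-indexed and clipped to the snippet length `num_lines`.
    Assumption: adjacency is plain line distance, not AST structure.
    """
    expanded: Set[int] = set()
    for line in fixated:
        lo = max(1, line - window)
        hi = min(num_lines, line + window)
        expanded.update(range(lo, hi + 1))
    return expanded


def kgram_counts(labels: Sequence[str], k: int) -> Counter:
    """Count contiguous k-grams of fixation labels along one scan path."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(tuple(labels[i:i + k]) for i in range(len(labels) - k + 1))


if __name__ == "__main__":
    # Hypothetical scan path: the reader fixated lines 2 and 7 of an 8-line method.
    print(sorted(expand_fixated_lines([2, 7], window=1, num_lines=8)))

    # Hypothetical semantic labels for consecutive fixations.
    path = ["signature", "loop", "condition", "loop", "condition", "return"]
    print(kgram_counts(path, k=2).most_common(1))
```

With `window=0` the expansion returns only the original fixations, which corresponds to the leftmost setting in Figure 2. Larger windows trade label precision for coverage, and the frequent k-grams are candidates for the "attention motifs" mentioned in the abstract.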