Beyond Self-learned Attention: Mitigating Attention Bias in Transformer-based Models Using Attention Guidance

Jiri Gesi; Iftekhar Ahmed

TL;DR

The paper identifies attention bias in fine-tuned Transformer-based language models for software engineering, showing that attention weights disproportionately focus on certain syntax tokens and AST elements during correct predictions. It introduces SyntaGuid, a syntax-pattern attention guiding mechanism that combines masked language modeling (MLM) with a dedicated SAG loss to steer attention toward critical code tokens and AST structures during fine-tuning. Empirical results across cloze tests, code clone detection, and code translation show that SyntaGuid improves overall performance by up to 3.25% and fixes up to 28.3% of previously incorrect predictions, with significant gains over baseline and existing attention-guiding methods. The work has practical implications for model interpretability and robustness in software engineering tasks and provides data and replication resources for further research.

Abstract

Transformer-based models have demonstrated considerable potential for source code modeling tasks in software engineering. However, they are limited by their sole dependence on automatically learned self-attention weights. Previous studies have shown that these models overemphasize delimiters added by tokenizers (e.g., [CLS], [SEP]), which may lead to overlooking essential information in the original input source code. To address this challenge, we introduce SyntaGuid, a novel approach that builds on the observation that attention weights tend to be biased towards specific source code syntax tokens and abstract syntax tree (AST) elements in fine-tuned language models when they make correct predictions. SyntaGuid guides attention-weight learning, leading to improved model performance on various software engineering tasks. We evaluate the effectiveness of SyntaGuid on multiple tasks and demonstrate that it outperforms existing state-of-the-art models in overall performance without requiring additional data. Experimental results show that SyntaGuid can improve overall performance by up to 3.25% and fix up to 28.3% of wrong predictions. Our work represents the first attempt to guide the attention of Transformer-based models towards critical source code tokens during fine-tuning, highlighting the potential for enhancing Transformer-based models in software engineering.
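
The idea of guiding attention toward selected syntax tokens can be illustrated with a minimal sketch. It is not the loss defined in the paper: the binary pattern, the penalty (attention mass falling outside the guided tokens), the weighting factor `alpha`, and all function names are assumptions chosen for illustration. The syntax type list is the one given for the snippet `sum = num1 + num2;` in the caption of Figure 2.

```python
# Illustrative sketch only: the pattern, penalty, and weighting below are
# assumptions for exposition, not the SAG loss defined in the paper.

def guidance_mask(syntax_types, guided_types):
    """Return a 0/1 list marking tokens whose syntax type should receive attention."""
    return [1 if t in guided_types else 0 for t in syntax_types]


def guidance_penalty(attention, mask):
    """Average, over query rows, of the attention mass that falls outside the mask.

    `attention` is a list of rows, each a probability distribution over
    key positions (as a softmax would produce).
    """
    outside = [
        sum(w for w, m in zip(row, mask) if m == 0)
        for row in attention
    ]
    return sum(outside) / len(outside)


def combined_loss(mlm_loss, penalty, alpha=0.5):
    """Hypothetical weighted sum of an MLM loss and the guidance penalty."""
    return mlm_loss + alpha * penalty


# Syntax types for "<s> sum = num1 + num2; </s>" (from the Figure 2 caption).
syntax_types = ["[CLS]", "identifier", "operator", "identifier",
                "operator", "identifier", "separator", "[SEP]"]
mask = guidance_mask(syntax_types, {"identifier"})

n = len(syntax_types)

# Attention spread evenly over all eight tokens.
uniform = [[1.0 / n] * n for _ in range(n)]

# Attention concentrated on the tokenizer delimiters, the bias the abstract describes.
delimiter_heavy = [[0.5, 0, 0, 0, 0, 0, 0, 0.5] for _ in range(n)]

uniform_penalty = guidance_penalty(uniform, mask)
delimiter_penalty = guidance_penalty(delimiter_heavy, mask)
```

Under these assumptions, attention that collapses onto `[CLS]` and `[SEP]` receives a larger penalty than uniform attention, so adding the penalty to the MLM objective would push weight back toward the guided tokens during fine-tuning.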

Paper Structure (32 sections, 9 equations, 7 figures, 3 tables)

Introduction
Background
Self-attention-based Transformer Model
Pre-training Language Model
Pre-trained models for source code
Empirical Analysis of Attention Weight Assignment Bias
Study Design
Experiment tasks
Selected syntax types and AST statements
Attention weight bias analysis
Attention bias impact analysis
SyntaGuid: Syntax Pattern Attention Guiding
Masked Language Modeling
Syntax Pattern Attention Guiding
Syntax attention patterns
...and 17 more sections

Figures (7)

Figure 1: Illustration of attention guiding mechanism
Figure 2: Example attention guiding patterns for code snippet "<s> sum = num1 + num2; </s>", whose syntax type list is: [[CLS], identifier, operator, identifier, operator, identifier, separator, [SEP]].
Figure 3: Comparison of attention weights on syntax tokens: Correctly Predicted vs. Mis-predicted groups
Figure 4: Comparison of attention weights on AST statements: Correctly Predicted vs. Mis-predicted groups
Figure 5: Comparison of model accuracy based on Syntax attention weight: Low vs. High attention weights
...and 2 more figures
