Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code

Yujia Chen, Cuiyun Gao, Zezhou Yang, Hongyu Zhang, Qing Liao

TL;DR

EXPO tackles the difficulty of modeling long-range code with pre-trained language models by introducing a dual-memory framework: Bridge Memory and Hint Memory. Bridge Memory preserves contextual continuity across long sequences by propagating information between fixed-length snippets, while Hint Memory stores global code elements in a hint bank and retrieves them via a kNN attention layer to enrich local representations. Across five PLMs and two code tasks (API recommendation and vulnerability detection), EXPO yields substantial gains, outperforming strong baselines and approaching or surpassing some large language models on key metrics. This approach provides a practical, scalable path to extend PLMs for real-world codebases with long inputs, with broad implications for software engineering tooling and code understanding research.
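The snippet-to-snippet propagation described for Bridge Memory can be pictured as a simple recurrence. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `split_into_snippets`, `encode_with_bridge`, and `toy_encoder` are hypothetical names, and the toy encoder stands in for a real PLM by remembering imported names as its "memory".

```python
from typing import Callable, List, Sequence, Set, Tuple


def split_into_snippets(tokens: Sequence, snippet_len: int) -> List[list]:
    """Cut a long token sequence into consecutive fixed-length snippets."""
    if snippet_len <= 0:
        raise ValueError("snippet_len must be positive")
    return [list(tokens[i:i + snippet_len])
            for i in range(0, len(tokens), snippet_len)]


def encode_with_bridge(tokens: Sequence, snippet_len: int,
                       encode_snippet: Callable, initial_memory):
    """Encode snippets in order, handing each one the memory left by its predecessor."""
    memory = initial_memory
    outputs = []
    for snippet in split_into_snippets(tokens, snippet_len):
        # The encoder sees the current snippet plus the carried-over memory.
        output, memory = encode_snippet(snippet, memory)
        outputs.append(output)
    return outputs, memory


def toy_encoder(snippet: List[str], seen_imports: Set[str]) -> Tuple[List[str], Set[str]]:
    """Hypothetical stand-in for a PLM: tracks imported names across snippets."""
    updated = set(seen_imports)
    resolved = []
    for tok in snippet:
        if tok.startswith("import:"):
            updated.add(tok[len("import:"):])
        elif tok in updated:
            # This token refers to a name imported earlier, possibly in a previous snippet.
            resolved.append(tok)
    return resolved, updated


tokens = ["import:os", "x", "=", "1", "os", ".", "getcwd"]
outputs, final_memory = encode_with_bridge(tokens, 4, toy_encoder, set())
print(outputs)       # the second snippet resolves "os" only because memory was carried over
print(final_memory)
```

Without the carried memory, the second snippet would have no way to link `os` back to the import in the first snippet, which is the kind of broken continuity the TL;DR describes.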

Abstract

In the field of code intelligence, effectively modeling long-range code poses a significant challenge. Existing pre-trained language models (PLMs) such as UniXcoder have achieved remarkable success, but they still face difficulties with long code inputs. This is mainly due to their limited capacity to maintain contextual continuity and memorize key information over long-range code. To alleviate these difficulties, we propose EXPO, a framework for EXtending Pre-trained language models for lOng-range code. EXPO incorporates two memory mechanisms proposed in this paper: Bridge Memory and Hint Memory. Bridge Memory uses a tagging mechanism to connect disparate snippets of long-range code, helping the model maintain contextual coherence. Hint Memory focuses on crucial code elements throughout the global context, such as package imports, by integrating a kNN attention layer to adaptively select the relevant code elements. This dual-memory approach bridges the gap between understanding local code snippets and maintaining global code coherence, thereby enhancing the model's overall comprehension of long code sequences. We validate the effectiveness of EXPO on five popular pre-trained language models, such as UniXcoder, and two code intelligence tasks: API recommendation and vulnerability detection. Experimental results demonstrate that EXPO significantly improves the pre-trained language models.
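To make the idea of a kNN attention layer over stored hints concrete, here is a minimal sketch in plain Python. It assumes scaled dot-product scoring and a softmax restricted to the top-k hints; the abstract does not specify these details, so they are assumptions, and `knn_attention`, `hint_keys`, and `hint_values` are hypothetical names chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def knn_attention(query: Vector, hint_keys: Sequence[Vector],
                  hint_values: Sequence[Vector], k: int) -> Tuple[List[float], List[int]]:
    """Attend only over the k stored hints whose keys score highest against the query."""
    if len(hint_keys) != len(hint_values):
        raise ValueError("hint_keys and hint_values must have the same length")
    if not hint_keys:
        raise ValueError("the hint bank is empty")
    k = max(1, min(k, len(hint_keys)))

    # Assumed scoring function: scaled dot product between query and each hint key.
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in hint_keys]

    # Keep the indices of the k best-scoring hints.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Numerically stable softmax over the selected hints only.
    max_score = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - max_score) for i in top]
    total = sum(weights)

    # Weighted sum of the selected hint values.
    dim = len(hint_values[0])
    out = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            out[d] += (w / total) * hint_values[i][d]
    return out, top


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
out1, top1 = knn_attention([1.0, 0.0], keys, values, k=1)
out2, top2 = knn_attention([1.0, 0.0], keys, values, k=2)
print(top1, out1)
print(top2, out2)
```

Restricting attention to the k nearest hints keeps the cost of each lookup bounded even when the hint bank grows with the length of the input, at the price of ignoring lower-scoring entries.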

Paper Structure (32 sections, 4 equations, 5 figures, 5 tables)

This paper contains 32 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An example of vulnerability detection in a long-range code sequence.
  • Figure 2: The overview of EXPO.
  • Figure 3: Parameter analysis of (a)(b) $\mathbf{m}$ and (c)(d) $\mathbf{K}$ with EXPO (CodeT5) and EXPO (UniXcoder) for vulnerability detection.
  • Figure 4: An example of predictions of EXPO (UniXcoder) and the corresponding base model for the API recommendation task.
  • Figure 5: The performance of EXPO (UniXcoder) with differing input code lengths for vulnerability detection. The bars and lines indicate the results of EXPO and base models, respectively.