AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection

Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji

TL;DR

AdaCCD addresses the lack of annotated data for cross-language code clone detection by combining language-agnostic representations from pre-trained programming-language models with an Adaptively Refined Contrastive Learning framework. It discovers semantically meaningful contrasts in unlabeled target code via clustering and neighborhood search, and reinforces them with semantic-preserving transformations (back translation and identifier renaming) under a tunable adaptive balance. An iterative bootstrapping regime uses the enhanced model to refine contrasts and expand cross-lingual coverage, yielding strong gains across five languages with CodeBERT and GraphCodeBERT backbones, and achieving performance comparable to supervised fine-tuning when modest labeled data is available. This approach reduces annotation bottlenecks in multilingual software engineering and offers a scalable path to robust cross-language clone detection.

Abstract

Code Clone Detection, which aims to retrieve functionally similar programs from large code bases, has been attracting increasing attention. Modern software often involves a diverse range of programming languages. However, current code clone detection methods are generally limited to a few popular programming languages because of insufficient annotated data and constraints in their model design. To address these issues, we present AdaCCD, a novel cross-lingual adaptation method that can detect cloned code in a new language without annotations in that language. AdaCCD leverages language-agnostic code representations from pre-trained programming language models and proposes an Adaptively Refined Contrastive Learning framework to transfer knowledge from resource-rich languages to resource-poor languages. We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages. AdaCCD achieves significant improvements over other baselines and achieves performance comparable to supervised fine-tuning.
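To make the idea of discovering contrasts in unlabeled code more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: it assumes program embeddings are already available as rows of a matrix, picks each program's nearest neighbor as its positive, and scores the result with a standard InfoNCE loss. The function names `nearest_neighbor_positives` and `info_nce`, the `temperature` value, and the use of InfoNCE itself are illustrative assumptions; the clustering step, the semantic-preserving transformations, and the adaptive balance are left out.

```python
import numpy as np

def nearest_neighbor_positives(z: np.ndarray) -> np.ndarray:
    """For each row of z, return the index of its most similar other row."""
    normed = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    return sims.argmax(axis=1)

def info_nce(z: np.ndarray, positives: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss: each anchor's positive versus all other rows as candidates.

    Assumption: this is a generic contrastive objective, not necessarily the
    exact loss used by AdaCCD.
    """
    normed = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = (normed @ normed.T) / temperature
    np.fill_diagonal(logits, -np.inf)  # an anchor is not its own candidate
    # Numerically stable log-softmax over the candidates of each anchor.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(z)), positives].mean())

# Toy embeddings: two tight groups standing in for two functionalities.
z = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
positives = nearest_neighbor_positives(z)
loss = info_nce(z, positives)
```

In this toy setup each program is paired with the other member of its group, and the loss is lower than it would be if positives were drawn from the wrong group, which is the signal a contrastive objective uses to pull functionally similar programs together.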

Paper Structure (31 sections, 5 equations, 4 figures, 4 tables)

This paper contains 31 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Zero-Shot Adaptation Results. MAP@R: Mean Average Precision @ R, which measures how accurately a model can retrieve similar items given a query.
  • Figure 2: Overview of AdaCCD, illustrated with an example of adapting to the Rust language.
  • Figure 3: NMI (normalized mutual information) evaluated at the end of each epoch.
  • Figure 4: Sensitivity test of $\alpha_0$ under the GCB-BT setting. The sensitivity test is conducted by adapting from POJ-104.
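
The MAP@R metric named in the Figure 1 caption can be sketched as follows. This is a minimal version under assumptions: it follows the common definition of MAP@R, where R is the number of other items sharing the query's label and the average precision is taken over the top R retrievals, and it ranks candidates by cosine similarity. The function name `map_at_r` and the choice to skip queries with no clone are illustrative; the paper's exact evaluation protocol is not given in this excerpt.

```python
import numpy as np

def map_at_r(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Mean Average Precision @ R over all queries, using cosine similarity."""
    # Normalise rows so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a query never retrieves itself

    scores = []
    for q in range(len(labels)):
        same = labels == labels[q]
        r = int(same.sum()) - 1  # R = number of other items in the query's class
        if r == 0:
            continue  # assumption: queries without any clone are skipped
        top_r = np.argsort(-sims[q], kind="stable")[:r]
        hits = same[top_r].astype(float)
        # Precision at rank i, counted only at ranks where the retrieval is correct.
        precision_at_i = np.cumsum(hits) / np.arange(1, r + 1)
        scores.append(float((precision_at_i * hits).sum() / r))
    return float(np.mean(scores)) if scores else 0.0

# Toy example: two functionality classes that are cleanly separated.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
lab = np.array([0, 0, 1, 1])
perfect = map_at_r(emb, lab)
```

Because only the top R neighbors are scored, the metric rewards placing all true clones ahead of every non-clone, which makes it stricter than recall-style retrieval measures.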