The Struggles of LLMs in Cross-lingual Code Clone Detection

Micheline Bénédicte Moumoula, Abdoul Kader Kabore, Jacques Klein, Tegawendé Bissyande

TL;DR

This work evaluates five LLMs and embedding-based approaches for cross-lingual code clone detection on XLCoST and CodeNet. LLMs can achieve very high accuracy on simple cases, but embeddings paired with a classifier achieve superior, more consistent performance across both datasets. Prompt engineering, especially an improved simple prompt, enhances LLM reasoning, yet LLMs still lag behind embedding-based methods. The study highlights the strengths of language-agnostic code representations and raises questions about LLMs’ true understanding of cross-lingual clones, data leakage, and dataset realism. In practice, robust embeddings plus classifiers offer a strong, scalable solution for multilingual code analysis, with LLMs serving as a complementary tool when guided by explicit task definitions. These findings point future work toward richer representations and more robust benchmarking for cross-language software engineering tasks.

Abstract

With multiple programming languages involved in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic and proposed promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle a variety of tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five LLMs and eight prompts for the identification of cross-lingual code clones, and we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess how effective its representations are for classifying clone and non-clone pairs. Both the LLMs and the embedding model are evaluated on two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, on straightforward programming examples. However, they perform less well on programs associated with complex programming challenges, and they do not necessarily understand the meaning of "code clones" in a cross-lingual setting. We show that embedding models, which represent code fragments from different programming languages in the same representation space, enable the training of a basic classifier that outperforms all LLMs by ~1 and ~20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, the representations produced by embedding models are well suited to achieving state-of-the-art performance in cross-lingual code clone detection.

Paper Structure

This paper contains 20 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Experimental workflow for cross-lingual code clone detection using LLMs vs Classification Models
  • Figure 2: Performance of GPT-3.5-Turbo's similarity-score predictions on cross-lingual code clone detection
  • Figure 3: In (a) and (b), we first measure the performance of Text-embedding-3-large combined with cosine similarity at different thresholds for each dataset. Using a threshold of 0.5, we achieved an F1-score of 0.98 on the XLCoST dataset and 0.74 on the CodeNet dataset.
  • Figure 4: Impact of the binary classification on cross-lingual code clone detection using code embedding
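
The thresholded cosine-similarity baseline described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors are made-up stand-ins for real embeddings (which would come from a model such as Text-embedding-3-large), and the helper names `cosine_similarity`, `predict_clone`, and `f1_score` are hypothetical. Only the 0.5 threshold is taken from the caption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def predict_clone(emb_a, emb_b, threshold=0.5):
    """Label a pair as a clone when similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


def f1_score(labels, predictions):
    """F1 for the positive (clone) class from boolean labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumption): each entry is (embedding of fragment A,
# embedding of fragment B, ground-truth clone label).
pairs = [
    ([1.0, 0.0, 0.2], [0.9, 0.1, 0.3], True),
    ([0.0, 1.0, 0.0], [0.1, 0.9, 0.1], True),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], False),
    ([0.5, 0.5, 0.0], [0.0, 0.1, 1.0], False),
]

labels = [label for _, _, label in pairs]
predictions = [predict_clone(a, b) for a, b, _ in pairs]
print("F1 at threshold 0.5:", f1_score(labels, predictions))
```

Sweeping `threshold` over a range of values and recomputing the F1 score at each step would give the kind of per-threshold curve the caption refers to. The classifier-based variant in Figure 4 replaces the fixed threshold with a model trained on the embedding pairs.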