Development and Benchmarking of Multilingual Code Clone Detector

Wenqing Zhu; Norihiro Yoshida; Toshihiro Kamiya; Eunjong Choi; Hiroaki Takada

TL;DR

This paper proposes a multilingual code block extraction method based on ANTLR parser generation and implements MSCCD, a multilingual code clone detector that supports the largest number of languages among currently available detectors and can detect Type-3 code clones.

Abstract

The diversity of programming languages is growing, making the language extensibility of code clone detectors crucial. However, extending most existing clone detectors to a new language is challenging because the source code handler must be modified, which requires specialist-level knowledge of the target language and is time-consuming. Multilingual code clone detectors make it easier to add support for a new language because they require only the syntax information of the target language. To address the shortcomings of existing multilingual detectors in language scalability and detection performance, we propose a multilingual code block extraction method based on ANTLR parser generation and implement a multilingual code clone detector (MSCCD), which supports the largest number of languages among currently available detectors and can detect Type-3 code clones. We follow the methodology of previous studies to evaluate its detection performance on Java. Compared with ten state-of-the-art detectors, MSCCD performs at an average level while supporting a significantly larger number of languages. Furthermore, we propose the first multilingual syntactic code clone evaluation benchmark, based on the CodeNet database. Our results reveal that, even with the same detection approach, performance can vary markedly depending on the language of the source code under investigation. Overall, MSCCD is the most balanced of the evaluated tools when both detection performance and language extensibility are considered.

Paper Structure (35 sections, 4 equations, 9 figures, 13 tables, 2 algorithms)

Introduction
Background
Terminology
Multilingual Code Clone Detection
Proposed Tool: MSCCD
Overview of MSCCD
Code Block Extraction using Parse Tree
Clone Detection
Evaluation using BigCloneBench
Recall
Precision
Execution Time
Multilingually Measuring Recall&Precision of Code Clone Detectors
Introduction of the benchmark
Benchmarking of Recall
...and 20 more sections

Figures (9)

Figure 1: General clone detection process
Figure 2: Overview of MSCCD
Figure 3: Multilingual Code Block Partition by Parse Tree MSCCD
Figure 4: Simplification of a Parse Tree
Figure 5: Process of Evaluating Recall&Precision in Various Languages
...and 4 more figures
