When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM

Michael Dubem Igbomezie, Phuong T. Nguyen, Davide Di Ruscio

TL;DR

The paper tackles code-comment coherence detection: comments and their corresponding code can diverge as the code evolves, which hinders understanding. It proposes Co3D, a pragmatic approach that compares three configurations, C1 (word2vec with a Simple RNN), C2 (word2vec with an LSTM), and C3 (CodeBERT-based tokenization), to balance accuracy against computational cost. Experiments on a real-world Java dataset show that the lightweight C1/C2 setups achieve competitive accuracy and often outperform baselines such as SVM, and even CodeBERT on certain metrics, while CodeBERT remains strong but costlier. The results suggest that careful model selection and data encoding can yield effective, energy-efficient code-comment coherence detection, with implications for practical SE tooling and future comparisons with LLMs.

Abstract

Code comments play a crucial role in software development: they give programmers practical information that helps them better understand the intent and semantics of the underlying code. Nevertheless, developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts. Such a discrepancy may cause misunderstanding and confusion among developers, impeding activities such as code comprehension and maintenance. Thus, it is crucial to determine whether, given a code snippet, its corresponding comment is coherent and accurately reflects the intent behind the code. Unfortunately, existing approaches to this problem, while achieving encouraging performance, either rely on heavily pre-trained models or treat input data as plain text, neglecting intrinsic features of comments and code such as word order and synonyms. This work presents Co3D, a practical approach to detecting code-comment coherence. When predicting coherence in code-comment pairs, we take into account both the meaning of individual words and their sequential order in the text. We deployed three configurations: Gensim word2vec encoding combined with a simple recurrent neural network, Gensim word2vec encoding combined with an LSTM model, and CodeBERT. The experimental results show that Co3D obtains promising prediction performance, outperforming well-established baselines. We conclude that, depending on the context, a simple architecture can deliver satisfactory predictions.
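
To illustrate why the abstract stresses the sequential order of words, the following minimal sketch contrasts an order-insensitive encoding (averaging word vectors) with a plain recurrent update of the kind a simple RNN performs. It is not the authors' implementation: the two-dimensional embeddings stand in for trained word2vec vectors, and the weight matrices `W_IN` and `W_REC` are fixed, untrained values chosen only for illustration.

```python
import math

# Hypothetical 2-d embeddings standing in for trained word2vec vectors.
EMBEDDINGS = {
    "return": [0.9, 0.1],
    "the": [0.1, 0.2],
    "sum": [0.3, 0.8],
    "of": [0.2, 0.1],
    "list": [0.7, 0.6],
}

# Fixed toy weights (assumptions, not learned parameters).
W_IN = [[0.5, -0.3], [0.2, 0.8]]    # input-to-hidden
W_REC = [[0.1, 0.4], [-0.6, 0.3]]   # hidden-to-hidden


def embed(tokens):
    """Map tokens to vectors; unknown tokens become the zero vector."""
    return [EMBEDDINGS.get(t, [0.0, 0.0]) for t in tokens]


def mean_pool(vectors):
    """Order-insensitive encoding: average the word vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def simple_rnn(vectors):
    """Order-sensitive encoding: h_t = tanh(W_IN x_t + W_REC h_{t-1})."""
    hidden = [0.0, 0.0]
    for x in vectors:
        from_input = matvec(W_IN, x)
        from_state = matvec(W_REC, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    return hidden


original = ["return", "the", "sum", "of", "list"]
reordered = list(reversed(original))

pooled_original = mean_pool(embed(original))
pooled_reordered = mean_pool(embed(reordered))
rnn_original = simple_rnn(embed(original))
rnn_reordered = simple_rnn(embed(reordered))

print("mean pooling:", pooled_original, pooled_reordered)
print("simple RNN:  ", rnn_original, rnn_reordered)
```

Averaging yields the same vector for both token orders (up to floating-point rounding), whereas the recurrent state differs, because each step depends on everything read so far. An LSTM adds gating on top of this recurrence but shares the same order sensitivity.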

Paper Structure (18 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Examples of coherent and incoherent code-comment pairs (extracted from the dataset of Corazza et al. [DBLP:journals/sqj/CorazzaMS18]).
  • Figure 2: The overall architecture.
  • Figure 3: A selected sample code-comment pair used for the sake of model interpretation.
  • Figure 4: Visualization of the importance of each word in the final predicted class.