CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement

Maria Dziuba, Valentin Malykh

TL;DR

CIDRe introduces a reference-free, four-component criterion for code-comment quality that jointly measures Completeness, Informativeness, Description Length, and Relevance to assess structured comments. The method combines transformer-based informativeness weighting, multilingual embeddings, and a SIDE-inspired relevance module, producing a score in $[0,1]$ that feeds a binary good/bad classifier. On the StRuCom dataset and an independently labeled test set, CIDRe outperforms existing metrics in cross-entropy calibration, with an ablation showing full component synergy and a side-by-side evaluation demonstrating cross-language and cross-model effectiveness when filtering data. The work demonstrates practical gains in generation quality for Russian-language code documentation and points to future multilingual extensions and broader applicability.

Abstract

Effective generation of structured code comments requires robust quality metrics for dataset curation, yet existing approaches (SIDE, MIDQ, STASIS) suffer from limited code-comment analysis. We propose CIDRe, a language-agnostic, reference-free quality criterion combining four synergistic aspects: (1) relevance (code-comment semantic alignment), (2) informativeness (functional coverage), (3) completeness (presence of all structure sections), and (4) description length (detail sufficiency). We validate our criterion on a manually annotated dataset. Experiments demonstrate CIDRe's superiority over existing metrics, improving on them in cross-entropy evaluation. When CIDRe is applied to filter comments, models fine-tuned on the CIDRe-filtered data show statistically significant quality gains in GPT-4o-mini assessments.

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Criteria pipeline. The four proposed quality measures are computed for a list of "code-comment" pairs; the resulting matrix is the input to a binary classifier (class $1$: "good" comments, class $0$: "bad" comments), which in turn outputs a vector of probabilities that the comments belong to class $1$. Thus, the values of our criterion are real numbers in the interval $[0, 1]$.
  • Figure 2: An example of completeness calculation
  • Figure 3: Visualization of the weighting of terms by importance using different models
  • Figure 4: An example of informativeness calculation
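
The pipeline described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source only says "a binary classifier", so the use of logistic regression here is an assumption, as are all function names and the toy feature values. Each row holds the four measures for one code-comment pair, and the classifier's probability for class 1 is taken as the criterion value. The `cross_entropy` helper shows one way a score in $[0, 1]$ could be compared against binary human labels, assuming the standard binary cross-entropy formula.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on binary cross-entropy.

    Logistic regression is an assumed stand-in for the unspecified classifier.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def criterion_scores(X, w, b):
    """Probability of class 1 ("good") for each row; always within [0, 1]."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def cross_entropy(scores, labels, eps=1e-12):
    """Mean binary cross-entropy between scores in [0, 1] and 0/1 labels."""
    total = 0.0
    for s, t in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(s) + (1 - t) * math.log(1.0 - s)
    return total / len(scores)

# Toy feature matrix (made-up values). Columns, in order:
# completeness, informativeness, description length, relevance.
X = [
    [1.0, 0.8, 0.9, 0.9],
    [0.9, 0.7, 0.6, 0.8],
    [0.2, 0.3, 0.1, 0.4],
    [0.1, 0.2, 0.3, 0.2],
]
y = [1, 1, 0, 0]  # hypothetical labels: 1 = "good", 0 = "bad"

w, b = fit_logistic(X, y)
scores = criterion_scores(X, w, b)
print([round(s, 3) for s in scores])
print(round(cross_entropy(scores, y), 4))
```

In practice the classifier would be fitted on one labeled set and scored on a separate one; the toy example reuses the same four rows only to keep the sketch short.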