CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection

Qingyu Zhang, Puzhuo Liu, Peng Di, Chenxiong Qian

TL;DR

This work addresses message-code inconsistency (MCI), where a commit message does not match its code change, by introducing CodeFuse-CommitEval, the first benchmark tailored for MCI detection using LLMs. It constructs a diverse, type-aware dataset by mutating consistent ApacheCM commits with seven rules and validating samples via two-fold verification, enabling rigorous evaluation of six open-source LLMs under vanilla, few-shot, chain-of-thought, and extended-context prompts. Empirical results show that LLMs detect inconsistent commits more reliably than consistent ones (average Recall ≈86%, Precision ≈80%), with gpt-oss-20B delivering the best overall performance at the cost of higher token usage; augmentation strategies yield heterogeneous effects depending on model size and inconsistency type. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, while intent-level purpose inconsistencies remain challenging, underscoring the need for richer context and more balanced data to capture high-level semantic gaps. Overall, CodeFuse-CommitEval provides a rigorous foundation for measuring, comparing, and driving improvements in MCI detection for software engineering applications.

Abstract

Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, a problem known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet no dedicated benchmark exists to evaluate models for MCI detection. We introduce CodeFuse-CommitEval, the first benchmark designed for MCI detection using large language models (LLMs). Built on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought, and extended context. Results show models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but uses over twice the tokens of others. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot improves accuracy and reduces token use, yet increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of recall and higher token consumption. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, but lower accuracy and higher token cost for intent-level "purpose" inconsistencies. CodeFuse-CommitEval provides a rigorous foundation for measuring, comparing, and advancing MCI detection, highlighting the need for richer context and balanced data to capture high-level semantic gaps.
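The three headline metrics can be read off a confusion matrix. The sketch below is illustrative only and is not the authors' evaluation code: it assumes that "inconsistent" is the positive class (which matches the reading that high recall with lower specificity means inconsistent commits are detected more reliably than consistent ones), and the names `Confusion`, `count_outcomes`, and `mci_metrics`, as well as the toy labels, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int = 0  # inconsistent commits flagged as inconsistent
    fp: int = 0  # consistent commits wrongly flagged as inconsistent
    tn: int = 0  # consistent commits accepted as consistent
    fn: int = 0  # inconsistent commits that were missed


def count_outcomes(labels, predictions):
    """Tally outcomes; True means 'inconsistent' (assumed positive class)."""
    c = Confusion()
    for truth, pred in zip(labels, predictions):
        if truth and pred:
            c.tp += 1
        elif truth and not pred:
            c.fn += 1
        elif not truth and pred:
            c.fp += 1
        else:
            c.tn += 1
    return c


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a class is absent from the data.
    return numerator / denominator if denominator else 0.0


def mci_metrics(c):
    """Recall, precision, and specificity from a confusion matrix."""
    return {
        "recall": _ratio(c.tp, c.tp + c.fn),       # share of inconsistent commits caught
        "precision": _ratio(c.tp, c.tp + c.fp),    # share of flags that are correct
        "specificity": _ratio(c.tn, c.tn + c.fp),  # share of consistent commits accepted
    }


# Toy data (made up): four inconsistent and three consistent commits.
labels = [True, True, True, True, False, False, False]
predictions = [True, True, True, False, True, False, False]
scores = mci_metrics(count_outcomes(labels, predictions))
print(scores)
```

In this toy run, recall and precision are both 0.75 while specificity is 2/3, the same qualitative pattern the abstract reports: a model that leans toward answering "inconsistent" scores well on recall but misclassifies more of the consistent commits.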

Paper Structure

This paper contains 34 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Pipeline of CodeFuse-CommitEval.
  • Figure 2: CodeFuse-CommitEval's positive sample generation stage.
  • Figure 3: Statistics of the CodeFuse-CommitEval dataset: message character count, diff file count, and diff lines of code.
  • Figure 4: Distribution of correct-answer counts across the six target models.
  • Figure 5: Distribution of correct-answer counts across the six target models under the three augmentation strategies.