Hold On! Is My Feedback Useful? Evaluating the Usefulness of Code Review Comments
Sharif Ahmed, Nasir U. Eisty
TL;DR
This paper predicts the usefulness of Code Review Comments (CR comments) using both handcrafted textual features and featureless approaches across three datasets drawn from commercial and open-source projects. It introduces new features capturing code elements, voice and tone, text structure, and jargon density, and evaluates Bag-of-Words with TF-IDF, pre-trained embeddings, and fine-tuned transformer models, including GPT-4o. The findings show that GPT-4o and Bag-of-Words with TF-IDF often outperform the baselines in different settings, but cross-project generalization remains limited; SHAP explanations are used to aid interpretability. The work advances understanding of which textual signals indicate usefulness, offers practical guidance for developers writing CR comments, and motivates the creation of larger, more diverse datasets for robust cross-domain evaluation.
Abstract
Context: In collaborative software development, the peer code review process is beneficial only when reviewers provide useful comments. Objective: This paper investigates the usefulness of Code Review Comments (CR comments) through textual feature-based and featureless approaches. Method: We select three available datasets from both open-source and commercial projects. Additionally, we introduce new features from software and non-software domains. Moreover, we experiment with the presence of jargon, voice, and code in CR comments and classify the usefulness of CR comments through featurization, bag-of-words, and transfer learning techniques. Results: Our models outperform the baseline, achieving state-of-the-art performance. Furthermore, the results demonstrate that either the large commercial LLM, GPT-4o, or the simple non-commercial featureless approach, Bag-of-Words with TF-IDF, is more effective for predicting the usefulness of CR comments. Conclusion: The significant improvement in predicting usefulness solely from CR comments advances research on this task. Our analyses characterize the similarities and differences among domains, projects, datasets, models, and features for predicting the usefulness of CR comments.
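To make the featureless Bag-of-Words with TF-IDF idea concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. The toy comments and labels are hypothetical, the smoothed IDF formula is one common variant, and a nearest-centroid rule stands in for whatever classifier the paper actually trains on the TF-IDF vectors.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real pipelines clean text more carefully).
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_vector(doc, idf):
    """L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def dot(a, b):
    # Sparse dot product between two token-to-weight dictionaries.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def centroid(vectors):
    # Mean of sparse vectors (not renormalized).
    total = Counter()
    for v in vectors:
        for tok, w in v.items():
            total[tok] += w
    return {tok: w / len(vectors) for tok, w in total.items()}


# Hypothetical toy data: 1 = useful, 0 = not useful.
train = [
    ("this loop leaks the file handle, close it in a finally block", 1),
    ("null check missing before calling getName here", 1),
    ("looks good to me", 0),
    ("thanks, nice work", 0),
]

docs = [tokenize(text) for text, _ in train]
idf = fit_idf(docs)
vectors = [tfidf_vector(doc, idf) for doc in docs]
centroids = {
    label: centroid([v for v, (_, y) in zip(vectors, train) if y == label])
    for label in (0, 1)
}


def predict(text):
    """Return the label whose class centroid scores highest against the comment."""
    vec = tfidf_vector(tokenize(text), idf)
    return max(centroids, key=lambda label: dot(vec, centroids[label]))
```

Calling `predict` on a new comment returns the label of the closest class centroid. With four training comments this only illustrates the mechanics; it says nothing about the accuracy reported in the paper.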
