NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification
Hanna Abi Akl
TL;DR
The paper tackles data scarcity in code comment classification and proposes a neuro-symbolic workflow that combines an LLM agent with a symbolic rule-based system to generate labeled synthetic data for C code comments. Semantic rules, explicit algorithms, and a Python data generator control data quality and labeling, addressing weaknesses of purely LLM-driven generation. Experiments show that augmenting the baseline with 5000 synthetic samples yields consistent Macro-F1 gains across the RF, VC, and NN models, with the Neural Network (NN) achieving 91.412% Macro-F1 on the augmented data. The approach demonstrates the viability of controlled synthetic data for improving code-related NLP tasks and suggests scalability to larger datasets and other domains.
Abstract
We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language. We also show how generating controlled synthetic data using this workflow fixes some of the notable weaknesses of LLM-based generation and increases the performance of classical machine learning models on the code comment classification task. Our best model, a Neural Network, achieves a Macro-F1 score of 91.412%, an increase of 1.033% after data augmentation.
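Since the reported results are Macro-F1 scores, a minimal sketch of how that metric is computed may help in reading them. Macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many samples it has. The function name `macro_f1` and the example labels below are illustrative assumptions, not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    # Consider every label that appears in either the gold or predicted labels.
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, used only to illustrate the calculation.
gold = ["useful", "useful", "not_useful", "not_useful"]
pred = ["useful", "not_useful", "not_useful", "not_useful"]
score = macro_f1(gold, pred)
```

In this toy example the per-class F1 scores are 2/3 and 4/5, so the Macro-F1 is their mean, 11/15.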
