NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification

Hanna Abi Akl

TL;DR

The paper tackles data scarcity in code comment classification and proposes a neuro-symbolic workflow that combines an LLM agent with a symbolic rule-based system to generate labeled synthetic data for C code comments. Semantic rules, explicit algorithms, and a Python data generator control data quality and labeling, addressing weaknesses of purely LLM-driven data generation. Experiments show that augmenting the baseline with 5,000 synthetic samples yields consistent Macro-F1 gains across the RF, VC, and NN models, with NN achieving a Macro-F1 of 91.412% on the augmented data. The approach demonstrates the viability of controlled synthetic data for improving code-related NLP tasks and suggests scalability to larger datasets and other domains.

Abstract

We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language. We also show how generating controlled synthetic data using this workflow fixes some of the notable weaknesses of LLM-based generation and increases the performance of classical machine learning models on the code comment classification task. Our best model, a Neural Network, achieves a Macro-F1 score of 91.412% with an increase of 1.033% after data augmentation.
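
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Macro-F1 score is computed: the F1 score is calculated per class and then averaged without weighting by class size. The function name `macro_f1`, the label strings, and the toy predictions are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives, and false negatives per class.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0.
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    return sum(scores) / len(scores)


# Hypothetical labels and predictions, for illustration only.
y_true = ["useful", "useful", "not_useful", "not_useful"]
y_pred = ["useful", "not_useful", "not_useful", "not_useful"]
score = macro_f1(y_true, y_pred)
print(f"Macro-F1: {score:.4f}")
```

Because every class contributes equally to the average, Macro-F1 penalizes a model that performs well only on the majority class, which is one reason it is a common choice for classification tasks with uneven label distributions.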

Paper Structure (19 sections, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: Example of labeled code comment data
  • Figure 2: High-level architecture of neuro-symbolic synthetic data generation workflow
  • Figure 3: Example of rule-based prompting using semantic decomposition
  • Figure 4: Example of valid labeled code comment data samples generated by ChatGPT
  • Figure 5: Python script generation by ChatGPT