NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification

Hanna Abi Akl

TL;DR

The paper tackles data scarcity in code comment classification with a neuro-symbolic workflow that pairs an LLM agent with a symbolic rule-based system to generate labeled synthetic data for C code comments. Semantic rules, explicit algorithms, and a Python data generator control the quality and labeling of the generated data, addressing weaknesses of purely LLM-driven generation. In the experiments, augmenting the baseline with 5000 synthetic samples yields consistent Macro-F1 gains across the Random Forest (RF), Voting Classifier (VC), and Neural Network (NN) models, with the NN reaching 91.412% Macro-F1 on the augmented data. The results indicate that controlled synthetic data can improve code-related NLP tasks and suggest that the approach could scale to larger datasets and other domains.
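To make the idea of a rule-controlled Python data generator concrete, the following is a minimal sketch and not the paper's actual script. The label names ("Useful", "Not Useful"), the templates, the slot vocabulary, and the function names `fill` and `generate` are all hypothetical and chosen only for illustration; the summary above does not specify them.

```python
import random

# Hypothetical semantic rules: each label is tied to comment templates.
# These templates and slot values are illustrative assumptions, not the paper's rules.
USEFUL_TEMPLATES = [
    "/* {verb} {obj} to avoid {risk} */",
    "/* {verb} {obj} because {reason} */",
]
NOT_USEFUL_TEMPLATES = [
    "/* {obj} */",
    "/* TODO */",
]
SLOTS = {
    "verb": ["Free", "Validate", "Lock"],
    "obj": ["the buffer", "the file handle", "the mutex"],
    "risk": ["a memory leak", "a race condition"],
    "reason": ["the caller owns it", "the length is untrusted"],
}


def fill(template, rng):
    """Fill a template with random slot values.

    str.format ignores unused keyword arguments, so all slots can be passed.
    """
    return template.format(**{name: rng.choice(values) for name, values in SLOTS.items()})


def generate(n, seed=0):
    """Generate n labeled C-style comments, alternating labels for balance."""
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        label = "Useful" if i % 2 == 0 else "Not Useful"
        templates = USEFUL_TEMPLATES if label == "Useful" else NOT_USEFUL_TEMPLATES
        samples.append({"comment": fill(rng.choice(templates), rng), "label": label})
    return samples


if __name__ == "__main__":
    for sample in generate(4):
        print(sample["label"], "|", sample["comment"])
```

Because the label is chosen before the text is produced, every sample is labeled by construction, which is one way a symbolic generator can keep labeling under control where free-form LLM output cannot.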

Abstract

We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language. We also show how generating controlled synthetic data using this workflow fixes some of the notable weaknesses of LLM-based generation and increases the performance of classical machine learning models on the code comment classification task. Our best model, a Neural Network, achieves a Macro-F1 score of 91.412% with an increase of 1.033% after data augmentation.
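For readers unfamiliar with the reported metric, Macro-F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python under the standard definition; the function name `macro_f1` and the toy labels are assumptions for illustration and do not come from the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard Macro-F1 definition)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels (hypothetical): "U" = useful, "N" = not useful.
    y_true = ["U", "U", "N", "N"]
    y_pred = ["U", "N", "N", "N"]
    print(round(macro_f1(y_true, y_pred), 4))
```

Each class contributes equally to the average regardless of its size, so a model cannot score well by favoring the majority class alone.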

Paper Structure (19 sections, 5 figures, 3 tables, 1 algorithm)

This paper contains 19 sections, 5 figures, 3 tables, 1 algorithm.

Introduction
Related Work
Symbolic techniques and large language models
Synthetic data generation methods
Methodology
Semantic rules
Algorithm generation
Script creation
Experiments
Dataset description
Baseline data
Additional data
System description
Model choice
Features
...and 4 more sections

Figures (5)

Figure 1: Example of labeled code comment data
Figure 2: High-level architecture of neuro-symbolic synthetic data generation workflow
Figure 3: Example of rule-based prompting using semantic decomposition
Figure 4: Example of valid labeled code comment data samples generated by ChatGPT
Figure 5: Python script generation by ChatGPT
