Toward Neurosymbolic Program Comprehension
Alejandro Velasco, Aya Garryyeva, David N. Palacio, Antonio Mastropaolo, Denys Poshyvanyk
TL;DR
The paper addresses the limits of scaling Large Code Models by proposing a Neurosymbolic Program Comprehension (NsPC) framework that fuses probabilistic deep learning (DL) with symbolic rules to improve interpretability and determinism in program understanding. It uses SHAP values to identify patterns in how individual code tokens contribute to a prediction, then translates those patterns into symbolic rules that can guide post-training adjustments. A Java vulnerability-detection case study using CodeBERT shows interpretable SHAP-driven patterns linked to specific AST node types and token positions, supporting the core idea while acknowledging that the findings are specific to the model and dataset studied. The work sets the stage for a principled, neurosymbolic approach to program comprehension and outlines future efforts to formalize the theory and validate the rules with human input.
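The step from per-token attributions to symbolic rules can be illustrated with a minimal sketch. It assumes attribution scores (for example, SHAP values) have already been computed and that each token has been paired with its AST node type. The function names, the threshold, the "toward"/"away_from" rule format, and the toy scores are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_ast_type(records):
    """Average per-token attribution scores within each AST node type."""
    grouped = defaultdict(list)
    for _token, ast_type, score in records:
        grouped[ast_type].append(score)
    return {ast_type: mean(scores) for ast_type, scores in grouped.items()}


def derive_rules(mean_scores, threshold):
    """Keep AST types whose mean attribution magnitude reaches the threshold.

    Each rule is a (ast_type, direction) pair; the direction says whether the
    type pushes the prediction toward or away from the positive class.
    This rule format is an illustrative assumption.
    """
    rules = []
    for ast_type, score in sorted(mean_scores.items()):
        if abs(score) >= threshold:
            direction = "toward" if score > 0 else "away_from"
            rules.append((ast_type, direction))
    return rules


# Toy (token, AST type, attribution) triples; values are made up.
toy_records = [
    ("exec", "method_invocation", 0.5),
    ("query", "method_invocation", 0.25),
    ("userInput", "identifier", 0.125),
    ("tmp", "identifier", -0.125),
    ('"SELECT"', "string_literal", -0.5),
]

mean_scores = aggregate_by_ast_type(toy_records)
rules = derive_rules(mean_scores, threshold=0.25)
print(rules)
```

In this toy setting, AST types whose contributions cancel out on average yield no rule, while types with a consistent sign and sufficient magnitude become candidate rules that a human could inspect.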
Abstract
Recent advancements in Large Language Models (LLMs) have paved the way for Large Code Models (LCMs), enabling automation in complex software engineering tasks such as code generation, software testing, and program comprehension. Tools like GitHub Copilot and ChatGPT have shown substantial benefits in supporting developers across various practices. However, the ambition to scale these models to trillion-parameter sizes, exemplified by GPT-4, poses significant challenges that limit the usage of Artificial Intelligence (AI)-based systems powered by large Deep Learning (DL) models. These include rising computational demands for training and deployment, as well as issues related to trustworthiness, bias, and interpretability. Such factors can make managing these models impractical for many organizations, while their "black-box" nature undermines key aspects, including transparency and accountability. In this paper, we question the prevailing assumption that increasing the number of model parameters is always the optimal path forward, provided there is sufficient new data to learn additional patterns. In particular, we advocate for a Neurosymbolic research direction that combines the strengths of existing DL techniques (e.g., LLMs) with traditional symbolic methods, which are renowned for their reliability, speed, and determinism. To this end, we outline the core features and present preliminary results for our envisioned approach, aimed at establishing the first Neurosymbolic Program Comprehension (NsPC) framework to aid in identifying defective code components.
