CONCORD: Towards a DSL for Configurable Graph Code Representation
Mootez Saad, Tushar Sharma
TL;DR
The paper tackles the problem of building flexible, scalable, and interoperable graph representations of source code for deep learning (DL) tasks across programming languages. It introduces CONCORD, a domain-specific language (DSL) that unifies edge augmentation techniques and enables configurable graph construction, together with two pruning methods that reduce graph size. A code smell detection case study indicates that CONCORD can maintain or improve performance while reducing computational cost, with reported figures of up to 100% performance and a 10.15% reduction in computations. The work provides a replication package to improve reproducibility and reduce engineering effort in graph-based code analysis.
Abstract
Deep learning (DL) is widely used to uncover hidden patterns in large code corpora. To achieve this, it is essential to construct a format that captures the relevant characteristics and features of source code. Graph-based representations have gained attention for their ability to model structural and semantic information. However, existing tools lack flexibility in constructing graphs across different programming languages, which limits their use. Additionally, the output of these tools often lacks interoperability and results in excessively large graphs, making the training of graph-based neural networks slower and less scalable. We introduce CONCORD, a domain-specific language for building customizable graph representations. It implements reduction heuristics to reduce the size complexity of the graphs. We demonstrate its effectiveness in code smell detection as an illustrative use case and show that: first, CONCORD can produce code representations automatically per the specified configuration, and second, our heuristics can achieve comparable performance with significantly reduced size. CONCORD will help researchers (a) create and experiment with customizable graph-based code representations for different software engineering tasks involving DL, (b) reduce the engineering work needed to generate graph representations, (c) address the issue of scalability in graph neural network (GNN) models, and (d) enhance the reproducibility of experiments in research through a standardized approach to code representation and analysis.
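To make the idea of configuration-driven graph construction with size reduction concrete, the following is a minimal sketch and not CONCORD's actual syntax or implementation. It assumes a hypothetical configuration dictionary with an `edges` list and a `prune` tuple, uses Python's standard `ast` module as a stand-in parser, and treats the removal of context marker nodes as an illustrative pruning heuristic. The function name `build_graph` and the edge labels `child` and `next_sibling` are likewise assumptions made for this example.

```python
import ast


def build_graph(source, config):
    """Build (nodes, edges) from Python source according to `config`.

    `config` is a hypothetical format for this sketch:
      - "edges": edge labels to emit ("child", "next_sibling")
      - "prune": AST node classes to drop from the graph
    """
    tree = ast.parse(source)
    pruned = tuple(config.get("prune", ()))
    wanted = set(config.get("edges", ()))
    nodes = []  # index = node id, value = AST type name
    edges = []  # (source id, target id, edge label)

    def visit(node):
        # A pruned node is skipped; the ids of its kept descendants are
        # returned so they attach to the nearest kept ancestor.
        if isinstance(node, pruned):
            kept = []
            for child in ast.iter_child_nodes(node):
                kept.extend(visit(child))
            return kept

        node_id = len(nodes)
        nodes.append(type(node).__name__)

        child_ids = []
        for child in ast.iter_child_nodes(node):
            child_ids.extend(visit(child))

        if "child" in wanted:
            edges.extend((node_id, c, "child") for c in child_ids)
        if "next_sibling" in wanted:
            # Edge augmentation: link consecutive children of one parent.
            edges.extend(
                (a, b, "next_sibling")
                for a, b in zip(child_ids, child_ids[1:])
            )
        return [node_id]

    visit(tree)
    return nodes, edges


if __name__ == "__main__":
    code = "x = a + b"
    full = build_graph(code, {"edges": ["child", "next_sibling"]})
    small = build_graph(
        code,
        {"edges": ["child"], "prune": (ast.expr_context,)},
    )
    print(len(full[0]), "nodes before pruning,", len(small[0]), "after")
```

Changing only the configuration yields a different graph from the same source, which is the kind of experimentation the abstract describes. A real tool would additionally need language-independent parsing and an interoperable output format, which this sketch does not attempt.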
