InvAASTCluster: On Applying Invariant-Based Program Clustering to Introductory Programming Assignments
Pedro Orvalho, Mikoláš Janota, Vasco Manquinho
TL;DR
InvAASTCluster is a clustering approach for introductory programming assignments (IPAs) that combines semantic and syntactic information: dynamically inferred invariants and anonymized ASTs (AASTs). It encodes each program as a Bag-of-Words vector built from its invariants, its AAST, or both, and clusters the vectors with KMeans using $K=0.1N$, achieving higher semantic separation than syntax-based methods. In experiments on 1620 correct submissions across 25 IPAs, the AAST+Invs representation attains the highest cluster accuracy (approximately $83.6\%$), outperforming both invariants-only and syntax-based representations. When integrated with Clara, it enables repairing around $13\%$ more submissions in less time. The work provides a modular, open-source framework and suggests future work on applying the representations to deep learning for fault localization and on extending the approach to more complex programs and other programming languages.
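The encoding step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the invariant strings, the AAST tokens, and the helper names `bag_of_words` and `number_of_clusters` are hypothetical, and the sketch assumes that $N$ denotes the number of programs being clustered and that $K$ is rounded to the nearest integer with a minimum of one cluster.

```python
from collections import Counter


def bag_of_words(token_lists):
    """Turn each program's token list into a count vector over a shared vocabulary."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


def number_of_clusters(n_programs, ratio=0.1):
    """K = 0.1 * N; rounding and the lower bound of 1 are assumptions of this sketch."""
    return max(1, round(ratio * n_programs))


# Hypothetical features for two programs: invariants and AAST node labels.
invariants = [["i >= 0", "n >= i"], ["i >= 0"]]
aast_tokens = [["For", "Assign"], ["While", "Assign", "Assign"]]

# The combined representation concatenates both feature sets per program.
combined = [inv + tok for inv, tok in zip(invariants, aast_tokens)]
vocab, vectors = bag_of_words(combined)
k = number_of_clusters(len(combined))
```

The resulting vectors could then be handed to any KMeans implementation with `k` clusters, for example scikit-learn's `KMeans(n_clusters=k)`.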
Abstract
Due to the vast number of students enrolled in programming courses, there has been an increasing number of automated program repair techniques focused on introductory programming assignments (IPAs). Typically, such techniques use program clustering to take advantage of previous correct student implementations to repair a new incorrect submission. These repair techniques rely on clustering because analyzing all available correct submissions to repair a program is not feasible. However, conventional clustering methods rely on program representations based on features such as abstract syntax trees (ASTs), syntax, control flow, and data flow. This paper proposes InvAASTCluster, a novel approach for program clustering that uses dynamically generated program invariants to cluster semantically equivalent IPAs. InvAASTCluster's program representation combines the program's semantics, through its invariants, and its structure, through its anonymized abstract syntax tree (AAST). Invariants denote conditions that must remain true during program execution, while AASTs are ASTs devoid of variable and function names, retaining only their types. Our experiments show that the proposed program representation outperforms syntax-based representations when clustering a set of correct IPAs. Furthermore, we integrate InvAASTCluster into a state-of-the-art clustering-based program repair tool. Our results show that InvAASTCluster advances the current state of the art when used by clustering-based repair tools, repairing around 13% more students' programs in a shorter amount of time.
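To make the idea of an anonymized AST concrete, the following is a rough analogy using Python's standard `ast` module. It is only a sketch under stated assumptions: the `Anonymizer` class and the `FUNC`/`VAR` placeholders are hypothetical, and because Python identifiers carry no declared types, the sketch drops names without retaining types as the AASTs described above do.

```python
import ast


class Anonymizer(ast.NodeTransformer):
    """Replace function and variable names with fixed placeholders (hypothetical scheme)."""

    def visit_FunctionDef(self, node):
        node.name = "FUNC"
        self.generic_visit(node)  # also anonymize arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "VAR"
        return node

    def visit_Name(self, node):
        node.id = "VAR"
        return node


def anonymized_dump(source):
    """Return a textual form of the AST with identifiers anonymized."""
    tree = Anonymizer().visit(ast.parse(source))
    return ast.dump(tree)


# Two submissions that differ only in identifier names, and one that differs in structure.
prog_a = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
prog_b = "def soma(valores):\n    s = 0\n    for v in valores:\n        s += v\n    return s\n"
prog_c = "def total(xs):\n    return sum(xs)\n"

same_structure = anonymized_dump(prog_a) == anonymized_dump(prog_b)
different_structure = anonymized_dump(prog_a) != anonymized_dump(prog_c)
```

In this sketch, `prog_a` and `prog_b` map to the same anonymized tree, which is why a purely structural representation cannot tell apart programs that look alike but behave differently; the invariants are what add the semantic signal.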
