InvAASTCluster: On Applying Invariant-Based Program Clustering to Introductory Programming Assignments

Pedro Orvalho, Mikoláš Janota, Vasco Manquinho

TL;DR

InvAASTCluster is a semantic-syntactic clustering approach for introductory programming assignments (IPAs) that combines dynamically inferred invariants with anonymized ASTs (AASTs). It encodes each program as a Bag-of-Words vector built from its invariants, its AAST, or both, and clusters the vectors with KMeans using $K=0.1N$ clusters, achieving better semantic separation than syntax-based methods. In experiments on 1620 correct submissions across 25 IPAs, the AAST+Invs representation attains the highest cluster accuracy (approximately $83.6\%$), outperforming the invariants-only and syntax-based representations. When integrated with Clara, it enables repairing around $13\%$ more submissions in less time. The work provides a modular, open-source framework and proposes two directions for future work: applying the representations to deep learning for fault localization, and extending the approach to more complex programs and other programming languages.

Abstract

Due to the vast number of students enrolled in programming courses, there has been an increasing number of automated program repair techniques focused on introductory programming assignments (IPAs). Typically, such techniques use program clustering to take advantage of previous correct student implementations to repair a new incorrect submission. These repair techniques use clustering methods because analyzing all available correct submissions to repair a program is not feasible. However, conventional clustering methods rely on program representations based on features such as abstract syntax trees (ASTs), syntax, control flow, and data flow. This paper proposes InvAASTCluster, a novel approach for program clustering that uses dynamically generated program invariants to cluster semantically equivalent IPAs. InvAASTCluster's program representation combines the program's semantics, through its invariants, with its structure, through its anonymized abstract syntax tree (AAST). Invariants denote conditions that must remain true during program execution, while AASTs are ASTs devoid of variable and function names, retaining only their types. Our experiments show that the proposed program representation outperforms syntax-based representations when clustering a set of correct IPAs. Furthermore, we integrate InvAASTCluster into a state-of-the-art clustering-based program repair tool. Our results show that InvAASTCluster advances the current state of the art when used by clustering-based repair tools, repairing around 13% more students' programs in a shorter amount of time.
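
The encoding step summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `(kind, type, name)` node triples, the pre-normalized invariant strings such as `v0 >= 0`, and the helper names are all hypothetical, and the rounding of $K=0.1N$ is an assumption. The sketch only shows how dropping identifiers makes two differently named programs share AAST tokens, and how AAST tokens and invariants can be counted into one Bag-of-Words vector.

```python
from collections import Counter


def anonymize(ast_nodes):
    """Turn hypothetical (kind, type, name) AST triples into AAST tokens.

    The identifier is dropped; only the node kind and its type are kept.
    """
    return [f"{kind}:{typ}" for kind, typ, _name in ast_nodes]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def encode(ast_nodes, invariants, vocabulary):
    """Combined representation: AAST tokens plus invariant strings."""
    return bag_of_words(anonymize(ast_nodes) + list(invariants), vocabulary)


def num_clusters(n_programs, ratio=0.1):
    """K = 0.1 * N; rounding and the floor of 1 are assumptions of this sketch."""
    return max(1, round(ratio * n_programs))


# Two toy programs that differ only in their variable names.
prog_a = [("Decl", "int", "i"), ("Decl", "int", "total")]
prog_b = [("Decl", "int", "k"), ("Decl", "int", "acc")]

# Hypothetical invariants, assumed to be already normalized by some
# dynamic invariant detector so that variable names do not matter.
invs_a = ["v0 >= 0"]
invs_b = ["v0 >= 0", "v1 >= v0"]

vocab = sorted(set(anonymize(prog_a) + anonymize(prog_b) + invs_a + invs_b))
vec_a = encode(prog_a, invs_a, vocab)
vec_b = encode(prog_b, invs_b, vocab)
```

In this toy case the two programs have identical AAST tokens, so only the invariant part of the vector tells them apart. Vectors of this kind would then be handed to a KMeans implementation with `num_clusters(N)` clusters.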

Paper Structure (74 sections, 3 equations, 21 figures, 5 tables)

Figures (21)

  • Figure 1: Clustering-based Program Repair.
  • Figure 2: A small example of an AST and an AAST for the variable declaration int i, i.e., an integer variable with identifier $i$.
  • Figure 3: The high-level overview of InvAASTCluster.
  • Figure 4: Finding the closest correct program, i.e., the cluster representative whose vector representation is closest to that of the incorrect submission. This approach passes only one program to the repair tool instead of $K$ programs.
  • Figure 5: Comparison between the ground truth (on the right) and the clusters and cluster accuracy obtained using the KMeans algorithm (on the left) for each type of program representation.
  • ...and 16 more figures
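
The selection step described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: Euclidean distance is assumed as the metric, and the cluster names and vectors are made up.

```python
import math


def closest_representative(incorrect_vec, representatives):
    """Return the name of the representative nearest to the incorrect submission.

    `representatives` maps a hypothetical cluster name to the vector of that
    cluster's representative correct program. Euclidean distance is an
    assumption of this sketch.
    """
    return min(
        representatives,
        key=lambda name: math.dist(incorrect_vec, representatives[name]),
    )


# Made-up Bag-of-Words vectors for two cluster representatives.
representatives = {
    "cluster_0": [2, 1, 0],
    "cluster_1": [0, 3, 2],
}
incorrect = [2, 0, 0]

chosen = closest_representative(incorrect, representatives)
```

Only the program behind `chosen` would be passed to the repair tool, instead of one program per cluster.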

Theorems & Definitions (11)

  • Example 1
  • Definition 3.1: Context-free Grammar (CFG)
  • Definition 3.2: Domain-Specific Language (DSL)
  • Definition 3.3: Abstract Syntax Tree (AST)
  • Definition 3.4: Anonymized Abstract Syntax Tree (AAST)
  • Definition 3.5: Program Invariant
  • Definition 3.6: Bag of Words (BoW)
  • Example 2
  • Example 3
  • Example 4
  • ...and 1 more