Quantifying Semantic Query Similarity for Automated Linear SQL Grading: A Graph-based Approach

Leo Köberlein; Dominik Probst; Richard Lenz

TL;DR

This paper presents a graph-based framework for quantifying the semantic similarity between SQL queries for automated grading. Queries are modeled as nodes in an implicit graph, with edits serving as edges; the cheapest edit sequence between two queries yields a semantic distance, enabling partial feedback and a linear scoring scheme. A prototype implementing numerous edits demonstrates feasibility: in a user survey, its fairness and comprehensibility were competitive with manual grading and dynamic analysis. The approach is extensible, supports incomplete or non-executable ASTs, and can be adapted to other domains or languages, offering a scalable path toward richer, semantically informed query assessment.

Abstract

Quantifying the semantic similarity between database queries is a critical challenge with broad applications, ranging from query log analysis to automated educational assessment of SQL skills. Traditional methods often rely solely on syntactic comparisons or are limited to checking for semantic equivalence. This paper introduces a novel graph-based approach to measuring the semantic dissimilarity between SQL queries. Queries are represented as nodes in an implicit graph; the transitions between nodes, called edits, are weighted by semantic dissimilarity. We employ shortest-path algorithms to identify the lowest-cost edit sequence between two given queries, thereby defining a quantifiable measure of semantic distance. A prototype implementation of this technique has been evaluated in an empirical study, which strongly suggests that our method provides more accurate and comprehensible grading than existing techniques. Moreover, the results indicate that our approach comes close to the quality of manual grading, making it a robust tool for diverse database query comparison tasks.
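
The idea of searching an implicit graph of queries for the cheapest edit sequence can be illustrated with a minimal sketch. It is not the authors' prototype: the `Query` class is a toy stand-in for a query AST, `SCHEMA_COLUMNS` is a hypothetical schema, and `unsetDistinct` and `removeSelectColumnReference` are hypothetical inverse edits. The edit names `setDistinct` and `addSelectColumnReference` and the cost 2 for `setDistinct` are taken from the figure captions; all other costs are assumptions. Dijkstra's algorithm is used here as one possible shortest-path algorithm.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Query:
    """Toy stand-in for a query AST: a DISTINCT flag plus selected columns."""
    distinct: bool
    columns: frozenset


SCHEMA_COLUMNS = ("id", "name", "age")  # hypothetical schema, for illustration only


def neighbors(q):
    """Yield (edit_name, cost, next_query) for every edit applicable to q.

    The graph is implicit: neighbors are generated on demand, never stored.
    All costs except setDistinct (2) are assumptions.
    """
    if not q.distinct:
        yield "setDistinct", 2, Query(True, q.columns)
    else:
        yield "unsetDistinct", 2, Query(False, q.columns)
    for col in SCHEMA_COLUMNS:
        if col not in q.columns:
            yield (f"addSelectColumnReference({col})", 1,
                   Query(q.distinct, q.columns | {col}))
        else:
            yield (f"removeSelectColumnReference({col})", 1,
                   Query(q.distinct, q.columns - {col}))


def semantic_distance(start, dest):
    """Dijkstra over the implicit graph; returns (total_cost, edit_names) or None."""
    tie = count()  # tie-breaker so the heap never has to compare Query objects
    frontier = [(0, next(tie), start, [])]
    settled = set()
    while frontier:
        cost, _, q, path = heapq.heappop(frontier)
        if q == dest:
            return cost, path
        if q in settled:
            continue
        settled.add(q)
        for name, edit_cost, nxt in neighbors(q):
            if nxt not in settled:
                heapq.heappush(
                    frontier, (cost + edit_cost, next(tie), nxt, path + [name])
                )
    return None  # destination not reachable with the available edits


# SELECT name  ->  SELECT DISTINCT name, age
start = Query(False, frozenset({"name"}))
dest = Query(True, frozenset({"name", "age"}))
distance, edits = semantic_distance(start, dest)
```

In this toy setting the returned edit list doubles as feedback: each entry names one change separating the submitted query from the reference query, and the summed cost is the distance a grading scheme could build on.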

Paper Structure (36 sections, 1 equation, 10 figures, 1 table, 1 algorithm)

Introduction
Goals
Related Works
Dynamic analysis
Static analysis for equivalence
Static analysis for similarity
Concept
Definitions
Idea
Nodes
Syntactic and semantic differences
Non-executability
Node definition and comparison
Edits
Edits vs. edges
...and 21 more sections

Figures (10)

Figure 1: Two subsequent applications of the edit with cost 2 called setDistinct on the start query.
Figure 2: Multiple neighbors from one application of addSelectColumnReference.
Figure 3: Possible paths from the start query to the destination query.
Figure 4: One edit representing multiple outgoing edges.
Figure 5: Two atomic edits with combined cost 2 vs. one shortcut edit with cost 1.
...and 5 more figures

Theorems & Definitions (4)

Definition 1: Parsable
Definition 2: Executable
Definition 3: Syntactically equivalent
Definition 4: Semantically equivalent
