On Enhancing Root Cause Analysis with SQL Summaries for Failures in Database Workload Replays at SAP HANA

Neetha Jambigi, Joshua Hammesfahr, Moritz Mueller, Thomas Bach, Michael Felderer

TL;DR

The paper tackles false positives in replay-based regression testing for SAP HANA by extending MIRA with an LLM-generated SQL failure summary feature to provide contextualized failure information. It introduces a new, labeled training dataset (25,453 events, 162 classes) and shows that incorporating summarized failed SQL statements improves F1-Macro by 4.77%, while enhancing interpretability for operators. By addressing data-evolution challenges and overlapping feature space, the work demonstrates practical gains in robustness and user insight, offering a viable long-term strategy that blends automated summaries with human-in-the-loop validation. The approach highlights the value of retrieval- and context-aware natural language tooling in software testing pipelines for complex database systems.

Abstract

Capturing the workload of a database and replaying this workload for a new version of the database can be an effective approach for regression testing. However, false positive errors caused by factors such as data privacy limitations, time dependency, or non-determinism in multi-threaded environments can reduce its effectiveness. Therefore, we employ a machine-learning-based framework to automate the root cause analysis of failures found during replays. Handling novel issues not found in the training data remains a general challenge for machine learning approaches, as it tests the generalizability of the learned model. We describe how we continue to address this challenge for more robust long-term solutions. In our experience, retraining with new failures is inadequate because features overlap across distinct root causes. Hence, we leverage a large language model (LLM) to analyze failed SQL statements and extract concise failure summaries as an additional feature to enhance the classification process. Our experiments show that the F1-Macro score improved by 4.77% for our data. We consider our approach beneficial for providing end users with additional information to gain more insight into the issues found and to improve the assessment of the replay results.
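For readers unfamiliar with the reported metric, the following is a minimal sketch of the standard F1-Macro definition: the unweighted mean of per-class F1 scores, so that rare root-cause classes count as much as frequent ones. The abstract does not say how the authors compute it, so the function name `f1_macro` and the three root-cause labels are hypothetical and chosen for illustration only.

```python
from collections import Counter


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical root-cause labels, not taken from the paper's 162 classes.
y_true = ["timeout", "timeout", "privacy", "privacy", "nondeterminism"]
y_pred = ["timeout", "privacy", "privacy", "privacy", "timeout"]
score = f1_macro(y_true, y_pred)
```

Because every class contributes equally to the average, a class that is never predicted correctly (here `nondeterminism`) pulls the score down regardless of how few samples it has, which is why the metric suits datasets with many small classes.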

Paper Structure

This paper contains 11 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Weekly Average User-Review
  • Figure 2: Constructing New Training Data
  • Figure 3: Before
  • Figure 4: After