Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges

Mark Harman, Peter O'Hearn, Shubho Sengupta

TL;DR

This work reframes automated LLM-based software testing around hardening and catching tests, and introduces the Just-in-Time test (JiTTest) paradigm, in which tests are generated to catch regressions and faults in new functionality before they reach production. It gives formal definitions of hardening and catching tests, distinguishes regression-catching from functionality-catching, and analyzes deployment strategies, including the benefits and risks of engineer involvement. The paper reports initial results from Meta with ACH and mutation-guided testing, and sets out a research agenda that includes the Catching JiTTest Challenge and oracle scavenging for non-executable information. Overall, the approach seeks to move beyond the regression-coverage paradigm toward verifiable, timely, high-signal testing that can reveal latent bugs while reducing false positives. Its practical value lies in enabling more reliable just-in-time test generation for large-scale, LLM-assisted software development and maintenance.

Abstract

Despite decades of research and practice in automated software testing, several fundamental concepts remain ill-defined and under-explored, yet offer enormous potential real-world impact. We show that these concepts raise exciting new challenges in the context of Large Language Models for software test generation. More specifically, we formally define and investigate the properties of hardening and catching tests. A hardening test is one that seeks to protect against future regressions, while a catching test is one that catches such a regression or a fault in new functionality introduced by a code change. Hardening tests can be generated at any time and may become catching tests when a future regression is caught. We also define and motivate the Catching 'Just-in-Time' (JiTTest) Challenge, in which tests are generated 'just-in-time' to catch new faults before they land in production. We show that any solution to Catching JiTTest generation can also be repurposed to catch latent faults in legacy code. We enumerate possible outcomes for hardening and catching tests and JiTTests, and discuss open research problems, deployment options, and initial results from our work on automated LLM-based hardening at Meta. This paper was written to accompany the keynote by the authors at the ACM International Conference on the Foundations of Software Engineering (FSE) 2025. Author order is alphabetical. The corresponding author is Mark Harman.
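
To make the distinction between hardening and catching concrete, the following is a minimal sketch that covers only the regression case. It is not the paper's formal definition: the names `parent_add`, `child_add`, `check_negative_operand`, `check_positive_operands`, and `classify` are hypothetical, and a "revision" is simplified to a single function. The assumption is that a test which passes on the parent revision acts as a hardening test, and that it becomes a catching test if it then fails on a child revision that introduces a regression.

```python
from typing import Callable

# Hypothetical simplification: a revision is just the function under test,
# and a test is a callable that returns True when it passes on a revision.
Revision = Callable[[int, int], int]
Test = Callable[[Revision], bool]


def parent_add(a: int, b: int) -> int:
    """Parent revision: the current behaviour, assumed correct."""
    return a + b


def child_add(a: int, b: int) -> int:
    """Child revision: a code change that regresses for negative b."""
    return a + abs(b)


def check_negative_operand(rev: Revision) -> bool:
    """Exercises the behaviour that the child revision breaks."""
    return rev(2, -3) == -1


def check_positive_operands(rev: Revision) -> bool:
    """Exercises behaviour that the child revision leaves intact."""
    return rev(2, 3) == 5


def classify(test: Test, parent: Revision, child: Revision) -> str:
    """Label a test by its pass/fail outcome on parent and child."""
    if not test(parent):
        # A test that fails on the parent cannot protect current behaviour.
        return "not hardening"
    if test(child):
        return "hardening only"
    return "hardening on parent, catching on child"


if __name__ == "__main__":
    for check in (check_negative_operand, check_positive_operands):
        print(check.__name__, "->", classify(check, parent_add, child_add))
```

In this sketch the same test is generated once against the parent and only later turns out to be catching, which mirrors the statement above that hardening tests may become catching tests when a future regression is caught. Faults in new functionality, where no parent behaviour exists to compare against, are deliberately left out.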

Paper Structure

This paper contains 27 sections, 11 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Venn diagram of hardening and catching tests. Strong tests are a subset of weak tests. A test can be both (strong/weak) hardening on the parent and (strong/weak) catching on the child. Each region of the Venn diagram corresponds to an interesting and important test behaviour.

Theorems & Definitions (21)

  • Definition 1: Real Time (taken from [mhetal:intense24-keynote])
  • Definition 2: Fundamentals
  • Definition 3: Timely Test
  • Definition 4: Just-In-Time Test (JiTTest)
  • Definition 5: Oracle
  • Definition 6: (Weak) Hardening Test
  • Definition 7: (Strong) Hardening Test
  • Example 1: Weak Hardening Test
  • Example 2: Strong Hardening Test
  • Definition 8: (Perfect Precision) Hardening Test
  • ...and 11 more