STL: Still Tricky Logic (for System Validation, Even When Showing Your Work)

Isabelle Hurley; Rohan Paleja; Ashley Suh; Jaime D. Peña; Ho Chit Siu

STL: Still Tricky Logic (for System Validation, Even When Showing Your Work)

Isabelle Hurley, Rohan Paleja, Ashley Suh, Jaime D. Peña, Ho Chit Siu

TL;DR

The paper addresses the challenge of validating policies specified in Signal Temporal Logic (STL) by humans, questioning the assumed interpretability of formal methods. It designs ManeuverGame, a grid-world testbed, and evaluates three conditions—no active learning, active learning without feedback, and active learning with feedback—on human validation performance. Across 55 participants and ten tasks each, the study finds an overall accuracy of $65\% \pm 16\%$ with no significant advantage from active-learning interventions, suggesting that STL-based specifications remain challenging for human interpretability in validation. The work highlights the need for clearer definitions of interpretability, edge-case exposure strategies, and multi-session studies to advance human-centric validation of autonomous systems.

Abstract

As learned control policies become increasingly common in autonomous systems, there is increasing need to ensure that they are interpretable and can be checked by human stakeholders. Formal specifications have been proposed as ways to produce human-interpretable policies for autonomous systems that can still be learned from examples. Previous work showed that despite claims of interpretability, humans are unable to use formal specifications presented in a variety of ways to validate even simple robot behaviors. This work uses active learning, a standard pedagogical method, to attempt to improve humans' ability to validate policies in signal temporal logic (STL). Results show that overall validation accuracy is not high, at $65\% \pm 15\%$ (mean $\pm$ standard deviation), and that the three conditions of no active learning, active learning, and active learning with feedback do not significantly differ from each other. Our results suggest that the utility of formal specifications for human interpretability is still unsupported but point to other avenues of development which may enable improvements in system validation.

STL: Still Tricky Logic (for System Validation, Even When Showing Your Work)

TL;DR

with no significant advantage from active-learning interventions, suggesting that STL-based specifications remain challenging for human interpretability in validation. The work highlights the need for clearer definitions of interpretability, edge-case exposure strategies, and multi-session studies to advance human-centric validation of autonomous systems.

Abstract

(mean

standard deviation), and that the three conditions of no active learning, active learning, and active learning with feedback do not significantly differ from each other. Our results suggest that the utility of formal specifications for human interpretability is still unsupported but point to other avenues of development which may enable improvements in system validation.

Paper Structure (28 sections, 1 equation, 7 figures, 2 tables)

This paper contains 28 sections, 1 equation, 7 figures, 2 tables.

Introduction
Background
Formal Methods for Interpretable Learning
Human Active Learning for Validation
Methods
Experimental Testbed: ManeuverGame
Active Learning For Validation Concept
Human-Subjects Study
Procedure
Experiment Hypotheses
Results
Validation Task Performance
Effective Engagement
Trajectory Analysis
Subject Cognitive Process
...and 13 more sections

Figures (7)

Figure 1: Bloom's taxonomy forehand2010bloom applied to ManeuverGame. For a given problem, a subject must first recall the meaning of the information they're being presented with (e.g., a formal specification), and understand it in the given context (grid world). Applying these concepts in ManeuverGame enables the subject to analyze different trajectories, both valid and invalid, under a trial-and-error process. Evaluating the trajectories, a subject is able to hone in on specifications that are both valid and meet the specification, allowing them to create multiple such specifications. This process repeats itself for new problems.
Figure 2: STEM experience vs. validation score by experiment condition. Horizontal jitter was added to visually separate points. Linear fit is shown. Displayed $r$ and $p$ values were calculated using Spearman's coefficient.
Figure 3: Combined heat map and boxplot (same data) showing satisfaction of specification in trajectory generation by correctness for valid and invalid ground truth. Each datapoint represents one response to a validation satisfaction question.
Figure 4: Formal Methods Familiarity vs. Validation Score. Horizontal jitter was added to visually separate points. Displayed $r$ and $p$ values were calculated using Spearman's coefficient but a linear fit is shown.
Figure 5: Participant confidence in their answer when their answer was actually correct vs incorrect, split by experimental condition.
...and 2 more figures

STL: Still Tricky Logic (for System Validation, Even When Showing Your Work)

TL;DR

Abstract

STL: Still Tricky Logic (for System Validation, Even When Showing Your Work)

Authors

TL;DR

Abstract

Table of Contents

Figures (7)