Evaluating Front-end & Back-end of Human-Automation Interaction Applications: Δ-EVAL, A Hypothetical Benchmark
Gonçalo Hora de Carvalho
TL;DR
Δ-EVAL addresses the need for a holistic benchmark for Human-Automation Interaction (HAI) by unifying the evaluation of front-end user interfaces and back-end automation through cognitively grounded metrics. It systematically translates cognitive engineering models (e.g., PC, SDT, NDM, CCT, Lens Model) into evaluative constructs and proposes a suite of interrelated metrics (CIB, OP, ASE, NI, IR, CSI, FE, OL, CRI, etc.) to quantify operator-system interactions. The framework emphasizes iterative design and standardization, as well as the potential to use causal graphs for identifying intervention effects, with the aim of improving safety and performance in high-stakes domains. While largely conceptual, Δ-EVAL provides a modular blueprint for empirical validation, dataset design, and integration with advances in AI benchmarking, to guide the future development of human-centric, reproducible HAI evaluations.
Abstract
Human Factors, Cognitive Engineering, and Human-Automation Interaction (HAI) form a trifecta in which users and technological systems of ever-increasing autonomous control occupy a centre position. But with great autonomy comes great responsibility. It is in this context that we propose metrics and a benchmark framework based on known regimes in Artificial Intelligence (AI). A benchmark is a set of tests or tasks together with the metrics or measurements conducted on them. We hypothesise about possible tasks designed to assess operator-system interactions and both the front-end and back-end components of HAI applications. Here, the front-end pertains to the user interface and the direct interactions the user has with a system, while the back-end comprises the underlying processes and mechanisms that support the front-end experience. By evaluating HAI systems through the proposed metrics, which are based on Cognitive Engineering studies of judgment and prediction, we attempt to unify many known taxonomies and design guidelines for HAI systems in a single benchmark. To do so, we provide a structured, formal approach to quantifying the efficacy and reliability of these systems, inspired by recent rapid developments in AI benchmarking techniques. In this way, we attempt to guide design principles towards a testable benchmark that yields reproducible results and is future-proof, general, and insightful across both the cognitive and technological stacks of any HAI application.
