From Metrics to Meaning: Time to Rethink Evaluation in Human-AI Collaborative Design
Sean P. Walton, Ben J. Evans, Alma A. M. Rahat, James Stovold, Jakub Vincalek
TL;DR
This work targets the evaluation gap in human–AI collaborative design by showing that galleries of AI-generated design suggestions, especially those produced by MAP–Elites, increase user engagement and can improve design outcomes. Through a large field study (n=808) and a controlled lab study (n=12) of The Genetic Car Designer, the authors demonstrate that simply viewing gallery suggestions shapes cognitive, behavioral, and emotional engagement, and that engagement correlates with design quality, though not in a simple, one-to-one manner. They argue that evaluation should be holistic, treating intelligent systems as integral to the user experience rather than mere back-end tools, and propose adaptive, citizen-science–friendly, and trust-building directions for future human–AI collaborative environments. The findings have practical implications for designing gallery-based creativity tools, suggesting that diversity and transparency in AI-generated exemplars can empower designers to explore more effectively and reach higher-quality outcomes.
Abstract
As AI systems increasingly shape decision-making in creative design contexts, understanding how humans engage with these tools has become a critical challenge for interactive intelligent systems research. This paper challenges the community to rethink how human–AI collaborative systems are evaluated, advocating for a more nuanced and multidimensional approach. We present findings from one of the largest field studies to date (n = 808) of a human–AI co-creative system, The Genetic Car Designer, complemented by a controlled lab study (n = 12). The system is based on an interactive evolutionary algorithm, and participants were tasked with designing a simple two-dimensional representation of a car. Participants were exposed to galleries of design suggestions generated by an intelligent system, MAP–Elites, or by a random control. Results indicate that exposure to galleries generated by MAP–Elites significantly enhanced both cognitive and behavioural engagement, leading to higher-quality design outcomes. Crucially for the wider community, the analysis reveals that conventional evaluation methods, which often focus solely on behavioural and design quality metrics, fail to capture the full spectrum of user engagement. By considering the human–AI design process as a changing emotional, behavioural and cognitive state of the designer, we propose evaluating human–AI systems holistically and considering intelligent systems as a core part of the user experience, not simply a back-end tool.
