From Metrics to Meaning: Time to Rethink Evaluation in Human-AI Collaborative Design

Sean P. Walton; Ben J. Evans; Alma A. M. Rahat; James Stovold; Jakub Vincalek

TL;DR

This work targets the evaluation gap in human-AI collaborative design by showing that galleries of AI-generated design suggestions, especially those produced by MAP-Elites, increase user engagement and can improve design outcomes. Through a large field study (n = 808) and a controlled lab study (n = 12) of The Genetic Car Designer, the authors demonstrate that simply viewing gallery suggestions shapes cognitive, behavioural, and emotional engagement, and that engagement correlates with design quality, though not in a simple one-to-one manner. They argue that evaluation should be holistic, treating intelligent systems as integral to the user experience rather than as mere back-end tools, and they propose adaptive, citizen-science-friendly, and trust-building directions for future human-AI collaborative environments. For designers of gallery-based creativity tools, the findings suggest that diversity and transparency in AI-generated exemplars can help users explore more effectively and reach higher-quality outcomes.

Abstract

As AI systems increasingly shape decision-making in creative design contexts, understanding how humans engage with these tools has become a critical challenge for interactive intelligent systems research. This paper contributes a challenge to rethink how to evaluate human-AI collaborative systems, advocating for a more nuanced and multidimensional approach. Findings from one of the largest field studies to date (n = 808) of a human-AI co-creative system, The Genetic Car Designer, complemented by a controlled lab study (n = 12), are presented. The system is based on an interactive evolutionary algorithm where participants were tasked with designing a simple two-dimensional representation of a car. Participants were exposed to galleries of design suggestions generated by an intelligent system, MAP-Elites, and a random control. Results indicate that exposure to galleries generated by MAP-Elites significantly enhanced both cognitive and behavioural engagement, leading to higher-quality design outcomes. Crucially for the wider community, the analysis reveals that conventional evaluation methods, which often focus solely on behavioural and design quality metrics, fail to capture the full spectrum of user engagement. By considering the human-AI design process as a changing emotional, behavioural and cognitive state of the designer, we propose evaluating human-AI systems holistically and considering intelligent systems as a core part of the user experience, not simply a back-end tool.

Paper Structure (66 sections, 12 figures, 12 tables)

Introduction
Related Work
Galleries of Examples Support Creativity
Galleries of Examples can be Generated by Algorithms
Methods are Required to Select the Set of Examples to Show the Designer
Our Understanding of how the Method of Selection Affects the Human Experience is Limited
Methodology
The Genetic Car Designer
The Design Task
Mixed-Initiative Evolutionary Algorithm
The Live Views
The Gallery Views
The Editor View
Experimental Approach
Defining Engagement
...and 51 more sections

Figures (12)

Figure 1: The user journey when first launching The Genetic Car Designer. First, the user is asked whether they consent to their data being part of the research (A). The user then selects the course or level they wish to design a car for, along with the number of design dimensions they will have (B). They are then given a design brief (C) before being presented with the live view (D). In the main view, a toolbar at the top of the screen allows the user to navigate through alternate views.
Figure 2: An example of one of the courses participants can select from when starting the task. Cars are dropped into the course at the far left and simulated for 30 seconds. The quality of a design is then measured by the signed distance travelled along the horizontal axis from the design's initial contact point with the ground to its final resting point. The colour of each wheel and of the car's body is mapped to the mass of that component. Figure colours have been inverted for clarity.
Figure 3: A flow diagram illustrating the optimisation algorithm and the decision-making roles of the human and the algorithm. The algorithmic decision of which designs to populate the galleries with is highlighted, since it is the core focus of our study.
Figure 4: The default live view, which shows each design in the current generation as it is being evaluated. From here the user can select which designs to use to create the next generation and which to test in the next generation. They can also click edit to edit any design in the edit view. This is a real-time camera showing the actual position of the design in the current simulation. Figure colours have been inverted for clarity.
Figure 5: An example of a Gallery View, here showing the Speed Insights. Note the three blank gallery thumbnails, which indicate parts of the search space where designs have not been tested. Unlike the live view, the thumbnails here are static images from the last time step in the simulation where these designs were tested. Figure colours have been inverted for clarity.
...and 7 more figures
