Peeking Ahead of the Field Study: Exploring VLM Personas as Support Tools for Embodied Studies in HCI

Xinyue Gui, Ding Xia, Mark Colley, Yuan Li, Vishal Chauhan, Anubhav Anubhav, Zhongyi Zhou, Ehsan Javanmardi, Stela Hanbyeol Seo, Chia-Ming Chang, Manabu Tsukada, Takeo Igarashi

TL;DR

A fast, low-cost evaluation method using Vision-Language Model (VLM) personas to simulate outcomes comparable to field results, which shows promise for formative studies, field study preparation, and human data augmentation.

Abstract

Field studies are irreplaceable but costly, time-consuming, and error-prone, so they need careful preparation. Inspired by rapid prototyping in manufacturing, we propose a fast, low-cost evaluation method using Vision-Language Model (VLM) personas to simulate outcomes comparable to field results. While LLMs show human-like reasoning and language capabilities, autonomous vehicle (AV)-pedestrian interaction requires spatial awareness, emotional empathy, and behavioral generation. This raises our research question: To what extent can VLM personas mimic human responses in field studies? We conducted two parallel studies, both on a street-crossing task: 1) a real-world study with 20 participants, and 2) a video study with 20 VLM personas. We compared their responses and interviewed five HCI researchers on potential applications. Results show that VLM personas mimic human response patterns (e.g., average crossing times of 5.25 s vs. 5.07 s) but lack behavioral variability and depth. They show promise for formative studies, field study preparation, and human data augmentation.

Paper Structure (39 sections, 12 figures)

This paper contains 39 sections, 12 figures.

Figures (12)

  • Figure 1: The eHMI prototype: The AV has two behavior options: 1) continue driving without yielding (the first row), and 2) stop with a yielding intention (the second row). There are three eHMI conditions: 1) light strip (the first column), 2) eyes (the second column), and 3) no eHMI as baseline (the third column).
  • Figure 2: The AV autonomous mode is shown in two options (stop or pass): the left two images display the actual stopping point from the field study, and the right two show the simulation interface in Autoware.
  • Figure 3: Overview of the field study design, including experimenter roles and critical points.
  • Figure 4: The top row shows the five key recording points (green crosses) where the video was captured. The bottom part illustrates an example trajectory generated by the VLM. In the first column, the simulator begins playing the video recorded at position 0. If the VLM chooses 'forward,' the next column shows the video from position 1, one second later. If the VLM chooses 'stop' (e.g., in the fourth column), the simulator continues playing the video from the same position.
  • Figure 5: Procedure for a VLM persona answering the questionnaire. The persona first reviews a simulated memory from an interaction, then gives its rating, prints its reasoning for that rating, and repeats this cycle for all six conditions.
  • ...and 7 more figures
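
The stepping rule described in the Figure 4 caption, together with the six conditions implied by Figure 1 (two AV behaviors by three eHMI conditions), can be sketched in a few lines of Python. This is only an illustration of the captions, not the authors' simulator: the names step, replay, AV_BEHAVIORS, EHMI_CONDITIONS, and CONDITIONS are hypothetical, and two details are assumptions that the captions do not state, namely that each decision corresponds to one one-second step and that 'forward' at the last recording point keeps the persona there.

```python
from itertools import product

# Five key recording points (positions 0..4), per the Figure 4 caption.
NUM_POSITIONS = 5


def step(position: int, action: str) -> int:
    """Return the recording point whose video is played next.

    'forward' moves to the next recording point; 'stop' replays the same one.
    Clamping at the last point is an assumption, not stated in the caption.
    """
    if action == "forward":
        return min(position + 1, NUM_POSITIONS - 1)
    if action == "stop":
        return position
    raise ValueError(f"unknown action: {action!r}")


def replay(actions):
    """Return the sequence of positions shown, starting from position 0."""
    positions = [0]
    for action in actions:
        positions.append(step(positions[-1], action))
    return positions


# Six conditions from Figure 1: two AV behaviors x three eHMI conditions.
# The labels are illustrative shorthand for the caption's descriptions.
AV_BEHAVIORS = ("pass", "yield")
EHMI_CONDITIONS = ("light strip", "eyes", "none")
CONDITIONS = list(product(AV_BEHAVIORS, EHMI_CONDITIONS))

if __name__ == "__main__":
    print(replay(["forward", "forward", "forward", "stop", "forward"]))
    print(len(CONDITIONS))
```

In this sketch a 'stop' decision simply repeats the current position in the trajectory, which mirrors the caption's description of the simulator continuing to play the video from the same position.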