Which Experimental Design is Better Suited for VQA Tasks? Eye Tracking Study on Cognitive Load, Performance, and Gaze Allocations

Sita A. Vriend, Sandeep Vidyapu, Amer Rama, Kun-Ting Chen, Daniel Weiskopf

TL;DR

This study addresses how the order of image stimuli and questions, together with question modality, shapes cognitive load, accuracy, and gaze in visual question answering (VQA). It compares five designs using eye tracking and subjective and objective measures (NASA-TLX, accuracy, hit-any-AOI rate (HAAR), fixation duration) to identify designs that minimize extraneous cognitive burden. Key findings show that the IQ design is the most taxing and least accurate, while the QI, IQI, and QIQ designs offer better performance with varied gaze patterns; the auditory AIA design may hinder comprehension. The results provide practical guidance for designing robust visualization experiments and gaze-based studies that rely on VQA tasks.

Abstract

We conducted an eye-tracking user study with 13 participants to investigate the influence of stimulus-question ordering and question modality on participants using visual question-answering (VQA) tasks. We examined cognitive load, task performance, and gaze allocations across five distinct experimental designs, aiming to identify setups that minimize the cognitive burden on participants. The collected performance and gaze data were analyzed using quantitative and qualitative methods. Our results indicate a significant impact of stimulus-question ordering on cognitive load and task performance, as well as a noteworthy effect of question modality on task performance. These findings offer insights for the experimental design of controlled user studies in visualization research.

Paper Structure

This paper contains 15 sections and 3 figures.

Figures (3)

  • Figure 1: Example of (a) image stimulus, (b) corresponding task (question), and (c) response selection. The correct answer here is "right" because the tower is on the right side; however, the participant selected "left."
  • Figure 2: Violin plots of cognitive load according to NASA-TLX rating (A), task accuracy (B), hit-any-AOI rate per experimental design (C), and mean fixation duration measured in milliseconds (D). The horizontal black line in each plot represents the mean. Significant differences according to post-hoc tests are marked with asterisks (* p < 0.05; ** p < 0.01; *** p < 0.001).
  • Figure 3: Visual scanpath overlaid on images of three study designs of a selected question, where the number and radius indicate the fixation sequence and its duration, respectively. The yellow and red dots indicate the beginning and the end of a scanpath. Aggregated attention is displayed in density maps (d, h, l), while scarf plots (e, i, m) show the fixation duration over a target AOI (colored in light blue) across all participants. The task correctness is shown as green ticks (for correct answers) and red crosses (for incorrect answers). Percentage shows the relative fixation duration spent on the target AOI. Scarf plots in each row are ordered by decreasing relative fixation duration of a target AOI.
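
The Figure 3 caption reports, per participant, the relative fixation duration spent on the target AOI and orders the scarf plots by that value, while Figure 2 (D) reports mean fixation duration. A minimal sketch of how such values could be computed is given below. It is an illustration only, not the authors' analysis code: the `Fixation` and `AOI` types, the rectangular AOI shape, and all function names are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fixation:
    # Hypothetical fixation record: screen position and duration.
    x: float
    y: float
    duration_ms: float


@dataclass
class AOI:
    # Assumption: the target AOI is an axis-aligned rectangle.
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, f: Fixation) -> bool:
        return self.left <= f.x <= self.right and self.top <= f.y <= self.bottom


def relative_fixation_duration(fixations: List[Fixation], target: AOI) -> float:
    """Share of total fixation time that falls on the target AOI (0.0 to 1.0)."""
    total = sum(f.duration_ms for f in fixations)
    if total == 0:
        return 0.0
    on_target = sum(f.duration_ms for f in fixations if target.contains(f))
    return on_target / total


def mean_fixation_duration(fixations: List[Fixation]) -> float:
    """Mean fixation duration in milliseconds; 0.0 for an empty scanpath."""
    if not fixations:
        return 0.0
    return sum(f.duration_ms for f in fixations) / len(fixations)


def order_by_target_share(scanpaths: Dict[str, List[Fixation]], target: AOI) -> List[str]:
    """Participant IDs sorted by decreasing relative fixation duration on the target."""
    return sorted(
        scanpaths,
        key=lambda pid: relative_fixation_duration(scanpaths[pid], target),
        reverse=True,
    )
```

Calling `order_by_target_share` with one scanpath per participant yields the row order for a scarf plot like the ones the caption describes. Non-rectangular AOIs would need a different `contains` test.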