IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

TL;DR

Fixations closest to the moment participants begin asking their question aloud are the most informative for disambiguation in large VLMs. Using them more than doubles response accuracy on ambiguous questions while maintaining performance on unambiguous queries.

Abstract

We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset for using eye-movement data to disambiguate VQA, a novel real-time interactive protocol, and an evaluation suite.

Paper Structure (27 sections, 1 equation, 15 figures, 2 tables)


Figures (15)

  • Figure 1: IRIS overview. A participant asks an ambiguous question about an image while their eyes are tracked. The VLM uses the fixation data (marked as a white cross) to disambiguate the query and provide an accurate response in real time.
  • Figure 2: Experimental procedure. A central fixation check was enforced, after which participants freely viewed each image and asked any question aloud about it. Once 1.5s of silence elapsed following the question, the VLM was prompted with (i) the image, (ii) the transcribed question, and (iii) the same image with fixation data superimposed. Finally, participants reported the object they queried about (location of interest) by clicking the corresponding region of the image.
  • Figure 3: Qualitative results showing successful (A) and failed (B) disambiguation using gaze data. Black circles with white crosses mark temporally and spatially filtered fixation locations.
  • Figure 4: Temporal-spatial filtering of eye gaze data. Fixations are colored by time. Black lines on the color bar mark speech onset and end. Diamonds represent fixations within ±1 s of speech onset; all others are circles. Any diamond within 2 degrees of visual angle (dva) of the diamonds’ median (red +) passes the spatial filter, is rendered as a white cross on a black circle, and is passed on to IRIS.
  • Figure 5: Temporal dynamics of gaze informativeness for ambiguous questions. (A) Model performance increases as the temporal window expands around speech onset, converging to the "all-fixations" baseline similarity at ±4500 ms around speech onset. The bottom panel shows the distance between the fixation median and the Location of Interest (LOI) decreasing (dark blue) and the fixation count increasing (red) with larger windows. (B) Sliding-window analysis (600 ms window width, 400 ms step size) reveals peak performance a few milliseconds before speech onset, aligning with the minimum fixation-to-LOI distance. The gray shaded region indicates the interquartile range of speech end times. Error bars represent SEM. (A) and (B) top: green - "LOI (perfect gaze)" upper bound; orange - "all-fixations" baseline; purple - "image-only" baseline; pink - "wrong answer" lower bound (see Section \ref{method:bounds} for details).
  • ...and 10 more figures
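
The temporal-spatial filtering described in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Fixation` type, the `filter_fixations` function, and its parameter names are hypothetical, and it assumes that each fixation carries a single timestamp, that positions are already expressed in degrees of visual angle, that the median is taken per coordinate, and that both thresholds are inclusive. Only the ±1 s window and the 2 dva radius come from the caption.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Fixation:
    """Hypothetical fixation record (not from the paper)."""
    t: float      # fixation time in seconds, on the same clock as speech onset
    x_deg: float  # horizontal position in degrees of visual angle (assumed)
    y_deg: float  # vertical position in degrees of visual angle (assumed)


def filter_fixations(fixations, speech_onset, window_s=1.0, radius_deg=2.0):
    """Keep fixations near speech onset in time, then near their median in space."""
    # Temporal filter: fixations within +/- window_s of speech onset.
    near_onset = [f for f in fixations if abs(f.t - speech_onset) <= window_s]
    if not near_onset:
        return []

    # Assumption: the "median" is computed coordinate-wise.
    med_x = median(f.x_deg for f in near_onset)
    med_y = median(f.y_deg for f in near_onset)

    # Spatial filter: keep fixations within radius_deg of that median point.
    return [
        f for f in near_onset
        if math.hypot(f.x_deg - med_x, f.y_deg - med_y) <= radius_deg
    ]


if __name__ == "__main__":
    # Made-up fixations for illustration only.
    demo = [
        Fixation(0.2, 10.0, 10.0),  # too early
        Fixation(1.5, 5.0, 5.0),
        Fixation(1.9, 5.5, 5.2),
        Fixation(2.4, 6.0, 4.8),
        Fixation(2.8, 12.0, 9.0),   # in the time window but spatially distant
        Fixation(3.5, 5.0, 5.0),    # too late
    ]
    for f in filter_fixations(demo, speech_onset=2.0):
        print(f)
```

With the made-up data above, the first and last fixations fall outside the time window, and the fixation at (12.0, 9.0) is dropped because it lies more than 2 dva from the median of the remaining four.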