Voila-A: Aligning Vision-Language Models with User's Gaze Attention

Kun Yan; Lei Ji; Zeyu Wang; Yuntao Wang; Nan Duan; Shuai Ma

TL;DR

Voila-A addresses the misalignment between Vision-Language Model attention and human gaze in complex scenes by using gaze data from AR/VR devices, with trace data from localized narratives as a proxy. It introduces the VOILA-COCO dataset, generated through a GPT-4-driven annotation pipeline; the VOILA-GAZE test set, collected in real-life scenarios with a gaze-tracking device; and a gaze-informed Voila Perceiver Resampler that preserves pretrained knowledge. Results on VOILA-COCO and VOILA-GAZE show that Voila-A outperforms baseline models, with better grounding and helpfulness. This work enables more intuitive, gaze-guided human-AI interaction in real-world AR/VR scenarios and provides resources for further research.

Abstract

In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs). However, existing VLMs face challenges in handling real-world applications with complex scenes and multiple objects, as well as in aligning their focus with the diverse attention patterns of human users. In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications. First, we collect hundreds of minutes of gaze data to demonstrate that we can mimic human gaze modalities using localized narratives. We then design an automatic data annotation pipeline utilizing GPT-4 to generate the VOILA-COCO dataset. Additionally, we design the Voila Perceiver modules to integrate gaze information into VLMs while preserving their pretrained knowledge. We evaluate Voila-A using a hold-out validation set and a newly collected VOILA-GAZE test set, which features real-life scenarios captured with a gaze-tracking device. Our experimental results demonstrate that Voila-A significantly outperforms several baseline models. By aligning model attention with human gaze patterns, Voila-A paves the way for more intuitive, user-centric VLMs and fosters engaging human-AI interaction across a wide range of applications.

Paper Structure (35 sections, 6 equations, 13 figures, 5 tables)

Introduction
Bridging the Gap in Daily Life Usage of Current VLMs through Gaze Integration
Leveraging Trace Data as an Alternative Approach to Align VLMs with Gaze Attention
Method
Automatic Data Annotation For LN-COCO
VOILA-GAZE: Real-life gaze-QA pairs
Model Design
Training
Experiment
Evaluation metrics
GPT-4 RANKING
Reward Score
Main Results
VOILA Exhibits a Balanced Capability Between Helpfulness and Fact Grounding
Ablation studies
...and 20 more sections

Figures (13)

Figure 1: AR and VR scenarios usually involve complex scenes with multiple objects. Users may be interested in only one specific object, and gaze is the most natural way to interact with the device.
Figure 2: EMD between the mean heatmaps of 1k gaze and trace samples with varying sampling rates.
Figure 3: Automatic Data Annotation Pipeline
Figure 4: Overall Model Structure
Figure 5: GPT-RANKING ON VOILA-COCO-Testset
...and 8 more figures
