Tokenization of Gaze Data
Tim Rolff, Jurik Karimian, Niklas Hypki, Susanne Schmidt, Markus Lappe, Frank Steinicke
TL;DR
This work presents a structured evaluation of five gaze data tokenizers (Binary, μ-law, Quantile, k-Means, and VQ-VAE) across three egocentric gaze datasets, with the goal of making gaze data processable by large language models. The tokenizers are assessed on reconstruction and compression, and a GPT-2 model is fine-tuned per tokenizer to measure forecasting and generation. Quantile tokenization best supports gaze position forecasting, while k-Means excels for velocity predictions; VQ-VAE offers strong versatility given sufficient data. The authors provide a fast Rust/Python framework for tokenization and derive practical guidance: use quantile on small datasets and VQ-VAE when data scale permits, with velocity tasks favoring k-Means or VQ-VAE. These results let gaze data leverage pretrained LLMs, potentially enabling downstream applications in attention-aware systems, while underscoring the need for careful consideration of privacy and societal impacts.
Abstract
A considerable part of the performance of today's large language models (LLMs) and multimodal large language models (MLLMs) depends on their tokenization strategies. While tokenizers are extensively researched for textual and visual input, there is no research on tokenization strategies for gaze data due to its nature. However, a corresponding tokenization strategy would allow using the vision capabilities of pre-trained MLLMs for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs (cf. the teaser figure). We evaluate the tokenizers regarding their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy, measuring its generative and predictive performance. Overall, we found that a quantile tokenizer outperforms all others in predicting gaze positions, while k-means is best when predicting gaze velocities.
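
To make the idea of a quantile tokenizer concrete, the following is a minimal sketch of equal-frequency binning for a one-dimensional gaze signal. It is an illustration under stated assumptions, not the implementation from the authors' Rust/Python framework: the function names (`fit_quantile_edges`, `encode`, `decode`), the vocabulary size of 16, the synthetic Gaussian data, and the choice of bin midpoints for decoding are all hypothetical.

```python
import bisect
import random
import statistics


def fit_quantile_edges(values, vocab_size):
    """Return vocab_size - 1 cut points so that each bin holds roughly equal mass."""
    return statistics.quantiles(values, n=vocab_size, method="inclusive")


def encode(values, edges):
    """Map each continuous value to the index of the bin it falls into."""
    return [bisect.bisect_right(edges, v) for v in values]


def decode(tokens, edges, lo, hi):
    """Map each token back to the midpoint of its bin (assumed reconstruction rule)."""
    bounds = [lo] + list(edges) + [hi]
    return [0.5 * (bounds[t] + bounds[t + 1]) for t in tokens]


# Synthetic stand-in for one gaze coordinate; real gaze data would be loaded instead.
random.seed(0)
gaze_x = [random.gauss(0.5, 0.1) for _ in range(10_000)]

edges = fit_quantile_edges(gaze_x, vocab_size=16)
tokens = encode(gaze_x, edges)
recon = decode(tokens, edges, min(gaze_x), max(gaze_x))
```

Because the cut points follow the empirical distribution, densely populated regions of the signal receive narrower bins than sparse ones, which is the property that distinguishes this scheme from uniform binning. The resulting integer tokens could then be fed to a language model as a sequence.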
