Tokenization of Gaze Data

Tim Rolff, Jurik Karimian, Niklas Hypki, Susanne Schmidt, Markus Lappe, Frank Steinicke

TL;DR

This work introduces a structured evaluation of five gaze data tokenizers—Binary, μ-law, Quantile, k-Means, and VQ-VAE—across three egocentric gaze datasets to enable processing by large language models. By assessing reconstruction, compression, forecasting, and generation with GPT-2 fine-tuning, the study reveals that quantile tokenization best supports gaze position forecasting while k-Means excels for velocity predictions; VQ-VAE offers strong versatility given sufficient data. The authors provide a fast Rust/Python framework for tokenization and derive practical guidance: use quantile on small datasets and VQ-VAE when data scale permits, with velocity tasks favoring k-Means or VQ-VAE. These insights advance multimodal modeling by enabling gaze data to leverage pretrained LLMs, potentially unlocking downstream applications in attention-aware systems, while underscoring the need for careful consideration of privacy and societal impacts.

Abstract

A considerable part of the performance of today's large language models (LLMs) and multimodal large language models (MLLMs) depends on their tokenization strategies. While tokenizers are extensively researched for textual and visual input, there is no research on tokenization strategies for gaze data due to its nature. However, a corresponding tokenization strategy would allow using the vision capabilities of pre-trained MLLMs for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs. We evaluate the tokenizers regarding their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy, measuring its generative and predictive performance. Overall, we found that a quantile tokenizer outperforms all others in predicting gaze positions, while a k-means tokenizer performs best when predicting gaze velocities.
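
To make the idea of a quantile tokenizer concrete, the following is a minimal sketch and not the authors' Rust/Python implementation. It assumes a single gaze coordinate is tokenized at a time, that bin edges are the empirical quantiles of the training data, and that each token is decoded to the mean of the training values in its bin. All function names (`fit_quantile_edges`, `fit_bin_centers`, `encode`, `decode`) and the toy data are hypothetical.

```python
import bisect
import statistics


def fit_quantile_edges(values, vocab_size):
    """Return vocab_size - 1 cut points that split `values` into equally populated bins."""
    return statistics.quantiles(values, n=vocab_size, method="inclusive")


def encode(values, edges):
    """Map each value to the index of the bin it falls into (0 .. len(edges))."""
    return [bisect.bisect_right(edges, v) for v in values]


def fit_bin_centers(values, edges):
    """Use the mean of the training values in each bin as its reconstruction value."""
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for v, token in zip(values, encode(values, edges)):
        sums[token] += v
        counts[token] += 1
    # Assumption: an empty bin (possible with many tied values) falls back to a nearby edge.
    return [
        sums[i] / counts[i] if counts[i] else edges[min(i, len(edges) - 1)]
        for i in range(len(edges) + 1)
    ]


def decode(tokens, centers):
    """Replace each token by the reconstruction value of its bin."""
    return [centers[t] for t in tokens]


# Toy data standing in for one gaze coordinate: 0.00, 0.01, ..., 0.99.
train = [i / 100 for i in range(100)]
edges = fit_quantile_edges(train, vocab_size=4)
centers = fit_bin_centers(train, edges)
tokens = encode(train, edges)
reconstructed = decode(tokens, centers)
```

Because the edges follow the data distribution, every token is used about equally often on the training data, which is the property that distinguishes this scheme from uniform binning.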

Paper Structure

This paper contains 23 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Accumulative error over the sequence length on the Ego-Exo4D (left), DGaze (middle), and FixationNet (right) datasets.
  • Figure 2: Exemplary binary tokenization of the number $\pi$. Note that the number is not an accurate representation of $\pi$ due to IEEE floating-point inaccuracies.
  • Figure 3: Reconstruction error of the k-means tokenizer plotted against the utilized clusters. Note: the y-axis is logarithmically scaled.
  • Figure 4: The VQ-VAE for the gaze position tokenizer. We utilize a codebook size of 4, resulting in two codebook entries per gaze sample. To generate the tokens, we perform inference of the encoder and use the codebook indices as our tokens.
  • Figure 5: The VQ-VAE for the gaze velocity tokenizer. We utilize a codebook size of 2, resulting in two codebook entries per gaze sample. To generate the tokens, we perform inference of the encoder and use the codebook indices as our tokens.
  • ...and 1 more figure
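
The binary tokenization described in the caption of Figure 2 can be illustrated with a short sketch. It assumes each value is stored as a 32-bit IEEE 754 float and that every byte becomes one token (a vocabulary of 256); the paper's exact bit or byte grouping is not stated here, so this layout and the names `binary_encode` and `binary_decode` are assumptions for illustration only.

```python
import math
import struct


def binary_encode(value):
    """Pack a number as a big-endian float32 and return its 4 bytes as tokens."""
    return list(struct.pack(">f", value))


def binary_decode(tokens):
    """Rebuild the float32 value from its 4 byte tokens."""
    return struct.unpack(">f", bytes(tokens))[0]


tokens = binary_encode(math.pi)   # [64, 73, 15, 219], i.e. 0x40490FDB
restored = binary_decode(tokens)

# The round trip is exact at float32 precision, but float32 cannot represent
# pi exactly, so the restored value differs slightly from math.pi.
error = abs(restored - math.pi)
```

The tokens reconstruct the stored float32 value without loss; the small remaining error comes from the floating-point representation itself, which matches the note in the caption.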