Reading Recognition in the Wild

Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Carl Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim

TL;DR

This work defines reading recognition in the wild for wearable glasses and introduces the Reading in the Wild dataset, a large-scale multimodal resource capturing RGB, eye gaze, and head pose signals during diverse reading and non-reading activities. It proposes a lightweight multimodal transformer that can operate with any subset of modalities to detect reading in real time, achieving up to roughly 86.9% accuracy when all signals are used and demonstrating strong generalization capabilities. The dataset enables exploration of reading modes and media in unconstrained settings and shows potential for targeted OCR activation, reducing compute and bandwidth, with on-device deployment possible on current smart glasses. Overall, the paper provides a practical, privacy-aware pathway to contextually aware AI on wearables and opens avenues for broader reading understanding in the wild.

Abstract

To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.
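One way to picture a model that accepts the modalities "either individually or combined" is to turn each available modality into tokens and concatenate only the ones that are present before they reach the encoder. The sketch below illustrates just that input-assembly step in plain Python. The function name `build_token_sequence`, the modality ordering, and the use of a string tag in place of a learned modality embedding are all assumptions for illustration, not details confirmed by the source.

```python
from typing import Dict, List, Optional, Tuple

# Modality names come from the text; their order here is an assumption.
MODALITIES = ("gaze", "rgb", "imu")

# A token is a (modality tag, feature vector) pair. The tag stands in for a
# learned modality embedding, which is an assumption of this sketch.
Token = Tuple[str, List[float]]


def build_token_sequence(
    inputs: Dict[str, Optional[List[List[float]]]]
) -> List[Token]:
    """Concatenate tokens from whichever modalities are present.

    Missing or empty modalities are skipped, so the same function serves
    single-modality and multi-modality inputs.
    """
    sequence: List[Token] = []
    for name in MODALITIES:
        tokens = inputs.get(name)
        if not tokens:
            continue  # modality absent: contribute no tokens
        sequence.extend((name, list(tok)) for tok in tokens)
    if not sequence:
        raise ValueError("at least one modality is required")
    return sequence


# Hypothetical feature vectors, chosen only to show the shapes involved.
gaze_only = build_token_sequence({"gaze": [[0.1, 0.2], [0.3, 0.4]]})
all_three = build_token_sequence(
    {
        "gaze": [[0.1, 0.2], [0.3, 0.4]],
        "rgb": [[0.5, 0.6]],
        "imu": [[0.7, 0.8]],
    }
)
```

Because the sequence length simply grows or shrinks with the available inputs, a sequence model downstream would not need a separate architecture for each modality combination.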

Paper Structure

This paper contains 25 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Am I reading? The left figure shows a timeline as the user navigates the world. We aim to solve the task of reading recognition to enable AI assistants in always-on wearables. We identify three modalities, eye gaze (in colored dot patterns), RGB crop around gaze (in red box), and inertial sensors, which are used to perform the task to high accuracy (with Prediction and GT shown). Images are from our Reading in the Wild dataset, which features 100 hours of diverse reading and non-reading activities in real-world settings, with examples shown on the right.
  • Figure 2: Comparison to existing datasets. Our dataset is the first reading dataset that contains high-frequency eye-gaze, diverse and realistic egocentric videos, and hard negative (HN) samples.
  • Figure 3: Complementary modalities. Example success and failure cases for gaze and RGB, suggesting the benefit of multimodality.
  • Figure 4: Model architecture. Our model is a simple transformer encoder with any combination of gaze, RGB, and IMU as input.
  • Figure 5: Main results and visualizations. We show the results on Seattle (test set). (a) Our method performs the task with good accuracy, and combining all modalities yields the best results. Metrics are accuracy and F1 score at a 0.5 threshold, and precision at 0.9 recall. (b) We show: (i) Col. 1, banal success cases distinguishing reading from daily activities; (ii) Cols. 2-4, difficult cases where our combined model predicts correctly even when an individual modality fails, including reading from objects, short texts, non-texts, fixation patterns, and hard negatives; (iii) Col. 5, failure cases where all modalities fail, including reading while writing and browsing.
  • ...and 3 more figures
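
The Figure 5 caption names three metrics: accuracy and F1 score at a 0.5 threshold, and precision at 0.9 recall. As a minimal sketch of how such metrics can be computed from per-sample reading scores, the plain-Python example below uses hypothetical scores and labels (not data from the paper). It also assumes one common convention for "precision at 0.9 recall", namely the best precision among thresholds whose recall is at least 0.9; the source does not state which convention it uses.

```python
from typing import Sequence, Tuple


def confusion_counts(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN), predicting 'reading' when score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy_and_f1(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> Tuple[float, float]:
    """Accuracy and F1 score at a fixed decision threshold."""
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def precision_at_recall(
    scores: Sequence[float], labels: Sequence[int], target_recall: float = 0.9
) -> float:
    """Best precision over thresholds whose recall is >= target_recall.

    This convention is an assumption of the sketch, not taken from the source.
    """
    best = 0.0
    for threshold in sorted(set(scores)):
        tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
        if tp + fn == 0 or tp + fp == 0:
            continue
        recall = tp / (tp + fn)
        if recall >= target_recall:
            best = max(best, tp / (tp + fp))
    return best


# Hypothetical model scores and ground-truth labels (1 = reading).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

accuracy, f1 = accuracy_and_f1(scores, labels, threshold=0.5)
p_at_r90 = precision_at_recall(scores, labels, target_recall=0.9)
```

Precision at a high recall target is a natural fit for an always-on detector: it measures how many false activations remain once the threshold is low enough that few reading episodes are missed.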