Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker

Xinlong Hou; Sen Shen; Xueshen Li; Xinran Gao; Ziyi Huang; Steven J. Holiday; Matthew R. Cribbet; Susan W. White; Edward Sazonov; Yu Gan

TL;DR

This work addresses the need for objective, non-invasive measurement of children's screen exposure across devices in natural settings. It introduces a wearable Screen Time Tracker (STT) and a multi-view Vision Language Model (MV-VLM) that processes egocentric image streams: a CLIP-based view-selection module picks multiple views, the model generates scene descriptions, and keyword mapping identifies the screen type (TV, Smartphone, Computer). The model combines Swin Transformer visual embeddings, MiniLM text embeddings, alignment layers, and Llama2-7B for text generation; training focuses on the alignment layers. MV-VLM outperforms baselines, achieving 95.5% accuracy for detecting screen existence and strong screen-type discrimination, and an ablation study confirms the necessity of each component. The framework is validated on a free-living, child-centered dataset of 1,800 images from 30 children, demonstrating practical viability, comfort, and potential for integration with behavioral studies that link screen exposure to health outcomes.
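The keyword-mapping step mentioned above can be illustrated with a minimal sketch. The keyword lists and the function name `identify_screen_types` are hypothetical, chosen only for illustration; the summary does not give the paper's actual vocabulary or matching rules.

```python
import re

# Hypothetical keyword lists (assumption): the summary does not list
# the vocabulary the authors actually use.
SCREEN_KEYWORDS = {
    "TV": ("tv", "television"),
    "Smartphone": ("smartphone", "phone", "cellphone"),
    "Computer": ("computer", "laptop", "monitor", "desktop"),
}


def identify_screen_types(description):
    """Return the sorted screen types whose keywords occur in the description."""
    # Lowercase and split into alphabetic tokens so "TV." matches "tv".
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    return sorted(
        screen
        for screen, words in SCREEN_KEYWORDS.items()
        if tokens.intersection(words)
    )


if __name__ == "__main__":
    print(identify_screen_types("A child is watching television on the couch."))
```

An empty result means no screen keyword was found in the generated description. This sketch matches whole words only, so plural forms such as "phones" would need their own entries in the keyword lists.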

Abstract

Being able to accurately monitor the screen exposure of young children is important for research on phenomena linked to screen use such as childhood obesity, physical activity, and social interaction. Most existing studies rely upon self-report or manual measures from bulky wearable sensors, thus lacking efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel sensor informatics framework that utilizes egocentric images from a wearable sensor, termed the screen time tracker (STT), and a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image sequences and interprets screen exposure dynamically. We validated our approach by using a dataset of children's free-living activities, demonstrating significant improvement over existing plain vision language models and object detection models. Results supported the promise of this monitoring approach, which could optimize behavioral research on screen exposure in children's naturalistic settings.

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, 4 tables, and 1 algorithm.

Introduction
Related Work
Method
Wearable sensor and data collection
Framework
Conceptual Rationale for Multi-View Vision Language Model
View Selection
Vision Language Model
Screen Type Identification
Experiments and Results
Dataset
Training strategy for implementation
Multi-View selection
Text generation and screen type identification
Comparison with existing methods
...and 6 more sections

Figures (7)

Figure 1: a) The STT device (left panel) and a compilation of images showcasing various screen time exposure environments (right panel). The montage is created from free-living data collected with the STT. The STT device is lightweight and can be firmly attached to clothes. b) The architecture diagram of the proposed MV-VLM, which is designed to process the egocentric camera frames collected by the device in a).
Figure 2: Proposed pipeline for processing the egocentric image stream. There are four major components: view selection, vision language model, language model, and screen identification. The view selection module uses Contrastive Language-Image Pre-Training (CLIP) to extract embeddings and select multi-view images based on similarity. The vision language model learns from a vision transformer and MiniLM to generate textual descriptions of the multi-view images.
Figure 3: Details of dataset acquired in this research. a) The overall distribution of different screen types. b) The number of image groups acquired from each subject.
Figure 4: Representative multi-view images selected by CLIP embedding. Our selection maximizes the variation among consecutive frames while capturing complementary features of the screen object from different views. The features are visualized by t-SNE.
Figure 5: Examples of generated text and screen identification. The left panel shows typical multi-view images. The right panel shows the description generated by the language model, the screen identification results, and the annotation. Keywords can be processed efficiently to assign each description to a specific screen type.
...and 2 more figures
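
The similarity-based view selection described in the captions for Figures 2 and 4 can be sketched as a greedy search for mutually dissimilar frames. This is an illustrative assumption, not the authors' algorithm: the captions say only that views are selected from CLIP embeddings based on similarity and that the selection maximizes variation. The function names are hypothetical, and plain lists of floats stand in for real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_views(embeddings, k):
    """Greedily pick up to k frame indices whose embeddings are dissimilar.

    Assumption: selection starts from the first frame, and each new view is
    the frame whose highest similarity to the chosen views is lowest.
    """
    if k <= 0 or not embeddings:
        return []
    selected = [0]
    while len(selected) < min(k, len(embeddings)):
        best_idx, best_score = None, None
        for i, emb in enumerate(embeddings):
            if i in selected:
                continue
            # Similarity to the closest view that is already selected.
            score = max(cosine_similarity(emb, embeddings[j]) for j in selected)
            if best_score is None or score < best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return sorted(selected)


if __name__ == "__main__":
    # Toy 2-D vectors in place of CLIP image embeddings.
    frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(select_views(frames, 3))
```

In this toy input, the second frame is nearly identical to the first and is skipped in favor of frames that add new visual information. A real pipeline would replace the toy vectors with embeddings from a CLIP image encoder.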
