WorldScribe: Towards Context-Aware Live Visual Descriptions

Ruei-Che Chang, Yuxuan Liu, Anhong Guo

TL;DR

WorldScribe tackles the long-standing challenge of providing rich, context-aware live visual descriptions for blind and visually impaired users by integrating an intent-driven, context-adaptive pipeline that leverages multiple vision-language models and an audio-aware presentation layer. The system decomposes user intent, extracts keyframes, generates descriptions through a tiered VLM stack, and prioritizes outputs based on semantic relevance and proximity, while dynamically adjusting to sound context. A formative study informs design decisions, and a user study with six participants demonstrates feasibility, adaptive usefulness, and gaps around accuracy, realism, and practical navigation. Pipeline evaluations quantify accuracy, coverage, and prioritization, underscoring WorldScribe's potential for real-world accessibility while highlighting the need for long-term memory, humanized presentation, and dedicated benchmarking datasets. The work lays a foundation for scalable, context-aware live descriptions and suggests directions for integrating wearables, improved evaluation metrics, and future large-model capabilities to further close the accessibility gap.

Abstract

Automated live visual descriptions can aid blind people in understanding their surroundings with autonomy and independence. However, providing descriptions that are rich, contextual, and just-in-time has been a long-standing challenge in accessibility. In this work, we develop WorldScribe, a system that generates automated live real-world visual descriptions that are customizable and adaptive to users' contexts: (i) WorldScribe's descriptions are tailored to users' intents and prioritized based on semantic relevance. (ii) WorldScribe is adaptive to visual contexts, e.g., providing consecutively succinct descriptions for dynamic scenes, while presenting longer and detailed ones for stable settings. (iii) WorldScribe is adaptive to sound contexts, e.g., increasing volume in noisy environments, or pausing when conversations start. Powered by a suite of vision, language, and sound recognition models, WorldScribe introduces a description generation pipeline that balances the tradeoffs between their richness and latency to support real-time use. The design of WorldScribe is informed by prior work on providing visual descriptions and a formative study with blind participants. Our user study and subsequent pipeline evaluation show that WorldScribe can provide real-time and fairly accurate visual descriptions to facilitate environment understanding that is adaptive and customized to users' contexts. Finally, we discuss the implications and further steps toward making live visual descriptions more context-aware and humanized.
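The two adaptation behaviors named in the abstract (shorter descriptions for dynamic scenes, and louder or paused playback depending on sound) can be illustrated with a small rule-based sketch. This is only an illustration under stated assumptions, not WorldScribe's implementation: the `Context` fields, the `scene_change` score, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verbosity(Enum):
    SUCCINCT = "succinct"   # short labels for fast-changing scenes
    DETAILED = "detailed"   # longer descriptions for stable scenes


class Playback(Enum):
    NORMAL = "normal"
    LOUDER = "louder"
    PAUSED = "paused"


@dataclass
class Context:
    # All fields are hypothetical stand-ins for signals the system would sense.
    scene_change: float          # assumed 0..1 score; higher means a more dynamic scene
    noise_level_db: float        # assumed ambient loudness estimate
    conversation_detected: bool  # assumed output of a sound recognition model


def choose_verbosity(ctx: Context, dynamic_threshold: float = 0.5) -> Verbosity:
    """Pick succinct output for dynamic scenes, detailed output for stable ones."""
    if ctx.scene_change >= dynamic_threshold:
        return Verbosity.SUCCINCT
    return Verbosity.DETAILED


def choose_playback(ctx: Context, noisy_db: float = 70.0) -> Playback:
    """Pause when a conversation starts; otherwise raise volume in noisy settings."""
    if ctx.conversation_detected:
        return Playback.PAUSED
    if ctx.noise_level_db >= noisy_db:
        return Playback.LOUDER
    return Playback.NORMAL


if __name__ == "__main__":
    walking = Context(scene_change=0.8, noise_level_db=75.0, conversation_detected=False)
    print(choose_verbosity(walking), choose_playback(walking))
```

Checking for a conversation before checking noise level is a design choice of this sketch: a conversation usually raises the ambient level too, and pausing seems the safer response when both conditions hold.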

Paper Structure (39 sections, 11 figures, 2 tables)

Figures (11)

  • Figure 1: (a) Sarah is exploring the zoo with her toddler using WorldScribe, (b) which describes surroundings to her. (c, d) They join a giraffe feeding tour, and live visual descriptions narrate when giraffes reach out near her toddler, who feeds lettuce leaves to them and (e, f) snaps a nice photo.
  • Figure 2: WorldScribe system architecture. (a) The user first specifies their intent through speech and WorldScribe decomposes it into specific visual attributes and relevant objects. (b) WorldScribe extracts keyframes based on user orientation, object compositions, and frame similarity. (c) Next, it generates candidate descriptions with a suite of visual and language models. (d) WorldScribe then prioritizes the descriptions based on the user’s intent, proximity to the user, and relevance to the current visual context. (e) Finally, it detects environmental sounds and manipulates the presentation of the descriptions accordingly.
  • Figure 3: (a) Brook is looking for a silver laptop using WorldScribe in the lab by first (b) specifying his intent. (c) As he moves quickly, WorldScribe reads out names of fixtures, and (d) pauses or increases its volume based on environmental sounds. When approaching his seat and Brook stops to scan, (e) WorldScribe provides verbose descriptions when the visual scene is relevant to his intent, (f) allowing him to follow the cues and find the laptop.
  • Figure 4: (a) Brook takes a break on the balcony and uses WorldScribe to explore his surroundings. (b) Through the live visual descriptions, he knows the sky is sunny, (c) plants are growing, and also notices (d) his friends are here. (e) He then joins them and has a delightful tea time. (f) WorldScribe facilitates understanding of and access to his surroundings, and makes his day.
  • Figure 5: WorldScribe user interface. (a) The user can specify their intent and needs regarding visual attributes or audio presentation through speech input. (b) Besides speech, they can manually select options for richness and other visual attributes. (c) They can also configure pauses or increase the volume of descriptions if certain sound events are detected.
  • ...and 6 more figures
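
The prioritization step in Figure 2(d) ranks candidate descriptions by the user's intent, proximity to the user, and relevance to the current visual context. A minimal sketch of such a ranking follows. The caption does not say how the three factors are combined, so the weighted sum, the weights, the `Candidate` fields, and the 10 m distance cap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    # Hypothetical fields; the scores are assumed to be precomputed by other models.
    text: str
    intent_relevance: float   # assumed 0..1 match with the user's stated intent
    distance_m: float         # assumed estimated distance from the user, in meters
    context_relevance: float  # assumed 0..1 match with the current visual scene


def priority(c: Candidate,
             w_intent: float = 0.5,
             w_proximity: float = 0.3,
             w_context: float = 0.2,
             max_distance_m: float = 10.0) -> float:
    """Weighted sum of the three factors; nearer objects score higher."""
    clamped = min(max(c.distance_m, 0.0), max_distance_m)
    proximity = 1.0 - clamped / max_distance_m
    return (w_intent * c.intent_relevance
            + w_proximity * proximity
            + w_context * c.context_relevance)


def prioritize(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from highest to lowest priority."""
    return sorted(candidates, key=priority, reverse=True)


if __name__ == "__main__":
    # Example loosely modeled on the laptop search scenario in Figure 3.
    ranked = prioritize([
        Candidate("a whiteboard on the far wall", 0.1, 6.0, 0.5),
        Candidate("a silver laptop on the desk", 0.9, 1.0, 0.8),
        Candidate("an office chair beside you", 0.2, 0.5, 0.6),
    ])
    for c in ranked:
        print(f"{priority(c):.3f}  {c.text}")
```

With these assumed weights, the intent-matching laptop ranks first, the nearby chair second, and the distant whiteboard last.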