TouchScribe: Augmenting Non-Visual Hand-Object Interactions with Automated Live Visual Descriptions

Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Tiange Luo, Venkatesh Potluri, Anhong Guo

TL;DR

This work addresses the challenge that blind and low-vision users face in accessing rich visual features through touch alone. It introduces TouchScribe, a system that uses egocentric hand gestures as information cursors to proactively generate live visual descriptions via a dual VLM pipeline (Moondream and GPT-4o) combined with a lightweight gesture-recognition front-end. Through a lab study with eight BLV participants and a technical evaluation, the authors demonstrate reasonably high description accuracy ($\approx 91$–$94\%$) and acceptable end-to-end latencies (mean hand-state $\approx 0.56$ s; longer text reads $\approx 10$–$14$ s), while highlighting challenges from wide FoV distortion, gesture misclassifications, and cognitive load. The results offer design implications for real-time accessibility tools, including gesture customization, low-latency perception, camera configurations, and potential multi-sensor fusion to broaden applicability beyond physical reach.

Abstract

People who are blind or have low vision regularly use their hands to interact with the physical world to gain access to objects' shape, size, weight, and texture. However, many rich visual features remain inaccessible through touch alone, making it difficult to distinguish similar objects, interpret visual affordances, and form a complete understanding of objects. In this work, we present TouchScribe, a system that augments hand-object interactions with automated live visual descriptions. We trained a custom egocentric hand interaction model to recognize both common gestures (e.g., grab to inspect, hold side-by-side to compare) and unique ones by blind people (e.g., point to explore color, or swipe to read available texts). Furthermore, TouchScribe provides real-time and adaptive feedback based on hand movement, from hand interaction states, to object labels, and to visual details. Our user study and technical evaluations demonstrate that TouchScribe can provide rich and useful descriptions to support object understanding. Finally, we discuss the implications of making live visual descriptions responsive to users' physical reach.

Paper Structure (45 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Overview of the variety of gestures, timings to extract keyframes, and description types supported by TouchScribe.
  • Figure 2: TouchScribe System Diagram. (a) TouchScribe performs gesture recognition on live video streams. For each camera frame, hand landmarks are extracted with Google MediaPipe and classified into predefined gesture categories. A temporal smoothing module then aggregates multiple frames to produce stable keyframes and gesture states. (b) For each keyframe, Hands23 infers object contact. The contact data, together with a cropped image of the object, is passed to VLMs for further processing. (c) VLMs, including Moondream and GPT-4o, are executed in parallel to generate rich object descriptions. (d) When one stable state is hold and the other is point, TouchScribe reads the color of the small region the finger is pointing to. (e) When one stable state is hold and the other is touch, TouchScribe tracks finger motion and reads the text once both fingers move up. (f) TouchScribe also maintains a history of cropped objects, identifies flipped instances by comparing image similarity, and re-runs the generation pipeline on the updated crop.
  • Figure 3: The TouchScribe prototype setup included an adjustable neck mount with an attached smartphone. During the study, researchers adjusted the mount for each participant to ensure the camera was properly aimed at the table.
  • Figure 4: Likert scale questions and aggregated responses of eight participants in our user study. This includes questions about coverage (M=6.5, SD=1.07), effectiveness (M=6, SD=0.76), intuitiveness of gestures (M=5.63, SD=1.41), accuracy of descriptions (M=5.5, SD=1.6), usefulness (M=5.5, SD=1.69), and agency of using TouchScribe (M=5.13, SD=2.23).
  • Figure 5: NASA-TLX responses from the user study. Higher scores on the Performance dimension indicate better outcomes, whereas lower scores on the remaining dimensions reflect better outcomes.
  • ...and 2 more figures
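
The Figure 2 caption describes per-frame gesture labels being aggregated into stable states, and pairs of stable states (hold + point, hold + touch) selecting what gets read out. A minimal Python sketch of that idea follows. It is not the authors' implementation: the majority-vote rule, the window size, the `min_share` threshold, the class and function names, the action strings, and the fallback behavior are all assumptions made for illustration.

```python
from collections import Counter, deque
from typing import Deque, Optional


class GestureSmoother:
    """Sliding-window majority vote over per-frame gesture labels.

    Hypothetical stand-in for the temporal smoothing module; the paper
    summary does not specify the actual aggregation rule.
    """

    def __init__(self, window: int = 5, min_share: float = 0.8) -> None:
        self.labels: Deque[str] = deque(maxlen=window)
        self.min_share = min_share

    def update(self, label: str) -> Optional[str]:
        """Add one per-frame label; return a stable state, or None if unsettled."""
        self.labels.append(label)
        if len(self.labels) < self.labels.maxlen:
            return None  # not enough frames observed yet
        top, count = Counter(self.labels).most_common(1)[0]
        return top if count / len(self.labels) >= self.min_share else None


def choose_action(left: Optional[str], right: Optional[str]) -> str:
    """Map the two hands' stable states to an action (action names are assumed)."""
    if left is None or right is None:
        return "wait"  # assumed: do nothing until both hands are stable
    states = {left, right}
    if states == {"hold", "point"}:
        return "read_color"  # caption (d): one hand holds, the other points
    if states == {"hold", "touch"}:
        return "read_text"  # caption (e): one hand holds, the other touches
    return "describe_object"  # assumed fallback for any other combination
```

With a five-frame window and a 0.8 threshold, a single misclassified frame among four consistent ones does not change the reported state, which is the kind of stability the caption attributes to the smoothing step. A larger window trades responsiveness for robustness.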