TouchScribe: Augmenting Non-Visual Hand-Object Interactions with Automated Live Visual Descriptions
Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Tiange Luo, Venkatesh Potluri, Anhong Guo
TL;DR
This work addresses the difficulty that blind and low-vision (BLV) users face in accessing rich visual features through touch alone. It introduces TouchScribe, a system that treats egocentric hand gestures as information cursors and proactively generates live visual descriptions through a dual VLM pipeline (Moondream and GPT-4o) combined with a lightweight gesture-recognition front-end. In a lab study with eight BLV participants and a technical evaluation, the authors report description accuracy of about 91–94% and end-to-end latencies averaging about 0.56 s for hand-state feedback and about 10–14 s for longer text reads, while highlighting challenges from wide field-of-view (FoV) distortion, gesture misclassification, and cognitive load. The results offer design implications for real-time accessibility tools, including gesture customization, low-latency perception, camera configurations, and potential multi-sensor fusion to broaden applicability beyond physical reach.
Abstract
People who are blind or have low vision regularly use their hands to interact with the physical world and to perceive objects' shape, size, weight, and texture. However, many rich visual features remain inaccessible through touch alone, making it difficult to distinguish similar objects, interpret visual affordances, and form a complete understanding of objects. In this work, we present TouchScribe, a system that augments hand-object interactions with automated live visual descriptions. We trained a custom egocentric hand interaction model to recognize both common gestures (e.g., grab to inspect, hold side-by-side to compare) and gestures specific to blind people (e.g., point to explore color, or swipe to read available text). Furthermore, TouchScribe provides real-time, adaptive feedback based on hand movement, ranging from hand interaction states to object labels to visual details. Our user study and technical evaluations demonstrate that TouchScribe can provide rich and useful descriptions to support object understanding. Finally, we discuss the implications of making live visual descriptions responsive to users' physical reach.
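As a rough illustration of the gesture-driven design described above, the following minimal sketch maps each of the four example gestures to a description request. It is not the authors' implementation: the `Gesture` enum, the `PROMPTS` table, the prompt wording, and the `select_prompt` function are all hypothetical names introduced here, and the recognizer and VLM calls are left out entirely.

```python
from enum import Enum
from typing import Optional


class Gesture(Enum):
    """The four example gestures named in the abstract (enum names are hypothetical)."""
    GRAB = "grab"                            # grab to inspect
    HOLD_SIDE_BY_SIDE = "hold_side_by_side"  # hold side-by-side to compare
    POINT = "point"                          # point to explore color
    SWIPE = "swipe"                          # swipe to read available text


# Hypothetical prompt templates; the wording is illustrative, not taken from the paper.
PROMPTS = {
    Gesture.GRAB: "Describe the object held in the hand.",
    Gesture.HOLD_SIDE_BY_SIDE: "Compare the two objects held side by side.",
    Gesture.POINT: "Describe the color at the fingertip.",
    Gesture.SWIPE: "Read the text along the finger's path.",
}


def select_prompt(gesture: Optional[Gesture]) -> Optional[str]:
    """Return the description request for a recognized gesture.

    Returns None when no gesture was recognized, so the caller can stay silent.
    """
    if gesture is None:
        return None
    return PROMPTS[gesture]
```

In a sketch like this, the returned prompt would be paired with the current camera frame and passed to a vision-language model; how TouchScribe actually divides work between Moondream and GPT-4o is not specified in this section.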
