TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model

Wiktor Mucha, Florin Cuconasu, Naome A. Etori, Valia Kalokyri, Giovanni Trappolini

TL;DR

This work tackles reading and information access for visually impaired individuals by integrating egocentric vision from Aria smart glasses with OCR and a Large Language Model (GPT-4). The system localizes text in the user’s field of view using DETIC, converts it into a digital, structured format via OCR, and delivers personalized, context-aware answers through a Retrieval-Augmented Generation pipeline that leverages user data. The approach is demonstrated on multilingual restaurant menus, achieving 96.77% text retrieval accuracy and high user satisfaction (average 4.87/5) across four languages, illustrating robust, real-world applicability. Overall, the study demonstrates the viability of combining wearable egocentric vision with LLM-based reasoning to enhance accessibility and independence in daily information tasks for people with special needs.

Abstract

The ability to read, understand and find important information in written text is a critical skill in our daily lives, underpinning our independence, comfort and safety. However, a significant part of our society is affected by partial vision impairment, which leads to discomfort and dependency in daily activities. To address the limitations this group faces, we propose an intelligent reading assistant based on smart glasses with embedded RGB cameras and a Large Language Model (LLM), whose functionality goes beyond corrective lenses. The video recorded from the egocentric perspective of a person wearing the glasses is processed to localise text information using object detection and optical character recognition methods. The LLM processes the data, allows the user to interact with the text, and responds to a given query, thus extending the functionality of corrective lenses with the ability to find and summarize knowledge from the text. To evaluate our method, we create a chat-based application that allows the user to interact with the system. The evaluation is conducted in a real-world setting, reading menus in a restaurant, and involves four participants. The results show robust accuracy in text retrieval. The system not only provides accurate meal suggestions but also achieves high user satisfaction, highlighting the potential of smart glasses and LLMs in assisting people with special needs.

Paper Structure (9 sections, 2 figures)

Figures (2)

  • Figure 1: Overview of the proposed method for the specific task of reading a menu card. First, the action of reading a menu card is recorded using Aria smart glasses. The video is then processed to select a keyframe, the text is extracted using EasyOCR, and a digital menu is retrieved using the GPT-4 model. Finally, the digital menu is combined with personalised user data through the RAG model, resulting in a personalised meal suggestion.
  • Figure 2: On the left, frames of a video recorded with Aria. From this sequence of frames, the one at the most central position in the sequence is selected. On the right, a user wearing the Aria device interacts with a menu card.
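
The frame selection described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that "most centred position in the sequence" means the temporal middle of the recorded frames; the paper may use a different criterion. The function name `select_keyframe` is hypothetical and not taken from the authors' code.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def select_keyframe(frames: Sequence[T]) -> T:
    """Return the frame at the temporal middle of the sequence.

    Assumption: "most centred" is interpreted as the middle index.
    For an even number of frames, the earlier of the two middle
    frames is returned.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    middle_index = (len(frames) - 1) // 2
    return frames[middle_index]


# Frames are represented by labels here; in practice they would be images.
keyframe = select_keyframe(["f0", "f1", "f2", "f3", "f4"])
```

Under this interpretation, the selected frame would then be passed on to text localisation and OCR.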