SeeSay: An Assistive Device for the Visually Impaired Using Retrieval Augmented Generation

Melody Yu

TL;DR

SeeSay addresses the need for richer, real-time environmental information for visually impaired users by combining retrieval-augmented generation (RAG) with LLMs that reason over both current observations and stored visual history. The system comprises a glasses-attached module and a Raspberry Pi processing unit. Automatic speech recognition (ASR), question answering (QA), and text-to-speech (TTS) run locally, while image description is offloaded to cloud LLMs, forming a hybrid RAG pipeline with continuous scene capture and memory prompts. The system is effective for simple descriptive tasks and object localization, but navigation and handwriting OCR lag because of latency from cloud processing and multi-turn reasoning, which highlights the trade-off between on-device privacy and cost on one side and cloud-powered capabilities on the other. The work demonstrates the feasibility of memory-augmented assistive devices and suggests future hardware upgrades to enable on-device LLMs for faster, private operation.

Abstract

In this paper, we present SeeSay, an assistive device designed for individuals with visual impairments. This system leverages large language models (LLMs) for speech recognition and visual querying. It effectively identifies, records, and responds to the user's environment by providing audio guidance using retrieval-augmented generation (RAG). Our experiments demonstrate the system's capability to recognize its surroundings and respond to queries with audio feedback in diverse settings. We hope that the SeeSay system will facilitate users' comprehension and recollection of their surroundings, thereby enhancing their environmental perception, improving navigational capabilities, and boosting overall independence.
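
The record-then-retrieve loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SceneMemory`, `retrieve`, and `build_prompt` are hypothetical, and plain word overlap stands in for whatever retrieval method SeeSay actually uses. The sketch only shows how stored scene descriptions might be selected and combined with the current view into a prompt for an LLM.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for a real text encoder."""
    return set(re.findall(r"[a-z]+", text.lower()))


@dataclass
class SceneMemory:
    """Hypothetical store of timestamped scene descriptions."""
    entries: list = field(default_factory=list)  # (timestamp, description)

    def record(self, timestamp: float, description: str) -> None:
        # In a device like SeeSay the description would come from an
        # image-description model; here it is passed in directly.
        self.entries.append((timestamp, description))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Score each entry by word overlap with the query (assumption:
        # the real system may use a different retrieval method).
        q = _tokens(query)
        scored = [(len(q & _tokens(d)), t, d) for t, d in self.entries]
        # Keep entries with some overlap; best overlap first, then most recent.
        hits = sorted(
            (s for s in scored if s[0] > 0),
            key=lambda s: (s[0], s[1]),
            reverse=True,
        )
        return [(t, d) for _, t, d in hits[:k]]


def build_prompt(query: str, current: str, memories: list) -> str:
    """Combine the current view, retrieved memories, and the question."""
    lines = ["Current view: " + current, "Earlier observations:"]
    lines += [f"- [t={t:.0f}s] {d}" for t, d in memories] or ["- (none)"]
    lines.append("Question: " + query)
    return "\n".join(lines)


memory = SceneMemory()
memory.record(10, "a set of keys on the kitchen table")
memory.record(20, "a red mug next to the laptop")

question = "where are my keys"
prompt = build_prompt(question, "a hallway with a closed door", memory.retrieve(question))
print(prompt)
```

In this sketch the prompt would then be sent to a language model, and the answer would be read back to the user through text-to-speech; both steps are omitted because the source does not specify the models or APIs involved.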

Paper Structure (5 sections, 1 figure, 1 table)

This paper contains 5 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Architecture of the SeeSay platform using LLM-based Retrieval-Augmented Generation