SweeperBot: Making 3D Browsing Accessible through View Analysis and Visual Question Answering

Chen Chen, Cuong Nguyen, Alexa Siu, Dingzeyu Li, Nadir Weibel

TL;DR

Blind and low-vision (BLV) users struggle to access and compare 3D content, especially in online shopping and design contexts. SweeperBot couples a three-stage visual question answering (VQA) pipeline with a screen reader (SR)-accessible editable table. The pipeline answers visual questions from multiple sampled views, using CLIP to find relevant views, Grounding DINO to recognize objects, and LLM/MLLM-based reasoning to produce the final answers. The work is validated through an expert review with BLV users (n=10) and a survey study with sighted participants (n=30) on the quality of the generated descriptions, demonstrating the feasibility of supporting accessible 3D browsing and comparison. The findings suggest applicability to e-commerce, education, and GenAI-driven 3D workflows, with potential extensions to larger scenes and real-world deployments.

Abstract

Accessing 3D models remains challenging for Screen Reader (SR) users. While some existing 3D viewers allow creators to provide alternative text, that text often lacks sufficient detail about the 3D models. Grounded in a formative study, this paper introduces SweeperBot, a system that enables SR users to leverage visual question answering to explore and compare 3D models. SweeperBot answers SR users' visual questions by combining an optimal view selection technique with the strengths of generative- and recognition-based foundation models. An expert review with 10 Blind and Low-Vision (BLV) users with SR experience demonstrated the feasibility of using SweeperBot to assist BLV users in exploring and comparing 3D models. The quality of the descriptions generated by SweeperBot was validated by a second survey study with 30 sighted participants.

Paper Structure

This paper contains 28 sections, 3 equations, 16 figures.

Figures (16)

  • Figure 1: (a) SweeperBot generates descriptions based on Blind and Low Vision (BLV) users' visual questions (top left). The description created by SweeperBot (pink box, right) answers the visual question more accurately than the baselines (blue boxes, right), which use the canonical view chosen by the creator. (b - c) The generated table helps BLV users navigate the generated descriptions with existing Screen Readers (SRs). (d) SweeperBot's interface with an SR-accessible editable table.
  • Figure 2: Pipeline for view sampling and selection. (a) $42$ views are first sampled by moving the viewing camera, where $\bm{s}$ denotes the similarity score; the VQA pipeline then (c) extracts the key entities from the visual question, (b) searches for CLIP-relevant views, and (d) removes semantically repetitive views; (e) the final selected object-relevant views, where $\bm{o}$ denotes the object score.
  • Figure 3: Examples of using the flatness score to measure CLIP relevancy: (a) examples of how the flatness score can be approximated; (b) the flatness and (c) similarity scores at sampled rotational angles along the latitudinal ($\alpha$) and longitudinal ($\beta$) axes, given $r = 0.5d + 0.2$; (d - e) examples in which flatness is used to improve the reliability of the CLIP relevancy approximation.
  • Figure 4: Demonstration of compositional visual reasoning for answer generation in Stage 3. (a) Python code generated by an LLM for compositional visual reasoning; (b - e) recognition results for the "display" using Grounding DINO (Liu et al., 2023), obtained by evaluating the selected views from Figure 2e; (f) answers synthesized from the selected views.
  • Figure 5: Responses to the usability questions in Study 1. Questions were assessed on a 7-point Likert scale.
  • ...and 11 more figures
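
The Figure 2 caption describes scoring sampled views by similarity to the question and then removing semantically repetitive views. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each view and the question are already embedded as vectors (for example by CLIP). The function names, the greedy de-duplication strategy, and the `top_k` and `dup_threshold` parameters are all hypothetical, and the flatness and object scores from Figures 2 and 3 are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_views(view_embs, question_emb, top_k=4, dup_threshold=0.95):
    """Return indices of up to top_k relevant, mutually distinct views.

    Hypothetical helper: ranks views by similarity to the question, then
    greedily skips any view that is nearly identical to one already kept.
    """
    # Rank views from most to least similar to the question embedding.
    scored = sorted(
        ((cosine(emb, question_emb), idx) for idx, emb in enumerate(view_embs)),
        reverse=True,
    )
    kept = []
    for _score, idx in scored:
        # Keep the view only if it is not a near-duplicate of a kept view.
        if all(cosine(view_embs[idx], view_embs[j]) < dup_threshold for j in kept):
            kept.append(idx)
        if len(kept) == top_k:
            break
    return kept


# Toy 2-D embeddings standing in for real image features (illustrative only).
views = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
question = [1.0, 0.0]
selected = select_views(views, question, top_k=2)
```

In this toy example, view 1 is nearly identical to view 0 and is skipped, so the selection moves on to the next most relevant distinct view. A real system would replace the toy vectors with image and text features from a vision-language model.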