BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere

TL;DR

Embodied Memory Visual Reasoning (EMVR) formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines.

Abstract

Deploying embodied agents that can answer questions about their surroundings in real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes, with 47.93 images per scene on average. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric, Image Citation Relevance, to evaluate a model's ability to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.
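
To make the scene-graph formulation concrete, the following is a minimal sketch, not the released EMVR implementation. It assumes a graph whose nodes hold one image each and whose edges carry a relation label, plus a two-action space (`move`, `retrieve`). The names `SceneGraph`, `AgentState`, `step`, and the relation label `right_of` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    # node id -> image path (one image per node; an assumption of this sketch)
    images: Dict[str, str] = field(default_factory=dict)
    # node id -> {relation label -> neighbouring node id}
    edges: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges.setdefault(src, {})[relation] = dst


@dataclass
class AgentState:
    current: str                                         # node the agent is at
    retrieved: List[str] = field(default_factory=list)   # images pulled into context, in order


def step(graph: SceneGraph, state: AgentState, action: Tuple[str, ...]) -> AgentState:
    """Deterministic transition for a hypothetical action space:
    ("move", relation) follows an edge; ("retrieve",) adds the current image."""
    kind = action[0]
    if kind == "move":
        nxt = graph.edges.get(state.current, {}).get(action[1])
        if nxt is None:
            # An invalid move leaves the state unchanged.
            return AgentState(state.current, list(state.retrieved))
        return AgentState(nxt, list(state.retrieved))
    if kind == "retrieve":
        img = graph.images[state.current]
        retrieved = list(state.retrieved)
        if img not in retrieved:
            retrieved.append(img)
        return AgentState(state.current, retrieved)
    raise ValueError(f"unknown action: {kind}")


# Usage: retrieve the image at node "a", move along "right_of", retrieve again.
graph = SceneGraph(images={"a": "a.jpg", "b": "b.jpg"})
graph.add_edge("a", "right_of", "b")
state = AgentState(current="a")
for act in [("retrieve",), ("move", "right_of"), ("retrieve",)]:
    state = step(graph, state, act)
```

In a full system a vision-language model would pick each action from the current state; here the action sequence is hard-coded so that the transition logic can be read in isolation.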

Paper Structure

This paper contains 23 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: BridgeEQA: Open-Vocabulary Embodied Question Answering for bridge inspection. Two example scenes from our benchmark showing questions that require synthesizing visual evidence across multiple egocentric images to assess bridges.
  • Figure 2: Illustration of how EMVR mitigates the "lost in the middle" problem. By navigating the scene graph and dynamically selecting relevant images, EMVR repositions critical visual evidence at the end of a VLM's context window, reducing mid-sequence information loss.
  • Figure 3: Scene graph structure for bridge inspection. Nodes represent viewpoints with associated images, edges encode spatial and semantic relationships. VLM-generated labels shown in blue, edge relationships in gray.
  • Figure 4: Overview of Embodied Memory Visual Reasoning. An agent operates in an environment with a scene graph serving as an allocentric map. The agent navigates via an MDP, retrieving images dynamically to bring only relevant information into context.
  • Figure 5: Example Image Citation Relevance scores for varying image citation sets. Multiple citation sets can provide equally valid supporting evidence.
  • ...and 6 more figures
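
The repositioning idea in the Figure 2 caption can be illustrated with a short sketch. It assumes the context is an ordered list of image identifiers and that the agent has already chosen which ones matter; the function name `order_context` is hypothetical, and the paper's actual context construction may differ (for example, it may drop unselected images entirely).

```python
from typing import List


def order_context(all_images: List[str], selected: List[str]) -> List[str]:
    """Place the selected images at the end of the context, in selection order.

    Unselected images keep their original relative order at the front.
    Selected identifiers that are not in all_images are ignored.
    """
    available = set(all_images)
    chosen = set(selected)
    front = [img for img in all_images if img not in chosen]
    tail = [img for img in selected if img in available]
    return front + tail


context = order_context(
    ["img1", "img2", "img3", "img4", "img5"],
    ["img3", "img2"],
)
```

With these inputs the two selected images end up last, so they sit closest to the question appended after the context rather than in the middle of the sequence.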