IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Yifan Li, Yuhang Chen, Anh Dao, Lichi Li, Zhongyi Cai, Zhen Tan, Tianlong Chen, Yu Kong

TL;DR

IndustryEQA introduces the first industrial embodied question answering benchmark, targeting safety-critical warehouse scenarios with high-fidelity Isaac Sim simulations. It provides episodic memory videos and 1,344 QA pairs across six categories (two safety, four perception) and supports extra reasoning annotations, all evaluated via an open-vocabulary, LLM-based scoring framework. The study reveals that while visual grounding substantially improves performance, complex reasoning, especially under safety constraints, remains challenging, and architectural choices significantly influence results. This benchmark aims to drive the development of more robust, safety-aware embodied agents for real-world industrial environments and points to future extensions, including multi-modal signals and active learning paradigms.

Abstract

Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This gap limits the evaluation of agent readiness for real-world industrial applications. To bridge it, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides additional reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouses and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments. The benchmark and code are available.
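
As a rough illustration of the composition described in the abstract, the sketch below models one QA record and checks that the per-warehouse counts add up to the 1,344 pairs quoted in the TL;DR. The six category names and the two counts come from the abstract; the class name `QAPair`, its field names, and the helper functions are hypothetical and do not reflect the benchmark's actual file format.

```python
from dataclasses import dataclass
from typing import Optional

# Category names as listed in the abstract, grouped as in the TL;DR.
SAFETY = ("equipment safety", "human safety")
PERCEPTION = (
    "object recognition",
    "attribute recognition",
    "temporal understanding",
    "spatial understanding",
)
CATEGORIES = SAFETY + PERCEPTION

# QA-pair counts per warehouse size, as stated in the abstract.
PAIRS_PER_WAREHOUSE = {"small": 971, "large": 373}


@dataclass
class QAPair:
    """Hypothetical record layout; field names are assumptions."""
    question: str
    answer: str
    category: str                      # one of CATEGORIES
    warehouse: str                     # "small" or "large"
    reasoning: Optional[str] = None    # extra reasoning answer, if any


def group_of(category: str) -> str:
    """Map a category to its top-level group ("safety" or "perception")."""
    if category in SAFETY:
        return "safety"
    if category in PERCEPTION:
        return "perception"
    raise ValueError(f"unknown category: {category}")


def total_pairs() -> int:
    """Sum the per-warehouse counts."""
    return sum(PAIRS_PER_WAREHOUSE.values())


if __name__ == "__main__":
    print(total_pairs())            # 971 + 373 = 1344
    print(group_of("human safety"))  # safety
```

The optional `reasoning` field mirrors the abstract's note that only some questions carry an extra reasoning answer.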

Paper Structure

This paper contains 33 sections, 1 equation, 16 figures, 2 tables.

Figures (16)

  • Figure 1: An illustration of the IndustryEQA benchmark, consisting of episodic memory videos and annotations. The annotations span six types, covering safety (equipment safety and human safety) and general perception capabilities (object recognition, attribute recognition, temporal understanding, and spatial understanding). The benchmark also includes extra reasoning answers for questions that require deeper thinking.
  • Figure 2: An illustration of industrial scenarios in two sizes (small and large), each comprising three types: empty, without humans, and with humans.
  • Figure 3: An illustration of the data generation pipeline. It consists of three main steps: capturing the video; generating and refining the question-answer pairs using an advanced LLM; and finally, having human experts manually filter out irrelevant pairs and reannotate selected ones.
  • Figure 4: IndustryEQA statistics for small and large warehouses: question category distribution (pie chart) and time distribution (box plot). The inner ring and outer ring indicate the reasoning and direct QA distributions, respectively. The red diamonds in the box plot denote the mean time.
  • Figure 5: An illustration of question ID 364 in the small warehouse.
  • ...and 11 more figures