IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Yifan Li; Yuhang Chen; Anh Dao; Lichi Li; Zhongyi Cai; Zhen Tan; Tianlong Chen; Yu Kong

TL;DR

IndustryEQA introduces the first industrial embodied question answering benchmark, targeting safety-critical warehouse scenarios simulated at high fidelity in Isaac Sim. It provides episodic memory videos and 1,344 QA pairs across six categories spanning safety and perception, supports additional reasoning annotations, and evaluates answers with an open-vocabulary, LLM-based scoring framework. The study finds that visual grounding substantially improves performance, but complex reasoning, especially under safety constraints, remains challenging, and architectural choices significantly influence results. The benchmark aims to drive the development of more robust, safety-aware embodied agents for real-world industrial environments and points to future extensions, including multi-modal signals and active learning paradigms.

Abstract

Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking the safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides an additional reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouses and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments. The benchmark and code are available.
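To make the benchmark composition concrete, the following is a minimal Python sketch of how a single QA pair and the split sizes could be represented. Only the six category names and the two split counts (971 and 373) come from the abstract; the `QAPair` record, its field names, and the `count_by_category` helper are hypothetical illustrations and may not match the layout of the released files.

```python
from collections import Counter
from dataclasses import dataclass

# The six annotation categories named in the abstract.
CATEGORIES = (
    "equipment safety",
    "human safety",
    "object recognition",
    "attribute recognition",
    "temporal understanding",
    "spatial understanding",
)

# QA-pair counts per warehouse size, as stated in the abstract.
PAIRS_PER_SPLIT = {"small": 971, "large": 373}


@dataclass(frozen=True)
class QAPair:
    """Hypothetical record layout; the released benchmark may differ."""

    video_id: str    # identifier of the episodic memory video (assumed field)
    warehouse: str   # "small" or "large"
    has_human: bool  # whether the scenario includes human agents
    category: str    # one of CATEGORIES
    question: str
    answer: str

    def __post_init__(self):
        # Reject records that fall outside the documented categories or splits.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.warehouse not in PAIRS_PER_SPLIT:
            raise ValueError(f"unknown warehouse size: {self.warehouse!r}")


def count_by_category(pairs):
    """Tally QA pairs per category."""
    return Counter(pair.category for pair in pairs)


# The two splits together give the benchmark's total number of QA pairs.
total_pairs = sum(PAIRS_PER_SPLIT.values())
```

Summing the two splits gives 971 + 373 = 1,344 pairs, which matches the total quoted in the TL;DR.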
