Image Quality Assessment for Embodied AI

Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai

TL;DR

This work introduces Embodied IQA, a quality assessment framework for Embodied AI that accounts for perception, cognition, decision-making, and execution. Grounded in a Mertonian Perception-Cognition-Decision-Execution pipeline, the authors build Embodied-IQA, a large dataset of 36.9k reference/distorted image pairs with over 5 million VLM/VLA-derived annotations, plus real-world validation on a UR5 robot. They benchmark 15 IQA metrics and find that traditional human-oriented metrics underperform in embodied contexts: full-reference (FR) methods are typically superior, but still leave a gap before reaching human-like perceptual reliability. The study demonstrates correlations between Cognition, Decision, and Execution in real-world tasks and argues for developing Embodied-specific quality indicators to enable robust robot deployment under real-world distortions.

Abstract

Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various real-world distortions limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we are the first to propose the topic of IQA for Embodied AI. Specifically, we (1) constructed a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs with more than 5 million fine-grained annotations provided by Vision Language Models (VLMs), Vision Language Action models (VLAs), and real-world robots; (3) trained and validated mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that this evaluation can promote the application of Embodied AI under complex real-world distortions. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA

Paper Structure

This paper contains 31 sections, 15 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: The significant gap between human, machine, and robot visual systems. Humans and Machines are sensitive to different distortions, while Robots have Decision and Execution steps beyond Cognition, highlighting the importance of a Perception quality index for Embodied AI.
  • Figure 2: Database construction of Embodied-IQA, with 30k+ large-scale reference/distorted image pairs, meticulously annotated with 2m+ fine-grained Cognition scores from 15 mainstream VLMs, 2m+ Decision scores from 15 VLAs, and 1.5k real-world experiments as Execution scores.
  • Figure 3: Benchmarking VLMs and VLAs across 3 score dimensions and 5 distortion levels. Their performance varies across the 3 dimensions and decreases as distortion increases. (Zoom in for detail)
  • Figure 4: Correlation matrix of VLM and VLA subjects; the a-o order follows the VLM and VLA sections. Darker colors denote a higher SRCC, with the averaged SRCC attached to the bottom of the matrix.
  • Figure 5: Decision scores visualized across 30 distortion subsets. Different colors denote distortion Levels 1 to 5. The impact of different distortions on VLAs varies significantly.
  • ...and 9 more figures
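
The Figure 4 caption reports pairwise SRCC (Spearman rank correlation) between annotating subjects. As a rough illustration of what such a matrix contains, here is a minimal sketch in plain Python. It is not the authors' code: the function names (`average_ranks`, `srcc`, `srcc_matrix`), the subject labels, and the score values are all hypothetical and chosen only for demonstration.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with the entry at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def srcc_matrix(scores: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise SRCC between every pair of subjects."""
    names = list(scores)
    return {a: {b: srcc(scores[a], scores[b]) for b in names} for a in names}


# Hypothetical per-image scores from three made-up subjects (not paper data).
example_scores = {
    "subject_a": [0.9, 0.7, 0.5, 0.2, 0.1],
    "subject_b": [0.8, 0.75, 0.4, 0.3, 0.05],
    "subject_c": [0.1, 0.2, 0.5, 0.7, 0.9],
}
matrix = srcc_matrix(example_scores)
```

Because SRCC depends only on rank order, two subjects that order the images identically correlate perfectly even when their raw scores differ, which is why it is a common choice for comparing quality scores on different scales.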