RoHOI: Robustness Benchmark for Human-Object Interaction Detection

Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu, Junwei Zheng, Alina Roitberg, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen

TL;DR

The paper addresses the fragility of HOI detection under real-world conditions by introducing RoHOI, the first robustness benchmark for the task, which applies 20 corruption types to HICO-DET and V-COCO. It also introduces two robustness metrics, MRI and CRI, defined as $MRI = \frac{1}{C}\sum_{c=1}^{C}\left(\frac{1}{L_c}\sum_{l=1}^{L_c} M_{c,l}\right)$ and $CRI = \frac{1}{C}\sum_{c=1}^{C}\left(\frac{\overline{M}_c}{M_{clean}}\cdot \frac{1}{\log(1+\sigma_c)+1}\right)$. To improve robustness, it proposes SAMPL, a Semantic-Aware Masking-based Progressive Learning approach that uses SAM-guided masks and a score-guided curriculum to train models on both holistic and partial cues. Empirical results show that SAMPL achieves state-of-the-art robustness across corruption categories while maintaining clean performance, and they highlight the importance of structured perturbations and cross-modal pretraining for reliable HOI detection in safety-critical applications. RoHOI, SAMPL, and the accompanying datasets and code provide a foundation for advancing robust HOI detection in real-world scenarios.
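To make the two metrics concrete, here is a minimal Python sketch of how MRI and CRI could be computed from per-corruption, per-severity scores. It rests on a reading of the formulas that the summary does not spell out: $C$ is taken as the number of corruption types, $L_c$ as the number of severity levels for corruption $c$, $M_{c,l}$ as the performance (e.g. mAP) at that corruption and level, $\overline{M}_c$ as its mean over levels, $\sigma_c$ as its standard deviation over levels, and $M_{clean}$ as the score on uncorrupted data. The natural logarithm and the population standard deviation are also assumptions, and the function names and example scores are hypothetical, not taken from the paper or its code.

```python
import math
from statistics import mean, pstdev


def mri(scores):
    """Mean of the per-corruption mean scores.

    scores: dict mapping a corruption name to a list of scores M_{c,l},
    one per severity level (assumed layout, not from the paper's code).
    """
    return mean(mean(levels) for levels in scores.values())


def cri(scores, clean_score):
    """Mean over corruptions of the clean-normalized mean score, damped by
    the score spread across severity levels.

    Assumptions: natural log, population standard deviation for sigma_c.
    """
    terms = []
    for levels in scores.values():
        m_bar = mean(levels)      # \overline{M}_c
        sigma = pstdev(levels)    # \sigma_c (assumed definition)
        terms.append((m_bar / clean_score) / (math.log(1.0 + sigma) + 1.0))
    return mean(terms)


# Made-up illustrative scores for two corruptions at five severity levels.
example_scores = {
    "noise": [30.0, 28.0, 26.0, 24.0, 22.0],
    "blur": [20.0, 20.0, 20.0, 20.0, 20.0],
}
example_clean = 40.0

print("MRI:", mri(example_scores))
print("CRI:", cri(example_scores, example_clean))
```

Under this reading, a corruption whose score is constant across severity levels has $\sigma_c = 0$, so its CRI term reduces to $\overline{M}_c / M_{clean}$. Any variation across levels makes the denominator larger than 1 and lowers the term, so CRI rewards performance that is both high relative to clean data and stable across severities.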

Abstract

Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate predictions. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code are available at https://github.com/KratosWen/RoHOI.

Paper Structure

This paper contains 18 sections, 8 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Our RoHOI dataset comprises 20 types of algorithmically generated corruptions, systematically categorized into four groups. Each corruption type has five levels of severity, leading to a total of 100 distinct corruptions. These corruptions simulate realistic semantic and structural disturbances specifically encountered in practical applications.
  • Figure 2: (a) Performance degradation of RLIPv2 (Yuan et al., 2023) on HICO-DET and V-COCO under RoHOI. The left radar chart shows the relative mAP (Full) drop on HICO-DET, while the right shows the decline in $\mathrm{AP}^{\#2}_{\text{role}}$ on V-COCO. Clean dataset performance serves as a reference, with shaded areas indicating corruption impact. (b) Comparison of SAMPL with strong two-stage and one-stage baselines trained on V-COCO under the RoHOI benchmark. The left bar chart groups performance by corruption category, while the right radar chart compares models across five severity levels for three corruptions. Each corruption is denoted by its abbreviation together with $L\gamma$, where $\gamma$ is the severity level. Details are given in the section on benchmark corruptions and in the Appendix.
  • Figure 3: Qualitative comparison of HOI detection under corruption between SAMPL (ours) and RLIPv2 (Yuan et al., 2023). The first row shows uncorrupted images, while the second and third rows present SAMPL and RLIPv2 results on corrupted images. Missed predictions are shown in dashed red, false interactions in red, and correct ones in blue.
  • Figure 4: Visual examples of Optical System (OS)-Induced artifacts across severity levels.
  • Figure 5: Visual examples of Sensor, Compression and Transmission (SCT) artifacts across severity levels.
  • ...and 2 more figures