RoHOI: Robustness Benchmark for Human-Object Interaction Detection
Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu, Junwei Zheng, Alina Roitberg, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen
TL;DR
The paper tackles the fragility of HOI detection in real-world conditions by introducing RoHOI, the first robustness benchmark for the task, which applies 20 corruption types to the HICO-DET and V-COCO datasets. It also defines two robustness metrics, MRI and CRI: $MRI = \frac{1}{C}\sum_{c=1}^{C}\left(\frac{1}{L_c}\sum_{l=1}^{L_c} M_{c,l}\right)$ and $CRI = \frac{1}{C}\sum_{c=1}^{C}\left(\frac{\overline{M}_c}{M_{clean}}\cdot \frac{1}{\log(1+\sigma_c)+1}\right)$. To improve robustness, it proposes SAMPL, a Semantic-Aware Masking-based Progressive Learning approach that uses SAM-guided masks and a score-guided curriculum to train models on both holistic and partial cues. Empirical results show that SAMPL achieves state-of-the-art robustness across corruption categories while maintaining clean performance, and they highlight the importance of structured perturbations and cross-modal pretraining for reliable HOI detection in safety-critical applications. RoHOI, SAMPL, and the accompanying datasets and code provide a foundation for advancing robust HOI detection in real-world scenarios.
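To make the MRI and CRI formulas in the TL;DR concrete, here is a minimal Python sketch of how they could be computed. It is not the authors' implementation. The symbol meanings are assumptions, since the text above does not define them: $M_{c,l}$ is taken to be a performance score (for example mAP) under corruption type $c$ at severity level $l$, $\overline{M}_c$ the mean of those scores over severity levels, $\sigma_c$ their population standard deviation, $M_{clean}$ the score on uncorrupted data, and $\log$ the natural logarithm. The function names `mri` and `cri` and the dictionary layout are hypothetical.

```python
import math
from statistics import mean, pstdev


def mri(scores):
    """Average, over corruption types, of the mean score across severity levels.

    scores: dict mapping corruption name -> list of per-severity scores
            (assumed layout, not taken from the paper).
    """
    return mean([mean(levels) for levels in scores.values()])


def cri(scores, clean_score):
    """Average, over corruption types, of the retained-performance ratio
    damped by the spread of scores across severity levels.

    Assumes sigma_c is the population standard deviation and log is natural.
    """
    terms = []
    for levels in scores.values():
        m_bar = mean(levels)        # mean score for this corruption type
        sigma = pstdev(levels)      # spread across severity levels
        terms.append((m_bar / clean_score) / (math.log(1.0 + sigma) + 1.0))
    return mean(terms)


if __name__ == "__main__":
    # Made-up numbers purely for illustration; not results from the paper.
    example = {
        "gaussian_noise": [30.0, 20.0, 10.0],
        "motion_blur": [25.0, 25.0, 25.0],
    }
    print("MRI:", mri(example))
    print("CRI:", cri(example, clean_score=40.0))
```

Under these assumptions, a corruption type whose scores are identical at every severity level has $\sigma_c = 0$, so its CRI term reduces to the plain ratio $\overline{M}_c / M_{clean}$. Any variation across severity levels makes the denominator larger than 1 and lowers the term.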
Abstract
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate predictions. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy that guides optimization using both holistic and partial cues, dynamically adjusting training to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code are available at https://github.com/KratosWen/RoHOI.
