B-RIGHT: Benchmark Re-evaluation for Integrity in Generalized Human-Object Interaction Testing
Yoojin Jang, Junsu Kim, Hayeon Kim, Eun-ki Lee, Eun-sol Kim, Seungryul Baek, Jaejun Yoo
TL;DR
B-RIGHT tackles the reliability issue of HOI evaluation by introducing a class-balanced benchmark with 351 HOI classes, where each class has exactly 50 train and 10 test instances, plus a balanced zero-shot set of 10 instances per class. The framework combines a balancing algorithm with retrieval-augmented generation and multi-stage filtering (VLM/LLM) to synthesize high-quality, balanced data, supplemented by real data for evaluation. Re-evaluating state-of-the-art HOI detectors on B-RIGHT reveals substantially reduced per-class AP variance and notable ranking shifts, especially favoring two-stage architectures that decouple detection and interaction classification. This balanced benchmark exposes latent biases, clarifies how architectural choices and pretraining influence generalization, and offers a practical path toward fairer, more robust HOI evaluation in real-world settings.
Abstract
Human-object interaction (HOI) is an essential problem in artificial intelligence (AI) that aims to understand a visual world involving complex relationships between humans and objects. However, current benchmarks such as HICO-DET face two limitations: (1) severe class imbalance and (2) a varying number of train and test instances for certain classes. These issues can inflate or deflate measured model performance, ultimately undermining the reliability of evaluation scores. In this paper, we propose a systematic approach to develop a new class-balanced dataset, Benchmark Re-evaluation for Integrity in Generalized Human-object Interaction Testing (B-RIGHT), that addresses these imbalance problems. B-RIGHT achieves class balance by leveraging a balancing algorithm and automated generation-and-filtering processes, ensuring an equal number of instances for each HOI class. Furthermore, we design a balanced zero-shot test set to systematically evaluate models on unseen scenarios. Re-evaluating existing models using B-RIGHT reveals a substantial reduction in score variance and changes in performance rankings compared to conventional HICO-DET. Our experiments demonstrate that evaluation under balanced conditions ensures more reliable and fair model comparisons.
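To make the equal-instances-per-class constraint concrete, the following is a minimal sketch of per-class capping by random subsampling. It is not the paper's balancing algorithm, which is not specified here. The function name `balance_split`, the `(instance_id, hoi_class)` input format, and the toy class names are assumptions made for illustration. Classes that fall short of the target are only reported, since in B-RIGHT such gaps are filled by the generation-and-filtering process.

```python
import random
from collections import defaultdict

# Per-class targets stated in the TL;DR above.
TRAIN_PER_CLASS = 50
TEST_PER_CLASS = 10


def balance_split(instances, per_class, seed=0):
    """Cap every HOI class at `per_class` instances (hypothetical helper).

    instances: iterable of (instance_id, hoi_class) pairs.
    Returns (selected, shortfall), where `selected` maps each class to the
    chosen instance ids and `shortfall` maps under-represented classes to
    the number of instances still missing.
    """
    rng = random.Random(seed)

    # Group instance ids by their HOI class.
    by_class = defaultdict(list)
    for instance_id, hoi_class in instances:
        by_class[hoi_class].append(instance_id)

    selected, shortfall = {}, {}
    for hoi_class, ids in by_class.items():
        if len(ids) >= per_class:
            # Enough data: draw exactly `per_class` ids without replacement.
            selected[hoi_class] = rng.sample(ids, per_class)
        else:
            # Too little data: keep everything and record the gap.
            selected[hoi_class] = list(ids)
            shortfall[hoi_class] = per_class - len(ids)
    return selected, shortfall


# Toy data: one frequent class and one rare class (names are illustrative).
toy = [(f"a{i}", "ride bicycle") for i in range(80)]
toy += [(f"b{i}", "wash cat") for i in range(3)]

train_selected, train_shortfall = balance_split(toy, TRAIN_PER_CLASS)
print({c: len(ids) for c, ids in train_selected.items()})
print(train_shortfall)
```

In this toy run the frequent class is cut down to 50 instances, while the rare class keeps its 3 instances and is reported as 47 short of the training target. The same helper can be called with `TEST_PER_CLASS` to build the test split.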
