B-RIGHT: Benchmark Re-evaluation for Integrity in Generalized Human-Object Interaction Testing

Yoojin Jang, Junsu Kim, Hayeon Kim, Eun-ki Lee, Eun-sol Kim, Seungryul Baek, Jaejun Yoo

TL;DR

B-RIGHT tackles the reliability issue of HOI evaluation by introducing a class-balanced benchmark with 351 HOI classes, where each class has exactly 50 train and 10 test instances, plus a balanced zero-shot set of 10 instances per class. The framework combines a balancing algorithm with retrieval-augmented generation and multi-stage filtering (VLM/LLM) to synthesize high-quality, balanced data, supplemented by real data for evaluation. Re-evaluating state-of-the-art HOI detectors on B-RIGHT reveals substantially reduced per-class AP variance and notable ranking shifts, especially favoring two-stage architectures that decouple detection and interaction classification. This balanced benchmark exposes latent biases, clarifies how architectural choices and pretraining influence generalization, and offers a practical path toward fairer, more robust HOI evaluation in real-world settings.

Abstract

Human-object interaction (HOI) is an essential problem in artificial intelligence (AI) that aims to understand the visual world involving complex relationships between humans and objects. However, current benchmarks such as HICO-DET face the following limitations: (1) severe class imbalance and (2) varying train and test set sizes for certain classes. These issues can potentially lead to either inflation or deflation of model performance during evaluation, ultimately undermining the reliability of evaluation scores. In this paper, we propose a systematic approach to develop a new class-balanced dataset, Benchmark Re-evaluation for Integrity in Generalized Human-object Interaction Testing (B-RIGHT), that addresses these imbalance problems. B-RIGHT achieves class balance by leveraging a balancing algorithm and automated generation-and-filtering processes, ensuring an equal number of instances for each HOI class. Furthermore, we design a balanced zero-shot test set to systematically evaluate models on unseen scenarios. Re-evaluating existing models using B-RIGHT reveals a substantial reduction in score variance and changes in performance rankings compared to conventional HICO-DET. Our experiments demonstrate that evaluation under balanced conditions ensures more reliable and fair model comparisons.
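To make the inflation/deflation concern concrete, the sketch below shows how a single misjudged detection moves the AP of a class with few test instances far more than that of a class with many. It is an illustration under stated assumptions, not the paper's evaluation code: it uses a simplified, non-interpolated AP, assumes every ground-truth instance is detected, and flips the top-ranked true positive to a false positive. The function names and the class sizes (2 and 20) are hypothetical.

```python
def average_precision(flags, n_gt):
    """Simplified non-interpolated AP.

    flags: detections sorted by descending confidence, True = TP, False = FP.
    n_gt:  number of ground-truth instances of the class in the test set.
    """
    tp_seen = 0
    precision_sum = 0.0
    for rank, is_tp in enumerate(flags, start=1):
        if is_tp:
            tp_seen += 1
            precision_sum += tp_seen / rank  # precision at this TP's rank
    return precision_sum / n_gt


def ap_after_single_flip(n_gt):
    """AP when the top-ranked TP of an otherwise perfect ranking becomes an FP."""
    flags = [True] * n_gt
    flags[0] = False  # hypothetical perturbation: one TP turned into an FP
    return average_precision(flags, n_gt)


# Both classes start from a perfect AP of 1.0 (assumed for illustration).
ap_less = ap_after_single_flip(2)    # class with few test instances
ap_many = ap_after_single_flip(20)   # class with many test instances

drop_less = 1.0 - ap_less
drop_many = 1.0 - ap_many
print(f"few-instance class:  AP 1.00 -> {ap_less:.2f}")
print(f"many-instance class: AP 1.00 -> {ap_many:.2f}")
```

With equal test set sizes per class, as in B-RIGHT, a single error of this kind shifts every class's AP by a comparable amount, so per-class scores become easier to compare.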

Paper Structure

This paper contains 56 sections, 1 equation, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a) Example images of B-RIGHT. The proposed dataset ensures a uniform distribution of 351 HOI categories, with 50, 10, and 10 instances for train sets, test sets, and zero-shot evaluation, respectively. (b) Ranking shifts between HICO-DET and B-RIGHT. Circle sizes indicate the variance in class-wise AP scores within each detector, while arrows and numbers denote ranking shifts.
  • Figure 2: Problem analysis for HOI classes on HICO-DET: (a) long-tailed train / test sets, (b) varying train / test set sizes. The HOI class distribution shows that common classes ① and ② are likely to be well-represented in the real world. However, as we move towards the tail, classes become very rare. Note that classes in the extreme tail, like ③ and ④, become particularly rare or ambiguous, highlighting the inherent limitations of the HOI detection problem.
  • Figure 3: Impact of flipping a single TP instance to FP instance for two classes with similar initial AP scores and train set sizes but different test set sizes. We label the class with fewer test instances as Less (orange) and the class with more instances as Many (green). Each circle and hexagon represent the original AP and the perturbed AP, respectively, with the numbers below each symbol indicating the AP. The arrows and numbers denote the percentage decrease from the original AP to the perturbed AP.
  • Figure 4: Overview of our generation-and-filtering schemes. (a) Retrieval-augmented generation: we retrieve an image from HICO-DET and use a VLM to form a descriptive prompt in a predefined template, which SDXL then uses to create a synthetic image. (b) Filtering process: An open-world detector identifies all people and objects, after which another VLM and an LLM verify whether the image correctly depicts the target HOI. Images that do not pass this verification are discarded, and the original prompts are paraphrased to generate new images until we collect enough valid samples or reach our generation limit.
  • Figure 5: Examples of the generated and crawled images with (verb, object) pairs after our augmentation process.
  • ...and 7 more figures
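
The generate-filter-paraphrase loop described in the Figure 4 caption can be sketched as plain control flow. This is a minimal sketch, not the authors' implementation: `generate`, `passes_filter`, and `paraphrase` are hypothetical callables standing in for the image generator, the detector plus VLM/LLM verification, and the prompt paraphraser, and `needed` and `max_attempts` are assumed parameter names for the sample target and the generation limit.

```python
import itertools
from typing import Callable, List


def collect_valid_images(
    base_prompt: str,
    generate: Callable[[str], str],
    passes_filter: Callable[[str], bool],
    paraphrase: Callable[[str], str],
    needed: int,
    max_attempts: int,
) -> List[str]:
    """Generate images until `needed` pass the filter or the limit is hit."""
    valid: List[str] = []
    prompt = base_prompt
    for _ in range(max_attempts):
        if len(valid) >= needed:
            break
        image = generate(prompt)
        if passes_filter(image):
            valid.append(image)
        else:
            # Rejected image is discarded; retry with a paraphrase
            # of the original prompt.
            prompt = paraphrase(base_prompt)
    return valid


# Toy stand-ins (hypothetical): images are strings, even samples pass.
_counter = itertools.count()

def toy_generate(prompt: str) -> str:
    return f"{prompt} [sample {next(_counter)}]"

def toy_filter(image: str) -> bool:
    sample_id = int(image.rsplit(" ", 1)[1].rstrip("]"))
    return sample_id % 2 == 0

def toy_paraphrase(prompt: str) -> str:
    return prompt + " (paraphrased)"


images = collect_valid_images(
    "a person riding a bicycle",
    toy_generate, toy_filter, toy_paraphrase,
    needed=3, max_attempts=10,
)
print(len(images))
```

Keeping the filter as a single predicate leaves the loop independent of how verification is staged. A class that cannot reach `needed` valid samples within `max_attempts` simply returns fewer images, which the caller has to handle.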