A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

Qinqian Lei, Bo Wang, Robby T. Tan

Abstract

HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.
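
To make the evaluation contrast concrete, the following is a minimal sketch, not the authors' implementation. It assumes each question lists explicit positive and negative options and that Macro-F1 is the unweighted mean of per-label F1; the paper may define the metric differently. The function names, the dictionary layout, and the example labels are all hypothetical.

```python
from collections import defaultdict


def exact_match_false_positives(predicted, annotated):
    """Predictions that an exact-match protocol would mark as wrong."""
    return set(predicted) - set(annotated)


def macro_f1(questions, answers):
    """Unweighted mean of per-label F1 over multiple-choice questions.

    questions: list of dicts with "positives" and "negatives" (sets of labels).
    answers:   list of sets, the options a model selected for each question.
    Assumption: this is one plausible reading of Macro-F1, not the paper's code.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for q, chosen in zip(questions, answers):
        options = q["positives"] | q["negatives"]
        chosen = set(chosen) & options  # ignore anything outside the listed options
        for label in chosen & q["positives"]:
            tp[label] += 1
        for label in chosen & q["negatives"]:
            fp[label] += 1
        for label in q["positives"] - chosen:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical questions: explicit positives plus curated negatives.
questions = [
    {"positives": {"ride bicycle", "hold bicycle"},
     "negatives": {"repair bicycle", "wash bicycle"}},
    {"positives": {"hold cup"},
     "negatives": {"wash cup", "fill cup", "ride bicycle"}},
]
answers = [{"ride bicycle", "hold bicycle"}, {"hold cup", "ride bicycle"}]

# Under incomplete annotation, a valid but unlabeled prediction is penalized.
penalized = exact_match_false_positives(
    {"ride bicycle", "hold bicycle"}, {"ride bicycle"}
)
score = macro_f1(questions, answers)
```

In this toy example, exact matching penalizes "hold bicycle" only because it was never annotated, whereas the multiple-choice scoring credits it as a listed positive and penalizes only the selection of a curated negative.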

Paper Structure

This paper contains 37 sections, 4 equations, 52 figures, 24 tables.

Figures (52)

  • Figure 1: (a) Existing HOI benchmarks (e.g., HICO-DET) rely on exact-match evaluation under incomplete annotations, penalizing valid yet unlabeled interactions. (b) Our multi-choice benchmark accepts multiple correct answers, avoiding false negatives and enabling unified evaluation of HOI-specific methods and VLMs. (c) Comparison of state-of-the-art VLMs (InternVL3 [InternVL3_2025], Qwen2.5-VL-32B [Qwen_2_5_report]) and HOI-specific methods (ADA-CM [lei2023efficient], CMMP [lei2024exploring], HOLa [lei2025lhola]). Results are shown using Macro-F1 in our benchmark (Setting 1) versus mean Average Precision (mAP) in HICO-DET.
  • Figure 2: (a) The HICO-DET test set, whose distribution is similar to that of the HICO-DET train set, includes many simple and repetitive scenes for head classes. (b) Comparison between existing HOI benchmarks [bongardhoi, wang2022learning, lin2014microsoft, chao2018learning] and ours. Percentages indicate the proportion of test images that fall into each scenario type (e.g., single-person single-object, multi-person different HOIs).
  • Figure 3: Overview of our HOI benchmark construction. Each input image undergoes coarse screening and manual refinement to produce a four-choice question, followed by evaluation under three settings.
  • Figure 5: Evaluation on our HICO-DET-based, V-COCO-based, and SWiG-HOI-based sub-benchmarks in Settings 1 and 2. "InternVL" refers to InternVL3 and "Qwen" refers to Qwen2.5-VL.
  • Figure 6: Comparison of experimental results on our CrossHOI-Bench and on the full HICO-DET-based dataset in Settings 1 and 2. "InternVL" refers to InternVL3 and "Qwen" refers to Qwen2.5-VL.
  • ...and 47 more figures