
An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set

Chaoyi Ai

TL;DR

This paper investigates Human-Object Interaction (HOI) detection in a training-free regime where only the test dataset is available, using multimodal visual foundation models to predict $\langle human, object, verb \rangle$ triplets. It analyzes three input configurations (paired ground-truth boxes, arbitrarily recombined ground-truth pairs, and unpaired GroundingDINO boxes) to probe the open-vocabulary capabilities of current models. The findings indicate that open-vocabulary HOI is not yet fully realized in these training-free settings; results are consistent across the seen/unseen and rare/non-rare splits, and GroundingDINO-based inputs largely corroborate this limitation. The work highlights the persistent gap in zero-shot and few-shot transfer for HOI in multimodal visual foundation models, motivating improvements in linguistic guidance and feature alignment for truly training-free inference. The study provides a structured framework for evaluating HOI under extreme data constraints and informs practical expectations for deploying open-vocabulary HOI systems without training data.

Abstract

Human-Object Interaction (HOI) detection aims to identify pairs of humans and objects in images and to recognize their relationships, ultimately forming $\langle human, object, verb \rangle$ triplets. Under default settings, HOI performance is nearly saturated, so many studies now focus on long-tailed distributions and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if there is only a test dataset and no training dataset, and a multimodal visual foundation model is used in a training-free manner?" This study uses two experimental settings: ground truth and random arbitrary combinations. We reach several interesting conclusions and find that the open-vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with GroundingDINO further confirms these findings.
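
The two settings named in the abstract differ in how candidate human-object pairs are formed. The sketch below is only an illustration of that difference, not the paper's implementation: the annotation layout, the box values, and the function names `paired_candidates` and `recombined_candidates` are all hypothetical, and the recombined setting is assumed to mean pairing every human box with every object box.

```python
from itertools import product

def paired_candidates(annotations):
    """Ground-truth setting: keep each annotated (human, object) pair as given."""
    return [(a["human"], a["object"]) for a in annotations]

def recombined_candidates(annotations):
    """Arbitrary-combination setting (assumed): pair every distinct human box
    with every distinct object box, ignoring the annotated pairing."""
    humans = list(dict.fromkeys(a["human"] for a in annotations))
    objects = list(dict.fromkeys(a["object"] for a in annotations))
    return list(product(humans, objects))

# Hypothetical annotations; boxes are (x1, y1, x2, y2) in pixels.
annotations = [
    {"human": (10, 10, 50, 120), "object": (40, 60, 90, 110), "verb": "ride"},
    {"human": (200, 15, 240, 130), "object": (230, 70, 260, 100), "verb": "hold"},
]

paired = paired_candidates(annotations)
recombined = recombined_candidates(annotations)
```

Under this assumption the recombined set contains every annotated pair plus mismatched ones, so a training-free model must also reject pairs with no interaction, which is the harder case.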

Paper Structure

This paper contains 13 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Our motivation.
  • Figure 2: The model using the paired ground truth.