HICO-DET-SG and V-COCO-SG: New Data Splits for Evaluating the Systematic Generalization Performance of Human-Object Interaction Detection Models

Kentaro Takemoto; Moyuru Yamada; Tomotake Sasaki; Hisanao Akima

TL;DR

This work introduces two sets of systematic generalization (SG) data splits for HOI detection, HICO-DET-SG and V-COCO-SG, to evaluate a model's ability to generalize to novel object-interaction combinations. The train and test data in each split share no object-interaction combination, so models are tested only on pairings unseen during training. Four HOI detectors (HOTR, QPIC, FGAHOI, and STIP) show significant performance drops on these splits compared with the original ones, illustrating the challenge of compositional generalization. The authors' analysis indicates that model architecture (notably two-stage modular designs) and pretraining influence SG performance, and they propose four directions to improve generalization: diversifying training data, adopting modular architectures, leveraging pretraining, and incorporating natural language resources. The work also provides reproducible SG-split data and code, aiming to spur further research in systematic generalization for HOI detection and related tasks.

Abstract

Human-Object Interaction (HOI) detection is a task to localize humans and objects in an image and predict the interactions in human-object pairs. In real-world scenarios, HOI detection models need systematic generalization, i.e., generalization to novel combinations of objects and interactions, because the train data are expected to cover a limited portion of all possible combinations. To evaluate the systematic generalization performance of HOI detection models, we created two new sets of HOI detection data splits named HICO-DET-SG and V-COCO-SG based on the HICO-DET and V-COCO datasets, respectively. When evaluated on the new data splits, HOI detection models with various characteristics performed much more poorly than when evaluated on the original splits. This shows that systematic generalization is a challenging goal in HOI detection. By analyzing the evaluation results, we also gain insights for improving the systematic generalization performance and identify four possible future research directions. We hope that our new data splits and presented analysis will encourage further research on systematic generalization in HOI detection.

Paper Structure (25 sections, 5 figures, 4 tables, 1 algorithm)

Introduction
Related work
Overview of Human-Object Interaction (HOI) detection
Studies related to systematic generalization in HOI detection
HICO-DET-SG and V-COCO-SG
The creation process of the systematic generalization (SG) splits
Statistics of the HICO-DET-SG and V-COCO-SG
Experimental setups for evaluating HOI detection models
HOI detection models
HOTR.
QPIC.
FGAHOI.
STIP.
Pretraining, hyperparameters, and other conditions
Evaluation results
...and 10 more sections

Figures (5)

Figure 1: Illustration of a data split for evaluating the systematic generalization performance of Human-Object Interaction (HOI) detection models. All images and annotations are selected from HICO-DET-SG split3. The train data consists of combinations such as <human, wash, car>, <human, wash, elephant>, <human, walk, horse>, and <human, straddle, horse>. After being trained on such data, an HOI detection model is tested on whether it can generalize to novel combinations in the test data, such as <human, wash, horse>. To systematically generalize to such novel combinations, the model must learn the visual cues of the object (in this case, horse) and the interaction (in this case, wash) independently of the specific interaction/object classes they are paired with in the train data.
Figure 2: Results on HICO-DET-SG
Figure 3: Results on V-COCO-SG
Figure 5: Three failure cases of STIP with the pretrained encoder and decoder after training and testing on HICO-DET-SG split3. (a) An example of predicting the wrong interaction class. The model predicted the interaction as straddle, although the correct class is wash. (b) An example of detecting the wrong object. The model detected an irrelevant region and assigned it the wrong class bench, although it should have detected the bed under the person. (c) An example of wrong class predictions for both the object and the interaction. The model predicted the triplet <human, hit, baseball bat>, although the correct answer is <human, swing, tennis racket>.
Figure : Creation of the systematic generalization (SG) splits.
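
The split property illustrated in Figure 1 can be sketched as a small consistency check. This is only a minimal illustration, not the authors' split-creation procedure: the function name `check_sg_split` and the tuple layout are hypothetical, and the requirement that every object and interaction class in the test data also appears in the train data is an assumption drawn from the Figure 1 description (novel combinations of otherwise familiar classes).

```python
from typing import Dict, Iterable, Tuple

# Hypothetical layout for one annotation: (subject, interaction, object).
Triplet = Tuple[str, str, str]


def check_sg_split(train: Iterable[Triplet], test: Iterable[Triplet]) -> Dict[str, object]:
    """Report whether a train/test pair looks like a systematic generalization split.

    Assumed conditions (not confirmed as the authors' exact criteria):
      1. no object-interaction combination appears in both train and test;
      2. every object and interaction class in test also appears in train.
    """
    train_set, test_set = set(train), set(test)

    # Condition 1: combinations shared by train and test.
    overlap = train_set & test_set

    # Condition 2: classes that appear in test but never in train.
    train_objects = {obj for _, _, obj in train_set}
    train_interactions = {inter for _, inter, _ in train_set}
    unseen_objects = {obj for _, _, obj in test_set} - train_objects
    unseen_interactions = {inter for _, inter, _ in test_set} - train_interactions

    return {
        "overlapping_combinations": overlap,
        "unseen_objects": unseen_objects,
        "unseen_interactions": unseen_interactions,
        "is_valid": not overlap and not unseen_objects and not unseen_interactions,
    }


# Combinations taken from the Figure 1 caption.
train = [
    ("human", "wash", "car"),
    ("human", "wash", "elephant"),
    ("human", "walk", "horse"),
    ("human", "straddle", "horse"),
]
test = [("human", "wash", "horse")]

report = check_sg_split(train, test)
print(report["is_valid"])  # prints True
```

In this toy example the test combination <human, wash, horse> never occurs in the train data, while both wash and horse do, so the check passes. Using the train data as its own test set would fail on the first condition.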
