
Unseen No More: Unlocking the Potential of CLIP for Generative Zero-shot HOI Detection

Yixin Guo, Yu Liu, Jianghao Li, Weimin Wang, Qi Jia

TL;DR

This work addresses zero-shot HOI detection by tackling the seen-unseen bias that hampers CLIP-based embedding methods. It introduces HOIGen, a generation-based framework that uses a CLIP-injected VAE to synthesize human, object, and union features, training with both real and synthetic data. The model deploys two HOI recognition branches—pairwise and image-wise—coupled with a generative prototype bank and a multi-knowledge prototype bank to produce robust scores for seen and unseen HOIs, achieving state-of-the-art results on HICO-DET. The approach reduces seen-unseen confusion and demonstrates the practical potential of CLIP-driven feature generation for open-domain HOI understanding.

Abstract

A zero-shot human-object interaction (HOI) detector can generalize to HOI categories that were never encountered during training. Inspired by the impressive zero-shot capabilities of CLIP, recent methods leverage CLIP embeddings to improve zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, which inevitably causes seen-unseen confusion during inference. Moreover, we find that prompt tuning and adapters further widen the gap between seen and unseen accuracy. To tackle this challenge, we present HOIGen, the first generation-based model that uses CLIP for zero-shot HOI detection. It unlocks the potential of CLIP for feature generation rather than feature extraction alone. To this end, we develop a CLIP-injected feature generator that synthesizes human, object, and union features. We then extract realistic features of seen samples and mix them with the synthetic features, allowing the model to train on seen and unseen classes jointly. To enrich the HOI scores, we construct a generative prototype bank in a pairwise HOI recognition branch and a multi-knowledge prototype bank in an image-wise HOI recognition branch. Extensive experiments on the HICO-DET benchmark demonstrate that HOIGen achieves superior performance on both seen and unseen classes under various zero-shot settings compared with other top-performing methods. Code is available at: https://github.com/soberguo/HOIGen
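The idea of mixing realistic features of seen classes with synthetic features so that seen and unseen classes can be trained jointly can be illustrated with a small sketch. This is not the HOIGen implementation: the function names, the plain-list feature vectors, and the noise-based stand-in generator are all assumptions made for illustration. In the paper the synthetic features come from a CLIP-injected VAE, which the placeholder `synthesize_features` below does not reproduce. The sketch also assumes synthetic features are generated only for unseen classes.

```python
import random


def synthesize_features(class_embedding, num_samples, noise_std=0.1, rng=None):
    """Hypothetical stand-in generator: perturbs a class embedding with Gaussian noise.

    A trained feature generator (a CLIP-injected VAE in the paper) would replace this.
    """
    rng = rng or random.Random(0)
    return [
        [value + rng.gauss(0.0, noise_std) for value in class_embedding]
        for _ in range(num_samples)
    ]


def build_joint_training_set(real_features, unseen_embeddings, samples_per_class, rng=None):
    """Mix real features (seen classes) with synthetic features (unseen classes).

    real_features:     dict mapping a seen class name to a list of feature vectors.
    unseen_embeddings: dict mapping an unseen class name to one embedding vector.
    Returns a shuffled list of (feature, label, source) tuples.
    """
    rng = rng or random.Random(0)
    dataset = []
    for label, features in real_features.items():
        dataset.extend((feature, label, "real") for feature in features)
    for label, embedding in unseen_embeddings.items():
        for feature in synthesize_features(embedding, samples_per_class, rng=rng):
            dataset.append((feature, label, "synthetic"))
    rng.shuffle(dataset)  # interleave real and synthetic samples
    return dataset


if __name__ == "__main__":
    real = {"ride bicycle": [[0.1, 0.2], [0.3, 0.4]]}
    unseen = {"feed horse": [1.0, 0.0]}
    for feature, label, source in build_joint_training_set(real, unseen, 3):
        print(label, source, [round(v, 3) for v in feature])
```

A classifier trained on such a mixed set sees examples for every class name, which is the property the abstract credits with reducing seen-unseen confusion.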

Paper Structure (19 sections, 7 equations, 6 figures, 9 tables)

This paper contains 19 sections, 7 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Differences between the existing embedding-based paradigm and our generation-based paradigm for zero-shot HOI detection. The former exploits CLIP to train visual and semantic embeddings of seen HOI categories only. Beyond that, our generation-based paradigm develops a new CLIP-injected feature generation module that accepts either seen or unseen class names. The generated features enable the model to train on seen and unseen HOI categories jointly. In addition, we construct a generative prototype bank and a multi-knowledge prototype bank to enrich the HOI scores.
  • Figure 2: Comparison of our method and ADA-CM on unseen and seen categories of the HICO-DET dataset, under the Non-rare First Unseen Combination (NF-UC) setting.
  • Figure 3: Overview of the proposed HOIGen model, which comprises CLIP-injected feature generation, pairwise HOI recognition, and image-wise HOI recognition. We inject CLIP image-text encoders into a variational auto-encoder, which synthesizes image features in a two-stage fashion. The pairwise HOI recognition branch uses CLIP image features in conjunction with the bounding boxes obtained from a pre-trained DETR. The resulting features are fed into a generative prototype bank to compute pairwise HOI scores. The image-wise HOI recognition branch extracts global features by combining CLIP and DINO encoders and constructs a multi-knowledge prototype bank for image-wise HOI scores. Finally, the scores from the two branches are combined to predict the HOI category.
  • Figure 4: Visualization of realistic (light regions) and synthesized (dark regions) features using t-SNE, for unseen HOI categories from the HICO-DET dataset. We synthesize 100 features per category. The left panel shows the feature distributions of different <ACTION, OBJECT> pairs, and the right panel shows the features of different objects.
  • Figure 5: Construction of the generative prototype bank.
  • ...and 1 more figures
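
The Figure 3 caption describes two branches whose scores are combined to predict the HOI category. The sketch below shows one plausible way such a combination could work, under stated assumptions: each prototype bank is a mapping from class name to a list of prototype vectors, a class is scored by its best cosine similarity to the query feature, and the two branches are fused by a weighted sum. None of these choices (cosine similarity, max over prototypes, the `weight` parameter, the function names) is taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bank_scores(feature, prototype_bank):
    """Score each class by its best-matching prototype (an assumed scoring rule)."""
    return {
        label: max(cosine_similarity(feature, prototype) for prototype in prototypes)
        for label, prototypes in prototype_bank.items()
    }


def fuse_scores(pairwise, imagewise, weight=0.5):
    """Weighted sum of the two branch scores over the classes both branches cover."""
    labels = pairwise.keys() & imagewise.keys()
    return {
        label: weight * pairwise[label] + (1.0 - weight) * imagewise[label]
        for label in labels
    }


if __name__ == "__main__":
    # Toy 2-D prototypes; real banks would hold high-dimensional image features.
    pair_bank = {"ride bicycle": [[1.0, 0.0]], "feed horse": [[0.0, 1.0]]}
    image_bank = {"ride bicycle": [[0.9, 0.1]], "feed horse": [[0.2, 0.8]]}
    pairwise = bank_scores([1.0, 0.1], pair_bank)
    imagewise = bank_scores([0.8, 0.3], image_bank)
    fused = fuse_scores(pairwise, imagewise)
    print(max(fused, key=fused.get))
```

Because the banks are keyed by class name, prototypes for unseen classes (for example, ones built from synthesized features) can sit alongside seen-class prototypes without changing the scoring code.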