Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration

Weiying Xue; Qi Liu; Qiwei Xiong; Yuxiao Wang; Zhenao Wei; Xiaofen Xing; Xiangmin Xu

TL;DR

KI2HOI (Knowledge Integration to HOI) is a framework that integrates knowledge from a vision-language model to improve zero-shot HOI detection, and introduces an additive self-attention mechanism to generate more comprehensive visual representations.

Abstract

Human-object interaction (HOI) detection aims to locate human-object pairs and identify their interaction categories in images. Most existing methods focus on supervised learning, which relies on extensive manual HOI annotations. In this paper, we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that integrates knowledge from a vision-language model to improve zero-shot HOI detection. Specifically, the verb feature learning module is designed based on visual semantics and employs a verb extraction decoder to convert verb queries into interaction-specific category representations. We develop an additive self-attention mechanism to generate more comprehensive visual representations. Moreover, the interaction representation decoder extracts informative regions by integrating spatial and visual feature information through a cross-attention mechanism. To handle zero-shot learning in low-data regimes, we leverage prior knowledge from the CLIP text encoder to initialize the linear classifier for enhanced interaction understanding. Extensive experiments on the mainstream HICO-DET and V-COCO datasets demonstrate that our model outperforms previous methods in various zero-shot and fully supervised settings.
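
The idea of initializing a linear classifier from text-encoder knowledge can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors below are hypothetical stand-ins for CLIP text embeddings, the label strings are illustrative, and the function names and the `scale` parameter are assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def init_classifier_from_text(text_embeddings):
    """Use one L2-normalized text embedding per class as a classifier weight row."""
    return [l2_normalize(e) for e in text_embeddings]

def classify(feature, weights, scale=10.0):
    """Softmax over scaled cosine similarities between a feature and each class row."""
    f = l2_normalize(feature)
    logits = [scale * sum(a * b for a, b in zip(f, w)) for w in weights]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings standing in for text-encoder outputs (assumption).
labels = ["ride bicycle", "hold cup", "feed horse"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

weights = init_classifier_from_text(text_embeddings)
probs = classify([0.9, 0.1, 0.2], weights)  # hypothetical interaction feature
predicted = labels[probs.index(max(probs))]
```

Because the weight rows come from text rather than from labeled training examples, a class with no visual training data still receives a usable classifier row, which is what makes this kind of initialization attractive for zero-shot settings.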

Paper Structure (18 sections, 18 equations, 6 figures, 6 tables)

INTRODUCTION
RELATED WORKS
Human-Object Interaction Detection
Vision-and-Language Pre-training
Zero-shot HOI
METHODS
Overall Architecture
Visual Encoder
Verb Feature Learning
Interaction Semantic Representation (ISR)
Training and Inference
EXPERIMENTS
Experimental Setup
Effectiveness for HOI Detection
Ablation Studies
...and 3 more sections

Figures (6)

Figure 1: Comparison of HOI detection approaches. Conventional HOI detection requires manually annotated datasets for training. Previous HOI detection with a language model (LM) distills only limited knowledge into visual detectors and therefore struggles with potential interactions among unseen human-object pairs. Our model fully leverages a vision-language model (VLM) and verb queries for effective knowledge integration, improving recognition of unseen interactions.
Figure 2: Overview of the KI2HOI pipeline. It consists of four parts: visual encoder, verb feature learning, instance interactor, and interaction semantic representation (ISR). Given an image, we first obtain the feature map through the backbone and then use our dedicated visual encoder to extract contextual global features. The instance interactor injects CLIP spatial information and global features to locate human-object pairs and classify object categories. In the verb feature learning module, the associated verb queries are fed to the verb extraction decoder to obtain fine-grained verb features. The interaction semantic representation module takes the verb features and the interaction features from the encoders as input and extracts the interaction representation.
Figure 3: Structure of the Ho-Pair Encoder. The local encoder is designed to encode local features efficiently, using a $3\times3$ depth-wise convolution followed by two $1\times1$ convolutions for channel blending. The global context former captures comprehensive local-global representations by combining local convolutional layers, an efficient additive attention module, and linear layers.
Figure 4: Structure of the interaction representation decoder. Interaction queries $\bm{Q}_{inter}$, visual-spatial features $\bm{V}_{sp}$, and global visual features $\bm{V}_{G}$ are passed through the multi-head cross-attention block before being fed into the multi-head self-attention block. Then, the outputs are concatenated with the verb features $\bm{V}_{verb}$ before being fed into the feed-forward network.
Figure 5: Visualization of the HOI detection results. From left to right, column 1: HOI prediction results; column 2: attention maps from the Verb Extraction Decoder; column 3: attention maps from the Interaction Representation Decoder. Images are sampled from the UV test set of the HICO-DET dataset.
...and 1 more figure
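
The Figure 3 caption mentions an efficient additive attention module without giving its formulation. The sketch below shows one common form of additive attention from the efficient-transformer literature, as an assumption about what such a module might look like rather than the paper's definition: each token receives a single scalar score, the scores pool the queries into one global query, and that global query interacts element-wise with every key. The learned projections are omitted, and `w_a`, `additive_attention`, and the toy tokens are hypothetical names and values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(queries, keys, w_a):
    """Additive attention over n tokens of dimension d (projections omitted).

    Cost is linear in n because each token gets one scalar score
    instead of a full row of pairwise dot products.
    """
    d = len(w_a)
    n = len(queries)
    # One scalar score per token from a scoring vector w_a (assumed learned).
    scores = [sum(q_j * w_j for q_j, w_j in zip(q, w_a)) / math.sqrt(d)
              for q in queries]
    alpha = softmax(scores)
    # Pool all queries into a single global query vector.
    global_q = [sum(alpha[i] * queries[i][j] for i in range(n)) for j in range(d)]
    # Element-wise interaction with each key, plus a residual to the query.
    out = [[global_q[j] * k[j] + q[j] for j in range(d)]
           for q, k in zip(queries, keys)]
    return out, alpha

# Toy input: three 2-d tokens used as both queries and keys (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, alpha = additive_attention(tokens, tokens, w_a=[0.5, -0.5])
```

Compared with dot-product self-attention, this form avoids the n-by-n attention matrix, which is the usual motivation for calling such a module efficient.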
