Disentangled Pre-training for Human-Object Interaction Detection

Zhuolong Li; Xingao Li; Changxing Ding; Xiangmin Xu

TL;DR

DP-HOI tackles data scarcity in HOI detection by disentangling pre-training into object-detection and verb-classification branches that leverage large-scale datasets for each sub-task. The object branch follows DETR for robust object localization, while the verb branch uses reliable person queries from the detection branch to perform verb classification, augmented by verb-wise fusion and extensions to video and caption data with contrastive alignment. This approach yields consistent improvements on HICO-DET and V-COCO, especially in rare and zero-shot settings, and transfers well to existing HOI detectors while avoiding much of the noise of pseudo-labeling. Code and pre-trained weights are publicly available, though pre-training requires substantial GPU memory.

Abstract

Detecting human-object interaction (HOI) has long been limited by the amount of supervised data available. Recent approaches address this issue by pre-training according to pseudo-labels, which align object regions with HOI triplets parsed from image captions. However, pseudo-labeling is tricky and noisy, making HOI pre-training a complex process. Therefore, we propose an efficient disentangled pre-training method for HOI detection (DP-HOI) to address this problem. First, DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers, respectively. Then, we arrange these decoder layers so that the pre-training architecture is consistent with the downstream HOI detection task. This facilitates efficient knowledge transfer. Specifically, the detection decoder identifies reliable human instances in each action recognition dataset image, generates one corresponding query, and feeds it into the interaction decoder for verb classification. Next, we combine the human instance verb predictions in the same image and impose image-level supervision. The DP-HOI structure can be easily adapted to the HOI detection task, enabling effective model parameter initialization. Therefore, it significantly enhances the performance of existing HOI detection models on a broad range of rare categories. The code and pre-trained weights are available at https://github.com/xingaoli/DP-HOI.
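
The image-level supervision step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the fusion rule (an element-wise maximum over per-person verb logits), the binary cross-entropy loss, and the function names fuse_verb_logits and image_level_bce are all assumptions chosen for illustration. The only point is that per-person predictions are combined into one vector before being compared with the image-level action label.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_verb_logits(person_logits):
    """Fuse per-person verb logits into one image-level vector.

    Assumption for illustration: take the element-wise maximum over
    persons, so a verb is "present" if any person performs it.
    """
    if not person_logits:
        raise ValueError("need at least one human instance")
    return [max(column) for column in zip(*person_logits)]


def image_level_bce(fused_logits, labels):
    """Mean binary cross-entropy between fused logits and multi-hot labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for logit, label in zip(fused_logits, labels):
        p = sigmoid(logit)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(labels)


# Toy image: two detected persons, three verb classes (hypothetical values).
person_logits = [
    [2.0, -1.0, -3.0],   # person 1
    [-2.0, 1.5, -2.5],   # person 2
]
image_labels = [1, 1, 0]  # image-level action labels (multi-hot)

fused = fuse_verb_logits(person_logits)
loss = image_level_bce(fused, image_labels)
print(fused, round(loss, 4))
```

Because the label is only available per image, the loss is applied once to the fused vector rather than once per person; other fusion rules (for example a weighted average) would fit the same interface.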

Paper Structure (24 sections, 9 equations, 4 figures, 10 tables)

Introduction
Related Work
  Human-Object Interaction Detection
  Pre-training Methods for Detection Tasks
Methods
  Overview
  The Object Detection Branch
  The Verb Classification Branch
  Extension to Video and Caption Data
  Overall Loss Function
Experiments
  The Pre-training Datasets
  The HOI Detection Datasets
  Implementation Details
  Comparisons with State-of-the-Art Methods
...and 9 more sections

Figures (4)

Figure 1: mAP and convergence curves of CDN-S [zhang2021mining] initialized with DETR weights [carion2020end] pre-trained on MS-COCO [lin2014microsoft] and with our DP-HOI weights, respectively. DN denotes that the denoising strategy [li2022dn] is adopted to speed up convergence. Experiments are conducted on the HICO-DET dataset [chao2018learning].
Figure 2: Overview of our DP-HOI framework. It includes a CNN backbone, a transformer encoder, an object detection branch, and a verb classification branch. The two branches are trained in a disentangled manner, with labeled databases for object detection and action recognition, respectively. Each training image from the action recognition dataset first passes through the detection decoder, which identifies reliable human instances and generates reliable person queries (RPQs) for the interaction decoder. Then, each RPQ is responsible for searching for relevant action cues for the specified human instance. Since we only have image-level action labels, we impose supervision on the fused RPQ predictions.
Figure 3: Visualization of the attention maps in the decoder layers. The two rows represent results for the detection and interaction decoders, respectively.
Figure 4: Visualization of HOI triplets obtained on Flickr30k. Each column shows an image and the HOI triplets obtained for it.
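
The Figure 2 caption says that reliable human instances are identified and turned into reliable person queries (RPQs), but the selection criterion is not given here. A minimal sketch of one plausible rule follows, assuming that "reliable" means a detection labeled as a person whose confidence exceeds a threshold, with at most a fixed number of queries kept per image. The Detection type, the function name, the threshold of 0.9, and the cap of 3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One output slot of the detection decoder (hypothetical structure)."""
    query_index: int  # index of the decoder query that produced this detection
    label: str        # predicted object class
    score: float      # classification confidence in [0, 1]


def select_reliable_person_queries(detections, score_threshold=0.9, max_queries=3):
    """Return query indices of confident person detections, best first.

    Assumption for illustration: reliability is a plain score threshold
    followed by a top-k cut; the paper's actual criterion may differ.
    """
    persons = [
        d for d in detections
        if d.label == "person" and d.score >= score_threshold
    ]
    persons.sort(key=lambda d: d.score, reverse=True)
    return [d.query_index for d in persons[:max_queries]]


# Toy detections for one action-recognition image (hypothetical values).
detections = [
    Detection(query_index=0, label="person", score=0.95),
    Detection(query_index=1, label="dog", score=0.99),
    Detection(query_index=2, label="person", score=0.40),
    Detection(query_index=4, label="person", score=0.98),
]

rpq_indices = select_reliable_person_queries(detections)
print(rpq_indices)  # → [4, 0]
```

Each returned index would then identify one query to pass to the interaction decoder, so that every retained human instance gets its own verb prediction before fusion.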
