An Image-like Diffusion Method for Human-Object Interaction Detection
Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Jun Liu
TL;DR
This work reframes human-object interaction (HOI) detection as an HOI image generation problem and introduces HOI-IDiff, a diffusion-based framework with a customized forward diffusion process and a slice patchification transformer. The framework generates HOI images of shape $H \times W \times 2$, where each vertical slice is a joint distribution over object and interaction probabilities. By initializing the diffusion from detector-derived priors and enforcing distribution-consistent forward steps, the method produces high-quality HOI images that enable accurate HOI triplet predictions. The approach achieves state-of-the-art results on HICO-DET and V-COCO, with ablations validating the importance of the diffusion customization, the slice-based architecture, and the joint formulation of object categories and interactions. These results show the practical value of treating HOI detection as image generation and using diffusion models to resolve indeterminacy in HOI predictions.
Abstract
Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across human-object pairs. This indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle this challenging task, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Inspired by the strong generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff, which tackles HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, both specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.
