An Image-like Diffusion Method for Human-Object Interaction Detection

Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Jun Liu

TL;DR

This work reframes human-object interaction detection as an HOI image generation problem and introduces HOI-IDiff, a diffusion-based framework with a customized forward diffusion process and a slice patchification transformer to generate HOI images of shape $H \times W \times 2$, where each vertical slice is a joint distribution over object and interaction probabilities. By initializing the diffusion from detector-derived priors and enforcing distribution-consistent forward steps, the method produces high-quality HOI images that enable accurate HOI triplet predictions. The approach achieves state-of-the-art results on HICO-DET and V-COCO, with ablations validating the importance of diffusion customization, slice-based architecture, and the joint formulation of object categories and interactions. This work demonstrates the practical impact of viewing HOI detection through the lens of image generation, leveraging diffusion models to resolve indeterminacy in HOI predictions.

Abstract

Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.

Paper Structure

This paper contains 12 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of the HOI image (of shape $H \times W \times 2$) formed for the human-object pair in (a), in which the human is watching and pushing the box. The figure shows the case $H = 5$ and $W = 6$. A black pixel represents the value 0, and a white pixel represents the value 1. As shown, the HOI image $I^{hoi}$ in (d) is formed as the product of the vector $v^{obj}$ in (b) and the matrix $m^{int}$ in (c), where $v^{obj}$ (of shape $H$) represents the object classification result and $m^{int}$ (of shape $W \times 2$) represents the interaction prediction result.
  • Figure 2: Illustration of our HOI image diffusion process. As indicated by the red arrows from right to left, the forward HOI image diffusion process gradually diffuses the ground-truth HOI image $I^{hoi}_0$ towards $I^{hoi}_K$ (i.e., $I^{hoi}_0 \rightarrow \dots \rightarrow I^{hoi}_{K-1} \rightarrow I^{hoi}_{K}$). Conversely, as shown by the green arrows from left to right, in the reverse HOI image diffusion process, conditioned on the appearance feature $f_a$, the diffusion model $\theta$ is guided to progressively reconstruct a desired high-quality HOI image $\hat{I}^{hoi}_0$ from $I^{hoi}_K$ (i.e., $I^{hoi}_K \rightarrow \hat{I}^{hoi}_{K-1} \rightarrow \dots \rightarrow \hat{I}^{hoi}_0$).
  • Figure 3: Visualization of the HOI image diffusion process.
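
The construction described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the "product" of $v^{obj}$ and $m^{int}$ is an outer product, so that $I^{hoi}[h, w, c] = v^{obj}[h] \cdot m^{int}[w][c]$. The function name `build_hoi_image`, the chosen class indices, and the values placed in the two channels are hypothetical placeholders; the captions do not state what the two channels encode.

```python
def build_hoi_image(v_obj, m_int):
    """Return a nested list of shape H x W x 2 with entries v_obj[h] * m_int[w][c].

    Assumption: the caption's "product" is read as an outer product.
    """
    if any(len(row) != 2 for row in m_int):
        raise ValueError("m_int must have shape W x 2")
    return [[[v * x for x in row] for row in m_int] for v in v_obj]


H, W = 5, 6  # the sizes used in Figure 1

# Hypothetical one-hot object classification result: class index 2 is the object.
v_obj = [0.0] * H
v_obj[2] = 1.0

# Hypothetical interaction result with two active interactions
# (the pair in Figure 1 has two: watching and pushing).
# Filling both channels with 1.0 is a placeholder choice.
m_int = [[0.0, 0.0] for _ in range(W)]
m_int[1] = [1.0, 1.0]
m_int[4] = [1.0, 1.0]

hoi_image = build_hoi_image(v_obj, m_int)
shape = (len(hoi_image), len(hoi_image[0]), len(hoi_image[0][0]))
print(shape)
```

Under this reading, only the row of the HOI image that corresponds to the predicted object class is non-zero, and that row is a copy of $m^{int}$.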