Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor

Junwen Chen, Yingcheng Wang, Keiji Yanai

TL;DR

This work addresses the slow convergence and entangled decoding in transformer-based HOI detection by introducing Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA). The SOV framework explicitly separates subject and object localization from verb recognition, while STG injects ground-truth priors into label embeddings and denoising queries to accelerate training. The VLA fuses global VLM knowledge via a Vision Advisor and a Verb-HOI Bridge to align verb and HOI predictions with language priors. Empirically, the approach achieves state-of-the-art results on HICO-DET and V-COCO, with significantly faster convergence (e.g., 15 epochs for Swin-L with VLA) and strong ablations validating each component.

Abstract

Recent transformer-based methods achieve notable gains in the Human-Object Interaction Detection (HOID) task by leveraging the detection capability of DETR and the prior knowledge of Vision-Language Models (VLMs). However, these methods suffer from extended training times and complex optimization due to the entanglement of object detection and HOI recognition during the decoding process. In particular, the query embeddings used to predict both labels and boxes suffer from ambiguous representations, and the gap between the prediction of HOI labels and verb labels is not considered. To address these challenges, we introduce SOV-STG-VLA with three key components: Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA). Our SOV decoders disentangle object detection and verb recognition with a novel interaction region representation. The STG denoising strategy learns label embeddings with ground-truth information to guide both training and inference. Our SOV-STG achieves fast convergence and high accuracy and builds a foundation for the VLA to incorporate the prior knowledge of the VLM. We introduce a vision advisor decoder to fuse the interaction region information with the VLM's vision knowledge, and a Verb-HOI prediction bridge to promote interaction representation learning. Our VLA notably improves SOV-STG and achieves SOTA performance with one-sixth of the training epochs required by recent SOTA methods. Code and models are available at https://github.com/cjw2021/SOV-STG-VLA

Paper Structure (12 sections, 7 equations, 7 figures, 6 tables)

This paper contains 12 sections, 7 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: End-to-end training pipeline of our SOV-STG. Our SOV framework splits the decoding process into three parts, one for each element of the HOI instance. Our STG training strategy efficiently transfers the ground-truth information to label embeddings through additional denoising queries.
  • Figure 2: Comparison of the training convergence curves of the state-of-the-art methods on the HICO-DET dataset.
  • Figure 3: The inference pipeline of SOV-STG-VLA. SOV-STG consists of three parts: STG label prior initialization, subject and object detection, and verb recognition. The label embeddings learned by our STG training strategy are used to initialize the label queries $\bm{Q}_{ov}$. The subject and object decoders update the learnable anchor boxes $B_s$ and $B_o$ to predict the subject and object, and the verb boxes $B_v$ are generated by our adaptive shifted MBR. SOV-STG-VLA is built on the SOV-STG framework. The Vision Advisor in VLA enriches the verb embeddings $E_v$ with global context information from the feature extractor and the pretrained VLM and with spatial information from the verb box. The V-HOI Bridge then connects the prediction of HOI labels and verb labels.
  • Figure 4: The illustration of the S-O attention module.
  • Figure 5: Illustration of adding noise to a ground-truth HOI instance. The initialization consists of two parts: the object label DN query initialization and the verb label DN query initialization. The final DN query embeddings $\bm{q}^{dn}_{k}$ are obtained by concatenating the object label DN queries $\bm{q}^{\tilde{o}}_{k}$ and the verb label DN queries $\bm{q}^{\tilde{v}}_{k}$.
  • ...and 2 more figures
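
The Figure 3 caption states that the verb boxes $B_v$ are generated from the subject and object boxes by an adaptive shifted MBR. As a rough illustration only, the sketch below computes a plain minimum bounding rectangle of two boxes, reading MBR as "minimum bounding rectangle", which is an assumption here. The adaptive shift is not described in this excerpt and is deliberately left out, and the function name and the `(x_min, y_min, x_max, y_max)` box format are hypothetical choices, not taken from the authors' code.

```python
from typing import Tuple

# Hypothetical box format (an assumption): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def minimum_bounding_rectangle(subject_box: Box, object_box: Box) -> Box:
    """Return the smallest axis-aligned box that encloses both input boxes.

    This is only the plain MBR; the paper's adaptive shift is not modeled.
    """
    sx1, sy1, sx2, sy2 = subject_box
    ox1, oy1, ox2, oy2 = object_box
    return (
        min(sx1, ox1),  # left edge of the union region
        min(sy1, oy1),  # top edge of the union region
        max(sx2, ox2),  # right edge of the union region
        max(sy2, oy2),  # bottom edge of the union region
    )


# Example: a person box and a partially overlapping object box.
verb_box = minimum_bounding_rectangle((10, 20, 50, 80), (40, 60, 90, 100))
print(verb_box)
```

The resulting box covers both the subject and the object, which is one simple way to define an interaction region for verb recognition.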