
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

Donggoo Kang, Dasol Jeong, Hyunmin Lee, Sangwoo Park, Hasil Park, Sunkyu Kwon, Yeongjoon Kim, Joonki Paik

TL;DR

A novel approach that explicitly uses a VLM as an objective function for the Human-Object Interaction (HOI) detection task (VLM-HOI) via the Image-Text matching technique. The authors believe it is the first use of VLM language abilities for HOI detection.

Abstract

Large Vision Language Models (VLMs) have recently made remarkable progress in bridging two fundamental modalities. A VLM trained on a sufficiently large dataset exhibits a comprehensive understanding of both visual and linguistic information, enabling it to perform diverse tasks. To distill this knowledge accurately, in this paper, we introduce a novel approach that explicitly uses a VLM as an objective function for the Human-Object Interaction (HOI) detection task (VLM-HOI). Specifically, we propose a method that quantifies the similarity of the predicted HOI triplet using the Image-Text matching technique. We represent HOI triplets linguistically to fully utilize the language comprehension of VLMs, which are more suitable than CLIP models due to their localization and object-centric nature. This matching score is used as an objective for contrastive optimization. To our knowledge, this is the first use of VLM language abilities for HOI detection. Experiments demonstrate the effectiveness of our method, achieving state-of-the-art HOI detection accuracy on benchmarks. We believe integrating VLMs into HOI detection represents important progress towards more advanced and interpretable analysis of human-object interactions.
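The pipeline described in the abstract (render HOI triplets as text, score them against the image, and use the scores in a contrastive objective) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the phrase template in `triplet_to_text`, the InfoNCE-style form of `contrastive_loss`, and `toy_itm_score`, which is a trivial word-overlap stand-in for a real VLM image-text matching head. It is not the authors' implementation, and the abstract does not give the actual loss.

```python
import math


def triplet_to_text(human, verb, obj):
    """Render an HOI triplet as a short phrase (hypothetical template)."""
    return f"a {human} {verb} a {obj}"


def toy_itm_score(image_tags, text):
    """Stand-in for a VLM image-text matching head (assumption).

    Returns the fraction of words in `text` that appear in `image_tags`.
    A real system would query a pretrained VLM instead.
    """
    words = text.split()
    return sum(w in image_tags for w in words) / len(words)


def contrastive_loss(pos_scores, neg_scores, temperature=1.0):
    """InfoNCE-style loss (assumed form): -log(sum exp(pos) / sum exp(all)).

    The loss is small when positive triplets score higher than negatives.
    """
    pos = [s / temperature for s in pos_scores]
    neg = [s / temperature for s in neg_scores]
    m = max(pos + neg)  # subtract the max for numerical stability
    pos_sum = sum(math.exp(s - m) for s in pos)
    all_sum = pos_sum + sum(math.exp(s - m) for s in neg)
    return -math.log(pos_sum / all_sum)


# Toy "image" described by a set of tags (hypothetical data).
image_tags = {"a", "person", "riding", "bicycle"}
positive = triplet_to_text("person", "riding", "bicycle")
negative = triplet_to_text("person", "eating", "bicycle")

loss = contrastive_loss(
    [toy_itm_score(image_tags, positive)],
    [toy_itm_score(image_tags, negative)],
)
```

In this toy setup the matched triplet scores higher than the mismatched one, so the loss falls below the value it would take if both scored equally. In a trained detector the gradient of such a loss would push predicted triplets towards text that the VLM judges consistent with the image.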

Paper Structure

This paper contains 18 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of Image-Text Matching scores and CLIP similarity for various HOI triplets. While CLIP struggles to capture semantic relationships due to its reliance on simple prompts, VLMs effectively distinguish between positive and negative HOI triplets despite not receiving complete sentences as input.
  • Figure 2: Visualization of attention maps for the BLIP and CLIP models. Both models exhibit accurate localization in the first example. However, in the second example, CLIP focuses solely on individual word locations, failing to capture the broader context of the input sentence.
  • Figure 3: Overview of our proposed VLM-HOI. The network consists of a DETR-based encoder and a query-based transformer decoder. Predicted HOI triplets are matched into positive and negative sets, which are then converted into text form. The image-text matching task of the VLM computes the matching score of these text sets.
  • Figure 4: Comparison of the baseline (MUREN) and the proposed method with example queries. Given a verb query $q^v_i$, we visualize the top two most confident predictions, including their corresponding activation maps and bounding boxes.
  • Figure 5: Analysis of Image-Text Matching Scores on HOI Detection Benchmarks. This figure visualizes the image-text similarity scores computed between visual input and corresponding grounded sentence prompts.