EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection

Qinqian Lei, Bo Wang, Robby T. Tan

TL;DR

EZ-HOI tackles zero-shot HOI detection by adapting a Vision-Language Model (VLM) through guided prompt learning that draws on both LLM-derived HOI descriptions and the visual semantics of a frozen VLM. The framework introduces Unseen Text Prompt Learning (UTPL), which transfers information from related seen classes to unseen-class prompts and uses disparity information from an LLM to distinguish each unseen HOI from its related seen HOI, while deep visual-text prompts and intra-/inter-HOI fusion enhance visual representations. The approach achieves state-of-the-art or competitive performance across multiple zero-shot settings with only 10.35% to 33.95% of the trainable parameters of existing methods. This makes HOI detection more practical when annotated data are limited.

Abstract

Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that align visual encoders with large Vision-Language Models (VLMs) to tap into their extensive knowledge require large, computationally expensive models and encounter training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, due to the absence of unseen-class labels. To address these challenges, we introduce a novel prompt learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain only seen-class labels, fine-tuning VLMs on such datasets tends to optimize learnable prompts for seen classes instead of unseen ones. Therefore, we design prompt learning for unseen classes using information from related seen classes, with LLMs used to highlight the differences between unseen and related seen classes. Quantitative evaluations on benchmark datasets demonstrate that EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters of existing methods. Code is available at https://github.com/ChelsieLei/EZ-HOI.

Paper Structure

This paper contains 28 sections, 18 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Comparison of zero-shot HOI detection paradigms. (a) Methods that align HOI features with fixed VLMs [ning2023hoiclip, Liao_2022_CVPR, cao2024detecting, mao2024clip4hoi]. (b) Prompt learning methods to adapt VLMs for downstream tasks [khattak2023maple, zang2022unified]. (c) Our approach, which adapts VLMs to HOI tasks without compromising VLM generation capabilities. (d) Unseen, seen, and full mAP indicate the performance for unseen-verb, seen-verb, and full sets on the HICO-DET dataset [chao2018learning]. Our EZ-HOI shows superior performance in these categories, with competitive trainable parameters and training epochs.
  • Figure 2: Overview of our EZ-HOI framework. Learnable text prompts capture detailed HOI class information from the LLM. To enhance their generalization ability, we introduce the Unseen Text Prompt Learning (UTPL) module. Meanwhile, learnable visual prompts are guided by a frozen VLM visual encoder. The learnable text and visual prompts are then fed into the text and visual encoders, respectively. Finally, HOI predictions are made by calculating the cosine similarity between the text encoder output and the HOI image features. MHCA denotes multi-head cross-attention.
  • Figure 3: Detailed architecture of Unseen Text Prompt Learning (UTPL). The figure uses the unseen HOI class "hose a dog" in the unseen-verb zero-shot setting as an example. We first use the HOI class text embeddings to identify the seen HOI class most closely related to "hose a dog". After selecting the seen class, we generate an input prompt to obtain disparity information from the LLM. Finally, the unseen learnable prompt learns from the selected seen-class prompt and the disparity information through MHCA.
  • Figure 4: Qualitative comparison with MaPLe [khattak2023maple] for unseen-verb zero-shot HOI detection. The orange bar represents the unseen class prediction and the blue bar represents the seen class prediction.
  • Figure 5: Detailed architecture for HOI feature fusion design. Intra-HOI feature fusion aims to extract HOI features from possible human region and object region features. Inter-HOI feature fusion aims to enhance the HOI features by incorporating the surrounding HOI feature context. "MHSA" refers to multi-head self-attention.
  • ...and 2 more figures
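
Two steps named in the captions of Figures 2 and 3 lend themselves to a small illustration: picking the seen class whose text embedding is closest to an unseen class, and scoring HOI classes by cosine similarity against an HOI image feature. The sketch below is only a toy under stated assumptions, not the authors' implementation. The three-dimensional embeddings, the seen classes "wash a dog" and "ride a horse", and all function names are hypothetical, and the sketch assumes cosine similarity as the closeness measure for seen-class selection, which the captions do not state. The learnable prompts, the LLM disparity information, and the MHCA step are omitted.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_related_seen_class(unseen_emb, seen_embs):
    """Name of the seen class whose text embedding is closest to the unseen one.

    Assumption: closeness is measured by cosine similarity.
    """
    return max(seen_embs, key=lambda name: cosine_similarity(unseen_emb, seen_embs[name]))


def score_hoi_classes(image_feat, class_text_feats):
    """Cosine-similarity score of one HOI image feature against every class."""
    return {name: cosine_similarity(image_feat, feat)
            for name, feat in class_text_feats.items()}


# Hypothetical toy embeddings; real ones would come from the VLM text encoder.
seen_text = {
    "wash a dog": [0.9, 0.1, 0.3],
    "ride a horse": [0.1, 0.9, 0.2],
}
unseen_text = {"hose a dog": [0.8, 0.2, 0.5]}

# Step 1 (Figure 3): select the seen class most related to the unseen class.
related = most_related_seen_class(unseen_text["hose a dog"], seen_text)

# Step 2 (Figure 2): classify an HOI image feature by cosine similarity.
image_feat = [0.7, 0.1, 0.6]  # hypothetical HOI image feature
scores = score_hoi_classes(image_feat, {**seen_text, **unseen_text})
predicted = max(scores, key=scores.get)

print(related, predicted)
```

With these toy vectors, the selected seen class is "wash a dog" and the highest-scoring class for the image feature is "hose a dog". In the framework described above, the selected seen-class prompt would then be combined with LLM disparity information through MHCA rather than used directly.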