Mitigating Long-Tail Bias in HOI Detection via Adaptive Diversity Cache

Yuqiu Jiang, Xiaozhen Qiao, Tianyu Mei, Haojian Huang, Yifan Chen, Ye Zheng, Zhe Sun

TL;DR

The paper tackles long-tail bias in HOI detection with the Adaptive Diversity Cache (ADC), a training-free, plug-in module that builds class-specific caches of high-confidence, diverse features during inference and uses frequency-aware capacity allocation to boost rare-interaction predictions without retraining. ADC comprises Confidence-Diversity Joint Cache Selection (CJCS) and Frequency-Aware Cache Adaptation (FACA), which together expand reference representations and augment predictions via an affinity-based retrieval mechanism. On HICO-DET, ADC improves mAP by up to +8.57% on rare categories and +4.39% on the full set, and the gains transfer across multiple baselines, including zero-shot-capable models; results on V-COCO are competitive. Overall, the results indicate that test-time caching and adaptive augmentation can calibrate HOI predictions under long-tail distributions, offering a scalable, training-free approach with potential applicability to other long-tail structured tasks.
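
The frequency-aware capacity allocation mentioned above can be illustrated with a small sketch. This is not the paper's formula: the function name `allocate_capacity`, the log-scaled rarity score, the `max_extra` parameter, and the example class counts are all hypothetical choices made for illustration. The base capacity of 6 simply echoes the value that Figure 5 reports as best-performing.

```python
import math

def allocate_capacity(class_counts, base_capacity=6, max_extra=4):
    """Give rarer classes more cache slots.

    Assumed rule (not taken from the paper): rarity is the min-max
    normalised log-frequency, and each class receives up to `max_extra`
    additional slots in proportion to its rarity.
    """
    log_counts = {c: math.log1p(n) for c, n in class_counts.items()}
    hi = max(log_counts.values())
    span = hi - min(log_counts.values())
    capacities = {}
    for cls, lc in log_counts.items():
        # rarity is 1.0 for the rarest class and 0.0 for the most frequent one
        rarity = 0.0 if span == 0 else (hi - lc) / span
        capacities[cls] = base_capacity + round(max_extra * rarity)
    return capacities

# Hypothetical instance counts for three interaction classes.
counts = {"ride bicycle": 4000, "hold cup": 300, "wash elephant": 5}
caps = allocate_capacity(counts)
```

Under this assumed rule the most frequent class keeps the base capacity, while the rarest class receives the largest cache, so rare interactions can retain more reference features.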

Abstract

Human-Object Interaction (HOI) detection is a fundamental task in computer vision, empowering machines to comprehend human-object relationships in diverse real-world scenarios. Recent advances in vision-language models (VLMs) have significantly improved HOI detection by leveraging rich cross-modal representations. However, most existing VLM-based approaches rely heavily on additional training or prompt tuning, resulting in substantial computational overhead and limited scalability, particularly in long-tailed scenarios where rare interactions are severely underrepresented. In this paper, we propose the Adaptive Diversity Cache (ADC) module, a novel training-free and plug-and-play mechanism designed to mitigate long-tail bias in HOI detection. ADC constructs class-specific caches that accumulate high-confidence and diverse feature representations during inference. The method incorporates frequency-aware cache adaptation that favors rare categories and is designed to enable robust prediction calibration without requiring additional training or fine-tuning. Extensive experiments on the HICO-DET and V-COCO datasets show that ADC consistently improves existing HOI detectors, achieving up to +8.57% mAP gain on rare categories and +4.39% on the full dataset, demonstrating its effectiveness in mitigating long-tail bias while preserving overall performance.
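
To make the idea of a cache that accumulates "high-confidence and diverse" features concrete, here is a minimal sketch of one per-class cache. The abstract does not give the selection rule, so everything specific here is an assumption: the class name `ClassCache`, the confidence threshold, the score `confidence + diversity_weight * (1 - max cosine similarity)`, and the evict-the-lowest-score policy are hypothetical stand-ins for the paper's joint selection.

```python
import numpy as np

class ClassCache:
    """Bounded cache for one HOI class (illustrative, not the paper's code)."""

    def __init__(self, capacity, conf_threshold=0.5, diversity_weight=0.5):
        self.capacity = capacity
        self.conf_threshold = conf_threshold
        self.diversity_weight = diversity_weight
        self.features = []      # unit-norm feature vectors
        self.confidences = []   # matching prediction confidences

    def _scores(self, feats, confs):
        # Assumed joint score: confidence plus a bonus for being far
        # (in cosine similarity) from the nearest other cached feature.
        F = np.stack(feats)
        sim = F @ F.T
        np.fill_diagonal(sim, -np.inf)   # ignore self-similarity
        diversity = 1.0 - sim.max(axis=1)
        return np.asarray(confs) + self.diversity_weight * diversity

    def update(self, feature, confidence):
        if confidence < self.conf_threshold:
            return  # low-confidence predictions never enter the cache
        feature = np.asarray(feature, dtype=float)
        feature = feature / np.linalg.norm(feature)
        feats = self.features + [feature]
        confs = self.confidences + [confidence]
        if len(feats) > self.capacity:
            # Over capacity: drop the entry with the lowest joint score.
            worst = int(np.argmin(self._scores(feats, confs)))
            del feats[worst]
            del confs[worst]
        self.features, self.confidences = feats, confs

cache = ClassCache(capacity=2)
cache.update([1.0, 0.0], 0.90)
cache.update([0.99, 0.14], 0.85)   # near-duplicate of the first entry
cache.update([0.0, 1.0], 0.80)     # less confident but far more diverse
cache.update([1.0, 1.0], 0.10)     # rejected: below the threshold
```

In this toy run the near-duplicate entry is evicted even though its confidence is higher than that of the diverse entry, which is the behaviour a confidence-diversity trade-off is meant to produce.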

Paper Structure

This paper contains 20 sections, 15 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison of three HOI detection approaches. (a) Alignment-based model; (b) Prompt tuning-based model; (c) Our proposed training-free model.
  • Figure 2: Long-tail distribution of the HICO-DET dataset. (a) Distribution of HOI interaction frequencies shows extreme imbalance across different verb-object pairs, with the most frequent interaction having over 4000 instances, while many rare interactions have fewer than 10 instances. (b) Verb distribution for different objects demonstrates a severe imbalance, where dominant verbs account for over 90% of interactions for certain objects.
  • Figure 3: Architecture Overview. Given an image and text prompts, region features are extracted via the image encoder and text features via the text encoder, yielding $\mathbf{f}_{\text{vis}}$ and $\mathbf{f}_{\text{txt}}$. Interaction logits $\boldsymbol{\Phi} = \mathbf{f}_{\text{vis}} \cdot \mathbf{f}_{\text{txt}}^T$ are computed and input to the Adaptive Diversity Cache module. ADC leverages confidence-diversity joint cache construction, adaptive capacity allocation, and dynamic cache augmentation to update feature caches and enhance rare class prediction dynamically. The final HOI logits $\mathbf{logit}_{\text{final}}$ are obtained by fusing base and cache-augmented predictions.
  • Figure 4: t-SNE visualizations of priority queue image features. As more samples are incorporated, features from each class form progressively tighter clusters, demonstrating improved representativeness.
  • Figure 5: Effect of cache capacity on model performance. A cache capacity of 6 achieves the best performance balance across all metrics.
  • ...and 3 more figures
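
The pipeline in the Figure 3 caption (base logits $\boldsymbol{\Phi} = \mathbf{f}_{\text{vis}} \cdot \mathbf{f}_{\text{txt}}^T$ fused with cache-augmented predictions) can be sketched as below. The caption does not specify the affinity or fusion formula, so this sketch borrows the exponential affinity used by Tip-Adapter-style training-free cache adapters as a stand-in. The function name `fuse_logits`, the L2 normalisation of features, and the weights `alpha` and `beta` are assumptions, not values from the paper.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse_logits(f_vis, f_txt, cache_keys, cache_labels, alpha=1.0, beta=5.0):
    """Fuse base logits with cache-derived logits.

    f_vis:        (N, D) region features
    f_txt:        (C, D) text features, one row per interaction class
    cache_keys:   (M, D) cached features
    cache_labels: (M,)   class index of each cached feature
    """
    f_vis, f_txt, cache_keys = map(l2_normalize, (f_vis, f_txt, cache_keys))
    base = f_vis @ f_txt.T                                   # Phi, shape (N, C)
    # Assumed Tip-Adapter-style affinity: close to 1 for near-identical features.
    affinity = np.exp(-beta * (1.0 - f_vis @ cache_keys.T))  # (N, M)
    one_hot = np.eye(f_txt.shape[0])[cache_labels]           # (M, C)
    cache_logits = affinity @ one_hot                        # (N, C)
    return base + alpha * cache_logits

# Toy example: the region feature is equally close to both text features,
# and a cached feature labelled as class 1 breaks the tie.
f_txt = np.array([[1.0, 0.0], [0.0, 1.0]])
f_vis = np.array([[1.0, 1.0]])
cache_keys = np.array([[0.9, 1.0]])
cache_labels = np.array([1])
final = fuse_logits(f_vis, f_txt, cache_keys, cache_labels)
```

Here the base logits alone cannot separate the two classes, and the cached reference feature shifts the fused prediction toward class 1, which is the calibration effect the caption describes for rare classes.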