Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning

Yana Wei, Zeen Chi, Chongyu Wang, Yu Wu, Shipeng Yan, Yongfei Liu, Xuming He

TL;DR

This work tackles open-world HOI detection under incremental learning by formulating Incremental Human-Object Interaction Detection (IHOID) and proposing an exemplar-free Incremental Relation Distillation (IRD) framework. IRD decouples object and relation learning and introduces two distillation strategies—Momentum Feature Distillation (MFD) and Concept Feature Distillation (CFD)—backed by a momentum teacher and a dynamic concept-feature dictionary to preserve invariant relation representations across phases. The approach addresses catastrophic forgetting, interaction drift, and zero-shot generalization, and demonstrates superior performance on HICO-DET and V-COCO compared with strong baselines, including zero-shot detectors and generalized incremental methods. The results show improved stability-plasticity balance, robustness to drift, and enhanced zero-shot HOI detection, indicating practical value for adaptive HOI systems in dynamic environments.

Abstract

In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans' ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and by the need to detect zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalizing to zero-shot HOIs. Code is available at https://github.com/weiyana/ContinualHOI.

Paper Structure

This paper contains 57 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Training and evaluation of IHOID. The model learns object-relation pairs incrementally and must detect past and new HOIs, mitigate interaction drift, and recognize zero-shot HOIs.
  • Figure 2: (a) Performance degradation of the SOTA HOI detector PViC in the IHOID setup: The yellow curve shows the incremental training performance of PViC on our partitioned HICO-DET dataset. The red star denotes the performance of PViC under joint training on the identical dataset, which serves as the upper bound for a model trained in the IHOID setup. (b) Demonstration of interaction drift: The statistics show the APs of HOI categories whose relation categories occur in both training phase 1 and phase 2. The APs of these categories decrease noticeably.
  • Figure 3: The pipeline of our relation representation learning framework. At each training phase $t$, the object branch outputs the box pair information $p$ and the global image feature $\mathbf{g}$. These are then fed into the relation branch, where a momentum teacher processes them to produce the reference relation feature ${\mathbf{z}}=f(p,\mathbf{g};\theta_s)$, subsequently stored in the concept-feature dictionary. Concurrently, the current encoder takes the same input and yields $f(p,\mathbf{g};\theta_c^t)$, facilitating the computation of distillation losses $\mathcal{L}_{MFD}$ and $\mathcal{L}_{CFD}$ with ${\mathbf{z}}$ and the invariant relation feature $\mathbf{\bar{z}}$ randomly retrieved from the dictionary, respectively.
  • Figure 4: Performance across learning phases on the HICO-DET and V-COCO benchmarks, reported as overall performance (Overall), robustness to interaction drift (RID), and zero-shot detection performance (Zero-shot).
  • Figure 5: The comparison between the visualization results of baselines and IRD in the 5-phase incremental setting. (a)-(c) depict the results following the 1st learning phase, whereas (d)-(f) illustrate the results after completing the 5th learning phase.
  • ...and 2 more figures
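
The relation-branch pipeline described in the Figure 3 caption can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its released code: the exponential-moving-average rule for the momentum teacher, the momentum value, the mean-squared-error form of both losses, and the names `ema_update`, `ConceptFeatureDictionary`, and `distillation_losses`. Features are plain Python lists standing in for the relation features $f(p,\mathbf{g};\theta)$.

```python
import random


def ema_update(teacher, student, momentum=0.99):
    """Assumed momentum-teacher rule: move teacher weights toward the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class ConceptFeatureDictionary:
    """Hypothetical store mapping a relation label to teacher features seen for it."""

    def __init__(self):
        self._store = {}

    def add(self, relation, feature):
        # Keep a copy so later changes to the caller's list do not alter the entry.
        self._store.setdefault(relation, []).append(list(feature))

    def sample(self, relation, rng=random):
        # Randomly retrieve one stored feature for this relation (z-bar in Figure 3).
        return rng.choice(self._store[relation])


def distillation_losses(student_feat, teacher_feat, relation, dictionary, rng=random):
    """Return (L_MFD, L_CFD) under the assumed MSE form of both losses."""
    # MFD: match the current encoder's feature to the momentum teacher's feature z.
    l_mfd = mse(student_feat, teacher_feat)
    # CFD: match it to a feature stored for the same relation in the dictionary.
    z_bar = dictionary.sample(relation, rng)
    l_cfd = mse(student_feat, z_bar)
    return l_mfd, l_cfd
```

In this sketch the dictionary is keyed by relation only, so a feature stored from one HOI combination (for example a relation paired with one object) can be retrieved while training on another combination with the same relation, which is the sense in which the stored features act as invariant references.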