Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition

Yuchen Zhou, Linkai Liu, Chao Gou

TL;DR

This work tackles interaction-oriented visual attention by introducing the Zero-shot Interaction-oriented Attention (ZeroIA) task and the Interactive Gaze (IG) dataset, the first gaze dataset focused on human-object interactions (HOI). It proposes the Interactive Attention (IA) model, which uses interaction-oriented prompts and adapters to form knowledge prototypes, then processes HOI scenes through cognitive blocks to predict an interaction heatmap $m_{IA}$, supervised by real gaze $m_H$. The authors further show that incorporating interaction-oriented attention into HOI training, using either real attention or IA-generated pseudo-labels, improves HOI recognition and interpretability across multiple models, including gains on rare HOI cases. Overall, the work demonstrates a bidirectional link between human attention and HOI understanding, with practical implications for human-AI collaboration and interpretable action reasoning.

Abstract

Most existing attention prediction research focuses on salient instances like humans and objects. However, the more complex interaction-oriented attention, arising from human observers' comprehension of interactions between instances, remains largely unexplored. This is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap, we first collect a novel gaze fixation dataset named IG, comprising 530,000 fixation points across 740 diverse interaction categories, capturing visual attention during human observers' cognitive processing of interactions. Second, we introduce the zero-shot interaction-oriented attention prediction task, ZeroIA, which challenges models to predict visual cues for interactions not encountered during training. Third, we present the Interactive Attention model, IA, designed to emulate human observers' cognitive processes to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both ZeroIA and fully supervised settings. Lastly, we apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential to enhance the performance and interpretability of existing state-of-the-art HOI models by incorporating real human attention data from IG and attention labels generated by IA.
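
The ZeroIA setting described above amounts to evaluating on interaction categories that are held out of training. A minimal sketch of such a category-disjoint split follows. The record layout (`image_id`, `interaction`), the function name `zero_shot_split`, and the default unseen fraction are illustrative assumptions, not the split protocol actually used for IG.

```python
import random


def zero_shot_split(samples, unseen_fraction=0.2, seed=0):
    """Split samples so that test interaction categories never occur in training.

    `samples` is a list of dicts with an "interaction" key (hypothetical layout).
    Returns (train, test, unseen_categories).
    """
    # Collect and shuffle the distinct interaction categories reproducibly.
    categories = sorted({s["interaction"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(categories)

    # Hold out at least one category as "unseen".
    n_unseen = max(1, round(len(categories) * unseen_fraction))
    unseen = set(categories[:n_unseen])

    # Route each sample by whether its category is held out.
    train = [s for s in samples if s["interaction"] not in unseen]
    test = [s for s in samples if s["interaction"] in unseen]
    return train, test, unseen


if __name__ == "__main__":
    labels = ["ride bicycle", "hold cup", "ride bicycle", "kick ball", "read book"]
    data = [{"image_id": i, "interaction": c} for i, c in enumerate(labels)]
    train, test, unseen = zero_shot_split(data, unseen_fraction=0.25, seed=1)
    print("unseen categories:", sorted(unseen))
    print("train size:", len(train), "test size:", len(test))
```

Splitting by category rather than by image is what makes the evaluation zero-shot: a model can only succeed on the test set by transferring knowledge from seen interactions.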

Paper Structure (14 sections, 12 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Previous attention prediction models have traditionally focused on the instance level, primarily emphasizing foreground humans and objects. In contrast, our proposed interaction-oriented attention aims to capture subtler and more fine-grained visual cues associated with actions, such as body parts (row 1), human-object contact (row 2), and scene context (row 3). This poses a more intricate and cognitively demanding task for the research community.
  • Figure 2: Our proposed IG is the first interaction-centric gaze fixation dataset, comprising 530K fixation points across 740 interaction categories.
  • Figure 3: The overall architecture of our Interactive Attention (IA). Inspired by the HOI cognitive process of human observers, IA is divided into three phases: empirical knowledge representation, cognitive goal modeling, and progressive understanding. First, a set of interaction-oriented prompts activates and leverages the robust knowledge representation capability of CLIP. Second, positional and visual adapters are introduced to acquire scene-adaptive human, object, and interaction Knowledge Prototypes (KPs) along with visual features of the HOI scene. Third, guided by these KPs, IA progressively comprehends the scene, starting with an instance-level understanding of humans and objects and then deepening its insight into their interactions. The decoder generates predicted attention maps, supervised by the real attention maps of human observers using $L_{H2IA}$.
  • Figure 4: We incorporate aligned attention into the existing HOI model training pipeline using two strategies: supervision with a limited amount of real attention from human observers, and supervision with a large number of attention pseudo-labels generated by our proposed IA.
  • Figure 5: Qualitative comparison of interaction-oriented attention prediction under the ZeroIA setting.
  • ...and 2 more figures
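
The supervision mentioned in the Figure 3 caption, where predicted attention maps are compared against real observer attention maps, can be illustrated with a small sketch. The exact form of $L_{H2IA}$ is not given in this section, so the code below swaps in a KL divergence between normalized maps, a common choice in saliency prediction, purely as a stand-in. The Gaussian rendering of fixation points, the function names, and the `sigma` value are likewise assumptions.

```python
import math


def fixation_heatmap(fixations, height, width, sigma=1.5):
    """Render (x, y) fixation points as a Gaussian density map that sums to 1."""
    heat = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                heat[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in heat)
    if total > 0:
        heat = [[v / total for v in row] for row in heat]
    return heat


def kl_divergence(target, pred, eps=1e-8):
    """KL(target || pred) over two same-sized maps; a stand-in for L_H2IA."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t_row, p_row in zip(target, pred)
        for t, p in zip(t_row, p_row)
    )


if __name__ == "__main__":
    # Hypothetical observer map from two fixations on an 8x8 grid.
    m_h = fixation_heatmap([(3, 4), (4, 4)], height=8, width=8)
    # A uniform map plays the role of an uninformed prediction.
    m_ia = [[1.0 / 64] * 8 for _ in range(8)]
    print("loss vs. uniform prediction:", kl_divergence(m_h, m_ia))
    print("loss vs. perfect prediction:", kl_divergence(m_h, m_h))
```

The same loss could in principle be applied whether the target map comes from real gaze or from IA-generated pseudo-labels, which is the distinction the Figure 4 caption draws between the two training strategies.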