Nonverbal Interaction Detection

Jianan Wei, Tianfei Zhou, Yi Yang, Wenguan Wang

TL;DR

Nonverbal interaction understanding is framed as a unified problem that integrates multiple social signals rather than treating them in isolation. The authors introduce the NVI dataset and the NVI-DET task, formalized as the triplet $\langle\text{individual},\text{group},\text{interaction}\rangle$, and propose the dual multi-scale NVI-DEHR hypergraph to capture high-order social relations. The approach delivers state-of-the-art results on NVI-DET and strong generalization to HOI-DET benchmarks, demonstrating effective cross-task transfer. Overall, this work establishes a foundation for holistic social-signal analysis and points to future directions such as temporal dynamics and proximal cues in real-world settings.

Abstract

This work addresses a new challenge of understanding human nonverbal interaction in social contexts. Nonverbal signals pervade virtually every communicative act. Our gestures, facial expressions, postures, gaze, and even physical appearance all convey messages without anything being said. Despite their critical role in social life, nonverbal signals receive very limited attention compared with their linguistic counterparts, and existing solutions typically examine nonverbal cues in isolation. Our study marks the first systematic effort to enhance the interpretation of multifaceted nonverbal signals. First, we contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups, along with 22 atomic-level nonverbal behaviors under five broad interaction types. Second, we establish a new task, NVI-DET, for nonverbal interaction detection, which is formalized as identifying triplets of the form $\langle\text{individual},\text{group},\text{interaction}\rangle$ from images. Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs. Central to the model is a dual multi-scale hypergraph that addresses individual-to-individual and group-to-group correlations across varying scales, facilitating interactional feature learning and ultimately improving interaction prediction. Extensive experiments on NVI show that NVI-DEHR significantly improves various baselines on NVI-DET. It also exhibits leading performance on HOI-DET, confirming its versatility in supporting related tasks and its strong generalization ability. We hope that our study will offer the community new avenues to explore nonverbal signals in more depth.
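To make the triplet formulation concrete, the following is a minimal sketch of how one NVI-DET prediction might be represented and compared against a ground-truth annotation. The class name `NVITriplet`, the `(x1, y1, x2, y2)` box convention, the label string `"gaze"`, and the matching rule (same interaction label, both boxes above an IoU threshold of 0.5, as is common in HOI-DET style evaluation) are all assumptions for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class NVITriplet:
    """Hypothetical container for one <individual, group, interaction> triplet."""
    individual_box: Box
    group_box: Box
    interaction: str   # illustrative label; the dataset defines 22 atomic behaviors
    score: float = 1.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def triplet_matches(pred: NVITriplet, gt: NVITriplet, iou_thr: float = 0.5) -> bool:
    """Assumed matching rule: label agrees and both boxes overlap enough."""
    return (
        pred.interaction == gt.interaction
        and box_iou(pred.individual_box, gt.individual_box) >= iou_thr
        and box_iou(pred.group_box, gt.group_box) >= iou_thr
    )


if __name__ == "__main__":
    pred = NVITriplet((0, 0, 10, 10), (0, 0, 20, 20), "gaze", score=0.9)
    gt = NVITriplet((1, 1, 10, 10), (0, 0, 20, 20), "gaze")
    print(triplet_matches(pred, gt))
```

Under this assumed rule a prediction only counts when the individual, the group, and the interaction label are all correct at once, which is what distinguishes triplet detection from classifying each cue in isolation.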

Paper Structure

This paper contains 17 sections, 9 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Can you read these humans? Nonverbal interaction (e.g., gaze, gesture) forms the cornerstone of our social life, and serves as the basis of our social intelligence.
  • Figure 2: Examples of the NVI dataset, showing that our dataset covers rich nonverbal signals in diverse social scenes. Bounding box annotations of individuals are marked by green rectangles. To enhance demonstration and clarity, red arrows and numerical identifiers are incorporated additionally.
  • Figure 3: Dataset statistics. (a) Nonverbal interaction taxonomy (§\ref{sec:taxonomy}). (b) Distribution of atomic-level nonverbal behaviors (§\ref{sec:statistics}).
  • Figure 4: Overall architecture of the proposed NVI-DEHR model. Given an image, the visual encoder is first applied to extract features, followed by an instance decoder that locates human-object pairs. Next, a dual multi-scale hypergraph is designed to model complex interactions between individuals and social groups via hypergraph convolutions. Lastly, an independent transformer decoder is employed to predict the nonverbal interaction categories for each individual-group pair (§\ref{sec:model}).
  • Figure 5: NVI-DET results on NVI val and test (§\ref{sec:expnvi}).
  • ...and 5 more figures
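
The Figure 4 caption mentions hypergraph convolutions over individuals and social groups. As a rough illustration of what one such layer computes, here is a minimal NumPy sketch of the widely used normalized hypergraph convolution $X' = \sigma(D_v^{-1/2} H W D_e^{-1} H^\top D_v^{-1/2} X \Theta)$. This is the generic formulation, not necessarily the exact operator or the dual multi-scale design used in NVI-DEHR; the function name `hypergraph_conv`, the feature sizes, and the toy incidence matrix are assumptions.

```python
import numpy as np


def _safe_power(d: np.ndarray, power: float) -> np.ndarray:
    """Elementwise d**power, leaving zero-degree entries at zero."""
    out = np.zeros_like(d, dtype=float)
    mask = d > 0
    out[mask] = d[mask] ** power
    return out


def hypergraph_conv(X, H, Theta, edge_weights=None):
    """One generic hypergraph convolution layer with ReLU.

    X:     (n_nodes, d_in) node features, e.g. one row per detected individual
    H:     (n_nodes, n_edges) incidence matrix, H[v, e] = 1 if node v is in hyperedge e
    Theta: (d_in, d_out) learnable projection (random here, for illustration)
    """
    H = np.asarray(H, dtype=float)
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_weights is None else np.asarray(edge_weights, dtype=float)

    dv_inv_sqrt = _safe_power(H @ w, -0.5)       # weighted node degrees, D_v^{-1/2}
    de_inv = _safe_power(H.sum(axis=0), -1.0)    # hyperedge degrees, D_e^{-1}

    left = dv_inv_sqrt[:, None] * H              # D_v^{-1/2} H
    mid = left * (w * de_inv)[None, :]           # ... W D_e^{-1}
    A = mid @ (H.T * dv_inv_sqrt[None, :])       # ... H^T D_v^{-1/2}
    return np.maximum(A @ X @ Theta, 0.0)        # ReLU non-linearity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy scene: 4 individuals, 2 hyperedges (assumed social groups {0,1,2} and {2,3}).
    H = np.array([[1, 0],
                  [1, 0],
                  [1, 1],
                  [0, 1]])
    X = rng.normal(size=(4, 8))
    Theta = rng.normal(size=(8, 16))
    print(hypergraph_conv(X, H, Theta).shape)
```

Because a hyperedge can join any number of nodes, a single layer mixes features among all members of a group at once, which is the sense in which hypergraphs capture high-order relations that pairwise graph edges cannot.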