Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection

Dongkeun Kim, Minsu Cho, Suha Kwak

TL;DR

This work targets fine-grained social interaction detection (NVI-DET) by shifting from holistic, direct group prediction to a part-aware, bottom-up reasoning approach. It introduces part-aware individual embeddings and a group decoder that reason over inter-person relations, using similarity-based association to form groups and predict interactions as triplets $\langle \mathrm{individual}, \mathrm{group}, \mathrm{interaction} \rangle$. The model is trained end-to-end with five losses, including pose-guided part supervision and a group-association loss, using Hungarian matching, which enables accurate localization and interaction classification. On the NVI dataset, the method achieves state-of-the-art results, and on Café, it demonstrates strong frame-wise group activity performance, illustrating the approach’s generality beyond static NVI scenarios.
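The Hungarian matching step mentioned above pairs each ground-truth target with exactly one prediction so that the total matching cost is minimal. The following is a minimal sketch of that idea, not the paper's implementation: the L1 box cost, the function names `l1_box_cost` and `match_predictions`, and the toy boxes are all assumptions made for illustration, and the search is brute force over permutations, which finds the same optimal assignment as the Hungarian algorithm on tiny inputs but does not scale.

```python
from itertools import permutations


def l1_box_cost(pred, gt):
    """Hypothetical matching cost: L1 distance between two (x1, y1, x2, y2) boxes."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match_predictions(preds, gts):
    """Return (pred_index, gt_index) pairs with minimum total cost.

    Brute-force stand-in for the Hungarian algorithm.
    Assumes len(preds) >= len(gts), as in set-prediction detectors.
    """
    best_pairs, best_cost = [], float("inf")
    # Each permutation assigns one distinct prediction to every ground truth.
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_box_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = sorted((p, g) for g, p in enumerate(perm))
    return best_pairs


# Toy data (assumed): three predicted boxes, two ground-truth boxes.
preds = [(0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4), (0.0, 0.6, 0.2, 0.8)]
gts = [(0.1, 0.1, 0.5, 0.4), (0.5, 0.5, 1.0, 0.9)]
pairs = match_predictions(preds, gts)
print(pairs)
```

In this toy case, prediction 1 is matched to ground truth 0 and prediction 0 to ground truth 1, while prediction 2 stays unmatched. In a set-prediction detector, unmatched predictions are typically supervised as background.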

Abstract

Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representations of individuals. Moreover, they directly detect social groups without explicitly modeling the underlying interactions between individuals. These drawbacks limit their ability to capture localized social signals and introduce ambiguity when group configurations should be inferred from social interactions grounded in nuanced cues. In this work, we propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part features and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues, and then infers group configuration by associating individuals via similarity-based reasoning, which considers not only spatial relations but also subtle social cues that signal interactions, leading to more accurate group inference. Experiments on the NVI dataset demonstrate that our method outperforms prior methods, achieving a new state of the art.
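The similarity-based association described in the abstract can be illustrated with a small sketch. This excerpt does not specify the similarity function, the embedding size, or the decision rule, so the cosine similarity, the 0.5 threshold, the names `cosine` and `associate`, and the 2-D toy embeddings below are assumptions chosen only to show the bottom-up idea: an individual joins a group when its embedding is similar enough to that group's embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def associate(individuals, groups, threshold=0.5):
    """Map each group index to the individuals whose similarity exceeds threshold.

    The threshold rule is an assumption, not taken from the paper.
    """
    return {
        g: [i for i, ind in enumerate(individuals) if cosine(ind, grp) > threshold]
        for g, grp in enumerate(groups)
    }


# Toy embeddings (assumed): three individuals and two group queries.
individuals = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
groups = [(1.0, 0.1), (0.1, 1.0)]
members = associate(individuals, groups)
print(members)
```

Here individuals 0 and 1 are assigned to group 0 and individual 2 to group 1. Because membership is decided per individual and per group, one person could in principle belong to several groups, or to none, under this rule.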

Paper Structure

This paper contains 20 sections, 7 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Overall architecture of the proposed model. Given an input image, our model extracts the encoded visual feature using the backbone and the transformer encoder. It then detects individuals and derives part-aware individual features through the individual embedding enhancer. Group queries attend to both the encoded feature maps and part-aware individual embeddings to infer social groups and interaction labels. Finally, triplets are obtained through the association module and NMS.
  • Figure 2: Pose-guided binary mask generation for part supervision.
  • Figure 3: Comparison with MLLMs.
  • Figure 3: Visualizations of the cross-attention map from the individual decoder, individual embedding enhancer, and group decoder. Blue and green bounding boxes indicate individuals and groups, respectively. Predicted interaction labels are shown on the right.
  • Figure 4: Qualitative results of our model on the NVI test-set. The first column shows input images, and the remaining columns visualize the predicted NVI-DET triplets. Blue and green boxes denote individuals and groups, respectively. Predicted interaction labels are presented below, where wrong predictions are highlighted in red.
  • ...and 4 more figures