Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection
Dongkeun Kim, Minsu Cho, Suha Kwak
TL;DR
This work targets fine-grained social interaction detection (NVI-DET) by shifting from holistic, direct group prediction to a part-aware, bottom-up reasoning approach. It introduces part-aware individual embeddings and a group decoder that reason over inter-person relations, using similarity-based association to form groups and predict interactions as triplets $\langle \mathrm{individual}, \mathrm{group}, \mathrm{interaction} \rangle$. The model is trained end-to-end with Hungarian matching and five losses, including pose-guided part supervision and a group-association loss, to support accurate localization and interaction classification. On the NVI dataset, the method achieves state-of-the-art results, and on Café it shows strong frame-wise group activity performance, which suggests the approach generalizes beyond static NVI scenarios.
Abstract
Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representations of individuals. Moreover, they directly detect social groups without explicitly modeling the underlying interactions between individuals. These drawbacks limit their ability to capture localized social signals and introduce ambiguity when group configurations should be inferred from social interactions grounded in nuanced cues. In this work, we propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part features and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues. It then infers the group configuration by associating individuals via similarity-based reasoning, which considers not only spatial relations but also subtle social cues that signal interactions, leading to more accurate group inference. Experiments on the NVI dataset demonstrate that our method outperforms prior methods, achieving a new state of the art.
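To make the bottom-up idea concrete, the following is a minimal sketch of similarity-based association, not the authors' implementation. It assumes each detected individual is already represented by an embedding vector, and it swaps in plain cosine similarity, a hand-picked threshold, and connected components in place of the learned group decoder and association loss. The names `cosine`, `group_by_similarity`, and `threshold`, as well as the toy embeddings, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def group_by_similarity(embeddings, threshold=0.8):
    """Link every pair whose similarity reaches the threshold, then
    return the connected components as sorted lists of indices.

    Assumption: a fixed threshold stands in for a learned association.
    """
    n = len(embeddings)
    parent = list(range(n))  # union-find forest over individuals

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy embeddings (hypothetical): two interacting pairs and one bystander.
embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [0.0, 1.0, 0.0],
]
groups = group_by_similarity(embeddings)
print(groups)  # [[0, 1], [2, 3], [4]]
```

In this sketch, groups emerge from pairwise relations between individuals rather than being predicted directly, which is the contrast the abstract draws with holistic group detection. In the actual method the pairwise affinities would come from part-aware features rather than fixed vectors.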
