From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

Shiwei Wu, Chao Zhang, Joya Chen, Tong Xu, Likang Wu, Yao Hu, Enhong Chen

TL;DR

This paper tackles the problem of recognizing contextual social relationships from images, addressing the limitations of detector-based approaches that miss subtle social cues. It introduces ConSoR, which combines Multi-modal Side Adapter Tuning (MSAT) to transfer CLIP semantics into a lightweight backbone, Contextual Interpersonal Reasoning (CIR) to model social properties and contextual cues, and descriptive social prompts with visual-linguistic contrasting to focus on decisive social factors. The approach uses social vocabulary corpora and zero-shot CLIP selection to generate prompts, enabling an explainable reasoning process without extra annotations. Experiments on PISC and PIPA show substantial gains over state-of-the-art methods (e.g., +12.2 mAP on PISC-Fine and +9.8 on PIPA), along with ablations demonstrating the effectiveness of MSAT, CIR, prompt design, and backbone scale for robust, detector-free social relation recognition.

Abstract

People's social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands. This brings unique challenges to recognizing social relationships, requiring understanding and capturing the essence of these contexts from visual appearances. However, current methods of social relationship understanding rely on the basic classification paradigm of detected persons and objects, which fails to understand the comprehensive context and often overlooks decisive social factors, especially subtle visual cues. To highlight the social-aware context and intricate details, we propose a novel approach that recognizes Contextual Social Relationships (ConSoR) from a social cognitive perspective. Specifically, to incorporate social-aware semantics, we build a lightweight adapter upon the frozen CLIP to learn social concepts via our novel multi-modal side adapter tuning mechanism. Further, we construct social-aware descriptive language prompts (e.g., scene, activity, objects, emotions) with social relationships for each image, and then compel ConSoR to concentrate more intensively on the decisive visual social factors via visual-linguistic contrasting. Impressively, ConSoR outperforms previous methods with a 12.2% gain on the People-in-Social-Context (PISC) dataset and a 9.8% increase on the People-in-Photo-Album (PIPA) benchmark. Furthermore, we observe that ConSoR excels at finding critical visual evidence to reveal social relationships.
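
The visual-linguistic contrasting step mentioned in the abstract can be pictured as a CLIP-style classification: a visual feature for a pair of people is compared against one text embedding per relationship prompt, and the most similar prompt gives the predicted relation. The following is a minimal sketch under stated assumptions, not the authors' implementation. The three-dimensional vectors stand in for real encoder outputs, the relation list is a shortened illustrative subset, and the function names and the temperature of 0.07 are hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def contrastive_relation_probs(pair_feature, prompt_features, temperature=0.07):
    """Return a probability per relation prompt from cosine similarity.

    pair_feature: visual feature of one person pair (assumed encoder output).
    prompt_features: one text feature per relation prompt (assumed encoder output).
    temperature: assumed scaling factor; smaller values sharpen the distribution.
    """
    v = l2_normalize(pair_feature)
    logits = []
    for text_feature in prompt_features:
        t = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(v, t))
        logits.append(cosine / temperature)
    # Numerically stable softmax over the relation prompts.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data: hypothetical features, not taken from the paper.
RELATIONS = ["friends", "family", "couple"]
pair_feature = [0.9, 0.1, 0.2]
prompt_features = [
    [0.1, 0.9, 0.0],  # stand-in text feature for the "friends" prompt
    [0.2, 0.1, 0.9],  # stand-in text feature for the "family" prompt
    [1.0, 0.0, 0.1],  # stand-in text feature for the "couple" prompt
]

probs = contrastive_relation_probs(pair_feature, prompt_features)
predicted = RELATIONS[probs.index(max(probs))]
print(predicted)
```

Because the prediction is an argmax over prompt similarities, richer prompt text (scene, activity, objects, emotions) changes what the visual feature is pulled toward during training, which is the intuition behind using descriptive prompts instead of bare class names.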

Paper Structure (21 sections, 14 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: ConSoR excels at identifying decisive visual social cues. The explicit clues (e.g., party scene in (a)) determine the probable intimate relation, while ConSoR aids in uncovering implicit social cues (e.g., the presence of the child and their cuddle interaction), further pinpointing the accurate Couple relation. Additionally, the observed 'transitivity' property in social relations also supports this conclusion. In (b), the visual-linguistic contrasting helps to capture the undetected yet critical contextual social cues (e.g., flowers and intimate stares), which have been overlooked by previous methods.
  • Figure 2: This figure depicts the ConSoR framework, where the frozen visual and linguistic CLIP encoders employ a shared multi-modal side adapter to learn social-aware representations. Then, the contextual interpersonal reasoning module extracts individual-pair features, followed by a CLIP-style visual-linguistic contrasting head for social relationship classification.
  • Figure 3: Illustration of our proposed interpersonal and contextual reasoning modules. The interpersonal reasoning module can model the crucial properties of social relations, such as transitivity and reflexivity. Furthermore, the model attends to social-aware contexts, such as 'holding hands' in the image, through the contextual reasoning module.
  • Figure 4: The pipeline for constructing descriptive contextual social prompts. With social prompts, ConSoR integrates rich linguistic information into visual social relation tasks without requiring additional annotations. This integration further enables the model to identify social-decisive contexts through the following visual-linguistic contrasting classification.
  • Figure 5: Ablation for ConSoR on the PISC-Fine dataset. In Figure 5a, ConR, IntR, and GCF denote Context Reasoning, Interpersonal Reasoning, and Global Context Fusion in the CIR module. In Figure 5b, SC, SA, OC, and E denote the scene category, scene attribute, object category, and emotion corpora, respectively.
  • ...and 1 more figures
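
The prompt-construction pipeline described for Figure 4 can be sketched as follows: for each social vocabulary corpus, keep the terms that score highest against the image, then combine them with the relationship label into one descriptive sentence. This is a rough illustration only. The similarity scores are hard-coded placeholders for what a zero-shot CLIP comparison would produce, and the corpus names, the sentence template, and both function names are assumptions introduced here rather than details from the paper.

```python
def select_top_terms(similarities, k=1):
    """Pick the k vocabulary terms with the highest image-text similarity."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


def build_social_prompt(relation, corpus_scores, k=1):
    """Compose a descriptive prompt from the relation label and selected terms.

    corpus_scores maps a corpus name to {term: similarity}. The scores are
    assumed to come from a zero-shot image-text comparison done elsewhere.
    The sentence template is hypothetical.
    """
    parts = []
    for corpus_name, similarities in corpus_scores.items():
        terms = select_top_terms(similarities, k)
        parts.append(f"{corpus_name}: {', '.join(terms)}")
    return f"The relation between the two people is {relation}. " + "; ".join(parts) + "."


# Placeholder scores for one image; values are invented for illustration.
corpus_scores = {
    "scene": {"wedding": 0.31, "office": 0.12},
    "objects": {"flowers": 0.40, "laptop": 0.10},
    "emotion": {"happy": 0.50, "angry": 0.05},
}

prompt = build_social_prompt("couple", corpus_scores)
print(prompt)
```

Since the selected terms come from similarity ranking rather than human labels, a pipeline of this shape needs no additional annotation, which matches the claim in the Figure 4 caption.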