SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

Hamza Tahboub, Weiyan Shi, Gang Hua, Huaizu Jiang

TL;DR

The paper identifies a phenomenon termed social degradation, where standard pre-training of vision-language models degrades visual encoders for nuanced social perception tasks, causing negative transfer during joint learning. It introduces SocialFusion, a minimal fusion approach that freezes a CLIP-based visual encoder and connects it to a language model via a lightweight connector and LoRA adapters, enabling positive transfer across five social tasks. Through linear probing and gradient analysis, the authors show that reduced decodability of social features, rather than gradient conflicts, is the primary driver of degradation. Empirically, SocialFusion achieves positive transfer on all tasks and sets a new state of the art on HaGRIDv2 and PISC, illustrating the viability of unified social understanding without degrading visual representations and underscoring the need for socially-aware pre-training strategies.

Abstract

Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance, and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.
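
The "compatibility" lens mentioned in the abstract can be illustrated with a small sketch. It assumes, as is common in multi-task learning, that a gradient conflict is a negative cosine similarity between two tasks' gradients on shared parameters; the abstract does not state the exact metric the authors use. The function names (gradient_cosine, conflict_matrix) and the toy gradient values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gradient_cosine(g_a, g_b):
    """Cosine similarity between two flattened task gradients."""
    g_a = np.asarray(g_a, dtype=float).ravel()
    g_b = np.asarray(g_b, dtype=float).ravel()
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    # Treat a zero gradient as neither aligned nor conflicting.
    return float(g_a @ g_b / denom) if denom > 0 else 0.0

def conflict_matrix(task_grads):
    """Pairwise cosine similarity for every unordered pair of tasks."""
    names = list(task_grads)
    return {
        (a, b): gradient_cosine(task_grads[a], task_grads[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Toy gradients on a shared 2-parameter layer (made-up values).
toy = {"gaze": [1.0, 0.0], "gesture": [-1.0, 0.0], "expression": [0.0, 2.0]}
pairs = conflict_matrix(toy)

# Assumption: a pair "conflicts" when its cosine similarity is negative.
conflicting = [pair for pair, cos in pairs.items() if cos < 0]
```

In this toy setup only the gaze and gesture gradients point in opposing directions, so that pair alone is flagged; orthogonal gradients are treated as compatible.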

Paper Structure

This paper contains 20 sections, 4 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: SocialFusion achieves positive transfer across all visual social interaction understanding tasks. Popular open-source VLMs suffer from social degradation, where their visual encoders have been degraded by the visual-linguistic pre-training, leading to negative transfer when jointly trained on multiple visual social interaction understanding tasks. SocialFusion addresses this limitation by using a frozen visual encoder and a minimal fusion architecture, achieving positive transfer on all tested task metrics while remaining competitive across the board.
  • Figure 2: Visual examples from each of the visual social interaction understanding tasks. From left to right, the tasks are Ego4D Looking At Me (LAM), AffectNet facial expression recognition, HaGRIDv2 gesture recognition, PISC social situation analysis, and GazeFollow gaze target estimation.
  • Figure 3: Our SocialFusion architecture. The input is an image and an optional set of bounding boxes. The bounding boxes are converted to a binary mask and multiplied by a learned embedding to create the bounding box embedding mask. The image is patchified and processed by the backbone. Then, the connector projects it to the embedding space of the LLM backbone, and the bounding box embeddings are added to this feature map. These updated feature maps are flattened and fed into the LLM alongside task-specific information. For classification tasks, the output logits are converted to tokens and interpreted as text. For gaze estimation, the final image features are linearly projected and upscaled to the desired 2D distribution.
  • Figure 4: Examples of SocialFusion's outputs on the gaze estimation task (GazeFollow). The top row includes examples of input images from the dataset as well as the ground truth annotations (one point per annotator). The bottom row is our model's output heatmap distribution per image.
  • Figure 5: Examples of SocialFusion's successful outputs on each of the classification tasks.
  • ...and 5 more figures
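
The bounding-box pathway described in the Figure 3 caption (binary mask, multiplied by a learned embedding, added to the connector output, then flattened) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the normalized (x0, y0, x1, y1) box format, the patch-grid size, and the rule that a patch is marked when the box overlaps it at all are assumptions, and the learned embedding is replaced by a fixed vector.

```python
import numpy as np

def bbox_embedding_mask(boxes, grid_hw, embedding):
    """Build an (H, W, D) map: `embedding` inside any box, zeros elsewhere.

    boxes: iterable of (x0, y0, x1, y1), normalized to [0, 1] (assumed format).
    """
    h, w = grid_hw
    mask = np.zeros((h, w), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        # Assumption: mark every patch the box overlaps, even partially.
        r0, r1 = int(np.floor(y0 * h)), int(np.ceil(y1 * h))
        c0, c1 = int(np.floor(x0 * w)), int(np.ceil(x1 * w))
        mask[r0:r1, c0:c1] = 1.0
    # Binary mask times the embedding vector, broadcast over the grid.
    return mask[..., None] * embedding

def fuse(projected, boxes, embedding):
    """Add the box embedding map to connector outputs and flatten to tokens."""
    h, w, _ = projected.shape
    fused = projected + bbox_embedding_mask(boxes, (h, w), embedding)
    return fused.reshape(h * w, -1)  # (H*W, D) token sequence for the LLM

# Stand-ins: random "connector output" on a 4x4 patch grid with D = 8.
rng = np.random.default_rng(0)
projected = rng.normal(size=(4, 4, 8)).astype(np.float32)
embedding = np.ones(8, dtype=np.float32)  # stands in for the learned vector

tokens = fuse(projected, [(0.0, 0.0, 0.5, 0.5)], embedding)
```

Here the box covers the top-left quarter of the image, so only the four top-left patch tokens are shifted by the embedding; all other tokens pass through unchanged.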