SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models
Hamza Tahboub, Weiyan Shi, Gang Hua, Huaizu Jiang
TL;DR
The paper identifies a phenomenon termed social degradation: standard pre-training of vision-language models degrades their visual encoders for nuanced social perception tasks, causing negative transfer during joint learning. It introduces SocialFusion, a minimal fusion approach that freezes a CLIP-based visual encoder and connects it to a language model via a lightweight connector and LoRA adapters, enabling positive transfer across five social tasks. Through linear probing and gradient analysis, the authors show that reduced decodability of social features, rather than gradient conflict, is the primary driver of degradation. Empirically, SocialFusion achieves positive transfer on all five tasks and sets a new state of the art on HaGRIDv2 and PISC, showing that unified social understanding is viable without degrading visual representations and underscoring the need for socially-aware pre-training strategies.
Abstract
Understanding social interactions from visual cues is a fundamental challenge for socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior further through two lenses: decodability, measured by linear representation probing, and compatibility, measured by gradient conflict analysis. Both play a role in the degradation, especially decodability, which is significantly compromised by the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Unlike existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance, and it achieves performance comparable to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.
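To make the compatibility lens concrete, the following is a minimal sketch of one common way to quantify gradient conflict between tasks: the cosine similarity between per-task gradients taken with respect to shared parameters, where a negative value indicates that the tasks pull those parameters in opposing directions. The abstract does not state which measure the authors use, so the choice of cosine similarity is an assumption here, and the function names, task names, and toy gradient values are all hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # treat a zero gradient as neither aligned nor conflicting
    return dot / (norm1 * norm2)


def pairwise_conflicts(task_grads):
    """Map each unordered task pair to the cosine similarity of its gradients.

    task_grads: dict from a task name to that task's gradient with respect
    to the shared parameters, flattened into a list of floats.
    """
    names = sorted(task_grads)
    return {
        (a, b): cosine_similarity(task_grads[a], task_grads[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical toy gradients for three placeholder tasks (not from the paper).
toy_grads = {
    "task_a": [1.0, 0.0],
    "task_b": [0.0, 1.0],   # orthogonal to task_a: no interaction
    "task_c": [-1.0, 0.0],  # opposite to task_a: direct conflict
}

conflicts = pairwise_conflicts(toy_grads)
for pair, sim in conflicts.items():
    label = "conflict" if sim < 0 else "no conflict"
    print(pair, round(sim, 3), label)
```

In practice the gradient vectors would come from backpropagating each task's loss through the shared visual encoder or connector and flattening the result; the toy lists above only stand in for that step.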
