SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

Abstract

Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.

Paper Structure (29 sections, 11 equations, 4 figures, 15 tables)

Figures (4)

  • Figure 1: SAEs decompose entangled embeddings for precise intervention. (a) Standard methods operate directly on the dense, entangled CLIP embedding space. (b) Our SEM first projects the embedding into a sparse, disentangled latent space via an SAE. This enables a precise intervention on specific features, resolving the limitations of dense-space manipulation.
  • Figure 2: SAEs Significantly Improve Feature Disentanglement. We plot our Disentanglement Score (higher is better), which measures a profession probe's ability to avoid capturing bias. Standard CLIP embeddings (blue) show low disentanglement, while our SAE latent space (orange) consistently increases the score.
  • Figure 3: Overview of the SEM framework. Our method operates in two stages: (a) Scoring: The CLIP embedding is projected into the SAE latent space. Neurons are then scored for content relevance (\ref{sec:method:scoring:concept}) and bias sensitivity (\ref{sec:method:scoring:bias}) by comparing their activations to pre-computed prompt sets. (b) Steering: The scores are combined into a modulation coefficient $M$ that attenuates bias neurons and boosts content neurons (\ref{sec:method:steering}). The final, debiased embedding is reconstructed from this modulated latent vector.
  • Figure 4: Visualizing Debiasing on Entangled Concepts. (a) A 2D PCA of original CLIP embeddings for 100 professions. Gender clusters ('female', 'male') are clearly separated, but the 'neutral' and 'male' ones incorrectly overlap. (b) Orth-Proj achieves a partial overlap between 'male' and 'female' clusters, but fails to merge the 'neutral' cluster and appears to disrupt the data's underlying structure. (c) SEM$_b$ successfully merges all three clusters ('male', 'female', and 'neutral') into a cohesive distribution with a consistent structure.
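
The two-stage pipeline described in the Figure 3 caption (scoring, then steering) can be illustrated with a small NumPy sketch. The captions do not specify how the content and bias scores are computed or how they are combined into $M$, so everything below is an assumption made for illustration only: a ReLU SAE encoder with a linear decoder, a content score taken as the normalized query activation, a bias score taken as the normalized standard deviation of each neuron across bias-attribute prompts, and the combination rule M = (1 + alpha * content) * (1 - bias). All function names, the parameter alpha, and the random weights are hypothetical and are not taken from the paper.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Project a dense embedding into a sparse latent space (assumed ReLU encoder)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Reconstruct a dense embedding from latent activations (assumed linear decoder)."""
    return W_dec @ z + b_dec

def normalize01(v, eps=1e-8):
    """Scale a non-negative vector so its largest entry is approximately 1."""
    return v / (v.max() + eps)

def modulation_coefficient(z_query, Z_bias, alpha=1.0):
    """Hypothetical scoring rule, not the paper's formula.

    content: how strongly the query activates each neuron.
    bias:    how much each neuron varies across bias-attribute prompts.
    """
    content = normalize01(z_query)
    bias = normalize01(Z_bias.std(axis=0))
    # Boost content neurons, attenuate bias-sensitive neurons.
    return (1.0 + alpha * content) * (1.0 - bias)

def debias(x, Z_bias, W_enc, b_enc, W_dec, b_dec, alpha=1.0):
    """Encode, modulate the latent vector, decode, and re-normalize."""
    z = sae_encode(x, W_enc, b_enc)
    M = modulation_coefficient(z, Z_bias, alpha)
    x_hat = sae_decode(M * z, W_dec, b_dec)
    return x_hat / (np.linalg.norm(x_hat) + 1e-8)

# Toy usage with random stand-ins for SAE weights and CLIP text embeddings.
rng = np.random.default_rng(0)
d, k = 8, 32  # dense dimension, number of SAE neurons (both arbitrary)
W_enc, b_enc = rng.normal(size=(k, d)), np.zeros(k)
W_dec, b_dec = rng.normal(size=(d, k)), np.zeros(d)

x_query = rng.normal(size=d)            # stand-in for a query embedding
bias_prompts = rng.normal(size=(4, d))  # stand-ins for bias-attribute prompts
Z_bias = np.stack([sae_encode(p, W_enc, b_enc) for p in bias_prompts])

x_debiased = debias(x_query, Z_bias, W_enc, b_enc, W_dec, b_dec)
```

Because the modulation is applied per neuron after a ReLU, the resulting map from the original to the debiased embedding is non-linear, which is the property the abstract contrasts with interventions made directly in the dense space. In practice the random arrays would be replaced by a trained SAE and real CLIP text embeddings.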