FairViT: Fair Vision Transformer via Adaptive Masking

Bowei Tian, Ruijie Du, Yanning Shen

TL;DR

This work tackles fairness-accuracy trade-offs in Vision Transformers by introducing FairViT, which integrates adaptive masking of attention across sensitive groups with a distance-based regularizer. The adaptive masking learns group-specific masks and weights to control information flow, while the distance loss leverages a validation-time hyperplane to push predictions toward the correct class and away from competitors. Empirical results on CelebA show that FairViT improves accuracy and fairness (BA, DP, EO) with competitive time costs compared to strong baselines, and ablations confirm the importance of both components. The approach is extendable to other architectures and tasks, offering a practical, scalable pathway to fair and accurate representations in vision systems.

Abstract

Vision Transformer (ViT) has achieved excellent performance and demonstrated promising potential in various computer vision tasks. The wide deployment of ViT in real-world tasks requires a thorough understanding of the societal impact of the model. However, most ViT-based works do not take fairness into account, and it is unclear whether directly applying CNN-oriented debiasing algorithms to ViT is feasible. Moreover, previous works typically sacrifice accuracy for fairness. Therefore, we aim to develop an algorithm that improves accuracy without sacrificing fairness. In this paper, we propose FairViT, a novel accurate and fair ViT framework. To this end, we introduce a novel distance loss and deploy adaptive fairness-aware masks on attention layers, which are updated together with the model parameters. Experimental results show that FairViT achieves better accuracy than the alternatives while remaining competitive in computational efficiency. Furthermore, FairViT achieves appreciable fairness results.

Paper Structure (22 sections, 16 equations, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: An illustration of FairViT. For the forward propagation, we first apply the weights $\varsigma$ to $\mathbf{M}_{l,h}$ and calculate the weighted sum $\widetilde{\mathbf{M}}_{l,h}$, which is used to help the attention mechanism control the information flow. For the backward propagation, we optimize $\mathbf{M}_{l,h,i}$ and $\varsigma_i$. Additionally, we introduce a novel distance loss $L_{dist}$.
  • Figure 2: The split of the dataset in our design. Each part only contains samples from one sensitive group, and each part in one sensitive group contains the same number of images, but the number of images in one part between different sensitive groups does not have to be equal.
  • Figure 3: An illustration of the update process. We ascertain the specific part $i$ to which the training sample belongs, and $\nabla$ refers to the gradient calculation, specified in Equations (7)-(8). The gray blocks signify that the gradients are zero during the backward pass of this training sample.
  • Figure 4: Impact of $G$. Shown is the mean $\pm$ standard deviation of 3 independent runs.
  • Figure 5: The interpretability study of FairViT.
  • ...and 3 more figures
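
The forward pass described in the Figure 1 caption can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`combine_masks`, `masked_attention`) are hypothetical, and applying the combined mask as an elementwise rescaling of the softmax attention is an assumption, since the caption only says the weighted sum assists the attention mechanism. The sketch covers a single layer $l$ and head $h$ with $G$ group-specific masks.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def combine_masks(masks, weights):
    """Weighted sum over the G group masks: M_tilde = sum_i weights[i] * masks[i]."""
    n_rows, n_cols = len(masks[0]), len(masks[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, masks)) for c in range(n_cols)]
        for r in range(n_rows)
    ]

def masked_attention(scores, masks, weights):
    """ASSUMPTION: the combined mask rescales the softmax attention elementwise.
    The source does not state exactly how the mask enters the attention."""
    m_tilde = combine_masks(masks, weights)
    attn = [softmax(row) for row in scores]
    return [
        [a * m for a, m in zip(a_row, m_row)]
        for a_row, m_row in zip(attn, m_tilde)
    ]

# Toy example: 2 tokens, G = 2 masks (values are illustrative, not from the paper).
scores = [[0.0, 0.0], [1.0, 1.0]]
masks = [
    [[1.0, 0.0], [0.0, 1.0]],  # hypothetical mask for part 1
    [[1.0, 1.0], [1.0, 1.0]],  # hypothetical mask for part 2
]
weights = [0.5, 0.5]           # plays the role of the weights varsigma_i

m_tilde = combine_masks(masks, weights)
out = masked_attention(scores, masks, weights)
```

In training, both `masks` and `weights` would be learnable parameters; per the Figure 3 caption, only the entries for the part $i$ that the current sample belongs to would receive nonzero gradients.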