ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition

Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah

TL;DR

This work tackles both background and foreground bias in action recognition with ALBAR, an adversarial training framework that operates within a single 3D encoder. It applies an adversarial loss to a static clip (one frame repeated, so the input carries no motion), an entropy maximization term that pushes the static clip's class probabilities toward uniform, and a gradient penalty that stabilizes training, all combined in a weighted objective with the standard cross-entropy loss. The method achieves state-of-the-art debiasing on the SCUBA/SCUFO benchmarks and improves combined debiasing performance on HMDB51 by over 12% absolute. The authors also identify background leakage in the existing UCF101 bias evaluation protocol and address it with finer-grained actor segmentation; the method additionally shows transferability to downstream video understanding tasks. These results suggest ALBAR can provide robust, fair action recognition in realistic, biased settings and can complement existing debiasing augmentations.

Abstract

Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both foreground and background bias. We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to the sampled static clip (where all the frames are the same) and aims to make its class probabilities uniform using a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss for regularization against the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% absolute on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 protocol for bias evaluation which provides a shortcut to predict actions and does not provide an accurate measure of the debiasing capability of a model. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches. Project Page: https://joefioresi718.github.io/ALBAR_webpage/
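The objective described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the loss weights `w_ent` and `w_gp`, and the exact form of the entropy term are hypothetical. The subtraction form of the adversarial term follows the Figure 1 caption. The gradient penalty needs automatic differentiation, so it is passed in here as a precomputed number.

```python
import math
import random


def make_static_clip(clip, rng=random):
    """Pick one random frame and repeat it to the clip's original length."""
    frame = clip[rng.randrange(len(clip))]
    return [frame] * len(clip)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy of a probability vector against an integer class label."""
    return -math.log(probs[label])


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def albar_style_loss(motion_logits, static_logits, label,
                     grad_penalty=0.0, w_ent=1.0, w_gp=1.0):
    """Weighted objective; w_ent and w_gp are hypothetical weights."""
    p = softmax(motion_logits)      # prediction for the motion clip
    p_bar = softmax(static_logits)  # prediction for the static clip
    # Adversarial term: motion-clip CE minus static-clip CE.
    adv = cross_entropy(p, label) - cross_entropy(p_bar, label)
    # Entropy maximization, written as a loss (assumed form: negative entropy).
    ent = -entropy(p_bar)
    # grad_penalty is assumed to be computed elsewhere with autodiff.
    return adv + w_ent * ent + w_gp * grad_penalty
```

In this sketch, a static clip whose prediction is uniform yields a lower loss than one whose prediction is confidently correct, so the model is discouraged from recognizing the action from a single repeated frame.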

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Given video clip $\mathbf{x}^{(i)}_{t}$, we sample a random frame and stack it to create static clip $\mathbf{\bar{x}}^{(i)}_{\bar{t}}$. Both clips are passed through encoder $\mathcal{F}$ to generate prediction vectors $\mathbf{p}^{(i)}_t$ and $\mathbf{\bar{p}}^{(i)}_{\bar{t}}$. The adversarial loss (Eq. \ref{eq:adv}) is computed by taking the cross-entropy of the motion clip prediction $\mathbf{p}^{(i)}_{t}$ and subtracting the cross-entropy of the static clip prediction $\mathbf{\bar{p}}^{(i)}_{\bar{t}}$. This static prediction is encouraged to be uncertain by the entropy loss (Eq. \ref{eq:entropy}), and the gradients related to the prediction (shown in red) are encouraged to be lower for more stable training by the gradient penalty loss (Eq. \ref{eq:gradpen}).
  • Figure 2: Example clip from the UCF101-SCUBA-Sinusoid protocol, corresponding to a video from the class "Skiing". (a) shows the frames from the previous protocol, where snow is visible in the background. Our improved protocol (b) uses tight segmentation masks to eliminate the background.
  • Figure 3: Qualitative examples from the HMDB51 test set showing the baseline model choosing an incorrect action label due to spatial context. Our method correctly chooses the action label in each. These pixel-level attributions are plotted using integrated gradients (Sundararajan et al., 2017).
  • Figure 4: Example SCUBA/SCUFO/ConflFG frames from HMDB51. SCUBA evaluates background bias while SCUFO samples a single frame, evaluating foreground bias. ConflFG adds a static distractor foreground to a SCUBA video, evaluating both bias types simultaneously. Contrasted accuracy requires predicting the correct label on a SCUBA video AND predicting the incorrect action on its paired SCUFO video. More information can be found in StillMix (Li et al., 2023).
  • Figure 5: Example clips from UCF101-SCUBA-Places365 and UCF101-SCUBA-VQGAN protocols. (a) shows an example video from the class "Fencing" from the previous protocol. Our improved protocol (b) uses tight segmentation masks to eliminate background information. Likewise, (c) shows an example clip from the class "Golf Swing", and (d) shows the improved segmented version.
  • ...and 1 more figure
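
The contrasted accuracy rule stated in the Figure 4 caption can be illustrated with a short sketch. It assumes each SCUBA video has exactly one paired SCUFO video and that predictions are given as integer class labels; the function name and argument layout are hypothetical and are not taken from the StillMix or ALBAR code.

```python
def contrasted_accuracy(labels, scuba_preds, scufo_preds):
    """Fraction of pairs that are correct on SCUBA and incorrect on SCUFO.

    labels[i] is the true action of pair i, scuba_preds[i] the prediction on
    the SCUBA video, and scufo_preds[i] the prediction on its paired SCUFO
    video (one-to-one pairing is an assumption of this sketch).
    """
    if not (len(labels) == len(scuba_preds) == len(scufo_preds)):
        raise ValueError("all three sequences must have the same length")
    if not labels:
        return 0.0
    hits = sum(
        1
        for y, pred_bg, pred_fg in zip(labels, scuba_preds, scufo_preds)
        # Count the pair only if the motion video is classified correctly
        # AND the single-frame (static) video is NOT classified as the action.
        if pred_bg == y and pred_fg != y
    )
    return hits / len(labels)


labels = [0, 1, 2, 1]
scuba_preds = [0, 1, 2, 0]   # last pair is wrong on SCUBA
scufo_preds = [0, 2, 1, 3]   # first pair is "correct" on SCUFO, so it fails
score = contrasted_accuracy(labels, scuba_preds, scufo_preds)  # 0.5
```

A model that recognizes the action from the static foreground alone is penalized under this metric even when its SCUBA prediction is right, which is why the first pair above does not count.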