Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Ellie Zhou, Jihoon Chung, Olga Russakovsky

TL;DR

The paper analyzes background bias in action recognition across classification models, contrastive text-image learners (CLIP, SigLIP), and video LLMs, finding pervasive background reliance in all. It proposes mitigation for classification models through segmented human input and multi-branch architectures, and demonstrates that prompt design, especially automated prompt tuning, can steer VLLMs toward human-focused reasoning. Key findings include a maximum 3.78% reduction in background bias for segmentation-based methods and up to a 9.85% SBErr reduction via automated prompting in VLLMs. The work highlights the trade-off between removing background cues and maintaining accuracy on context-rich data, and suggests automated prompt tuning as a promising direction for robust, bias-aware video understanding.

Abstract

Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLMs), and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.

Paper Structure

This paper contains 21 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: (a) Effect of model size on InternVL3. Increased model capacity improves SHAcc only. (b) Effect of number of frames. Temporal information increases SHAcc and decreases SBErr. (c) Performance of GPT prompts. Automated prompt tuning better reduces background bias.