BEAR: A Video Dataset For Fine-grained Behaviors Recognition Oriented with Action and Environment Factors
Chengyang Hu, Yuduo Chen, Lizhuang Ma
TL;DR
BEAR tackles fine-grained video behavior recognition by decoupling the environment and action factors that define a behavior. It introduces two protocol families, FG-BSE (Fine-grained Behavior with Similar Environments) and FG-BSA (Fine-grained Behavior with Similar Actions), each with sub-protocols that hold either the environment or the action similar across behaviors. The dataset covers diverse, well-controlled scenarios with a broad train/test split and in-the-wild variations, enabling rigorous multi-modal benchmarking. A comprehensive empirical study evaluates the RGB, optical flow, skeleton, and text modalities across multiple models, including a zero-shot text modality with VideoCLIP. The study reveals a distinct role for each modality: RGB primarily captures environment cues, optical flow emphasizes the action and its environmental context, skeleton focuses on the action, and text remains data-limited but becomes environment-aware with leveled prompts. These findings offer practical guidance for feature learning and benchmark design in video understanding, and motivate further dataset and modality research in fine-grained behavior recognition.
Abstract
Behavior recognition is an important task in video representation learning, and learning effective features is essential to it. Recently, researchers have started to study fine-grained behavior recognition, which presents similar behaviors and encourages the model to attend to finer details in order to learn features that distinguish them. However, previous fine-grained behavior datasets control only part of the information to be similar, leading to an unfair and incomplete evaluation of existing works. In this work, we develop a new fine-grained video behavior dataset, named BEAR, which provides fine-grained (i.e., similar) behaviors that focus specifically on the two primary factors defining a behavior: Environment and Action. It includes two fine-grained behavior protocols, Fine-grained Behavior with Similar Environments and Fine-grained Behavior with Similar Actions, as well as multiple sub-protocols covering different scenarios. With this new dataset, we conduct multiple experiments with different behavior recognition models. Our study primarily explores the impact of input modality, a critical element in studying the environment-based and action-based aspects of behavior recognition. Our experimental results yield insights with substantial implications for further research.
