Improving action classification with brain-inspired deep networks

Aidas Aglinskas, Stefano Anzellotti

TL;DR

The paper investigates how body and background information contribute to action recognition in humans and deep networks, and whether brain-inspired, category-selective architectures can yield more human-like performance. It demonstrates that standard single-stream networks tend to rely on background cues, while humans rely more on body pose; a brain-inspired two-stream architecture with separate body and scene processing improves accuracy and aligns with human performance patterns. Using the HAA500 dataset and a joint loss across body, background, and combined outputs, the study shows significant gains in generalization and more human-like responses, highlighting the value of domain-specific processing for robust action understanding. These findings advance cognitive neuroscience and machine learning by offering a practical, biologically inspired design principle for more accurate and generalizable action-recognition systems.

Abstract

Action recognition is key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) make use of information about the body and information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained using the HAA500 dataset perform almost as accurately on versions of the stimuli from which the body was removed as on versions that show both body and background, but are at chance level for versions of the stimuli from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture patterned after domain specificity in the brain, with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that more closely matches the pattern of accuracy observed in human participants.
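The summary above mentions a joint loss across body, background, and combined outputs. The page does not give its exact form, so the following is only a minimal sketch: it assumes the joint loss is a weighted sum of three softmax cross-entropy terms with equal weights. The function names, the `weights` parameter, and the toy logits are all hypothetical, and the combined logits here are simply the sum of the other two for illustration.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def joint_loss(body_logits, background_logits, combined_logits, target,
               weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-output losses.

    Equal weights are an assumption, not taken from the paper.
    """
    losses = (
        cross_entropy(body_logits, target),
        cross_entropy(background_logits, target),
        cross_entropy(combined_logits, target),
    )
    return sum(w * l for w, l in zip(weights, losses))

# Toy example with 3 action classes; the correct class is index 0.
body = [2.0, 0.1, -1.0]        # hypothetical body-stream output
background = [0.2, 0.3, 0.1]   # hypothetical background-stream output
combined = [2.2, 0.4, -0.9]    # hypothetical: body + background logits
loss = joint_loss(body, background, combined, target=0)
```

Under this assumption, each stream receives its own training signal, so a stream cannot be ignored merely because the other one already predicts the label on the training set.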

Paper Structure

This paper contains 18 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison between a network trained using original frames (Baseline:frames model), human results, and the brain-inspired two-stream network (DomainNet:frames). The Baseline:frames model performs similarly well when tested using original (ORIG) or background-only (BG.) frames, and at chance level when tested using body-only (Body) frames. Conversely, human participants (N=28) perform similarly well when tested using original and body-only stimuli, with lower performance on background-only trials. Our architecture, patterned after domain-specific pathways in the brain, 1) exhibits higher accuracies across all versions of the stimuli (original, background-only, body-only) and 2) performs similarly well when tested using original or body-only frames, as human participants do. Note: To facilitate comparison of results, network accuracy was calculated considering only the categories and answer choices that were available to human participants (50 categories, 5 answer choices).
  • Figure S1: Example stimuli.
  • Figure S2: Example frame and corresponding flow used to train the Baseline:frames+flows model.
  • Figure S3: Example frames and corresponding flows used to train the DomainNet:frames+flows model.
  • Figure S4: Training dynamics for the networks tested. Training loss and training accuracy for both the Baseline model and DomainNet converged in fewer than 20 epochs.
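
The note in the Figure 1 caption says that network accuracy was computed only over the categories and answer choices available to human participants. One plausible way to do this, sketched below, is to take the network's prediction on each trial as the highest-scoring label among the offered choices rather than among all classes. This is an illustration under that assumption, not the authors' code; the function name, the trial format, and the action labels are hypothetical.

```python
def restricted_choice_accuracy(trials):
    """Fraction of trials where the best-scoring offered choice is correct.

    Each trial is (scores, choices, correct): `scores` maps every class label
    to the network's output score, `choices` lists the labels offered on that
    trial, and `correct` is the true label (assumed to be among `choices`).
    """
    hits = 0
    for scores, choices, correct in trials:
        # Ignore classes that were not offered as answer choices.
        prediction = max(choices, key=lambda label: scores[label])
        hits += prediction == correct
    return hits / len(trials)

# Two toy trials with made-up labels and scores.
trials = [
    # "swim" scores highest overall but is not offered, so "jump" is chosen.
    ({"run": 0.1, "jump": 0.5, "swim": 0.9, "golf": 0.2},
     ["run", "jump", "golf"], "jump"),
    # "run" outscores the correct label "jump" among the offered choices.
    ({"run": 0.6, "jump": 0.5, "swim": 0.1, "golf": 0.2},
     ["run", "jump", "swim"], "jump"),
]
accuracy = restricted_choice_accuracy(trials)  # 0.5
```

Scoring this way puts the network and the participants on the same footing: with five answer choices per trial, uniform guessing would give 20% accuracy for both.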