Improving action classification with brain-inspired deep networks
Aidas Aglinskas, Stefano Anzellotti
TL;DR
The paper investigates how body and background information contribute to action recognition in humans and deep networks, and whether brain-inspired, category-selective architectures can yield more human-like performance. It demonstrates that standard single-stream networks tend to rely on background cues, while humans rely more on body pose; a brain-inspired two-stream architecture with separate body and scene processing improves accuracy and aligns with human performance patterns. Using the HAA500 dataset and a joint loss across body, background, and combined outputs, the study shows significant gains in generalization and more human-like responses, highlighting the value of domain-specific processing for robust action understanding. These findings advance cognitive neuroscience and machine learning by offering a practical, biologically inspired design principle for more accurate and generalizable action-recognition systems.
Abstract
Action recognition is key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) use information about the body versus information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained on the HAA500 dataset perform almost as accurately on versions of the stimuli from which the body was removed as on versions that show both body and background, but perform at chance level on versions from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture patterned after domain specificity in the brain, with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across the different versions of the stimuli more closely matches the pattern of accuracy observed in human participants.
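
To make the two-stream idea and the joint loss over body, background, and combined outputs more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`linear`, `two_stream_logits`, `joint_loss`), the use of linear heads on pre-extracted features, the concatenation of the two feature vectors for the combined output, and the equal weighting of the three loss terms are all hypothetical choices made for this example.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def linear(features, weights, bias):
    """Dense layer with one row of weights per action class."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def two_stream_logits(body_feats, background_feats, heads):
    """Return (body, background, combined) logits.

    Assumption: each stream has its own classification head, and the
    combined head reads the concatenation of both feature vectors.
    """
    body_logits = linear(body_feats, *heads["body"])
    background_logits = linear(background_feats, *heads["background"])
    combined_logits = linear(body_feats + background_feats, *heads["combined"])
    return body_logits, background_logits, combined_logits


def joint_loss(body_logits, background_logits, combined_logits, label,
               stream_weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three cross-entropy terms.

    Assumption: equal weights by default; the source does not state them.
    """
    w_body, w_background, w_combined = stream_weights
    return (
        w_body * cross_entropy(body_logits, label)
        + w_background * cross_entropy(background_logits, label)
        + w_combined * cross_entropy(combined_logits, label)
    )


# Toy usage: 2 action classes, 2 features per stream, all-zero heads.
# Zero weights give uniform predictions, so the loss equals 3 * log(2).
heads = {
    "body": ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),
    "background": ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),
    "combined": ([[0.0] * 4, [0.0] * 4], [0.0, 0.0]),
}
logits = two_stream_logits([1.0, 2.0], [3.0, 4.0], heads)
loss = joint_loss(*logits, label=1)
```

Penalizing the body-only and background-only outputs separately, in addition to the combined output, is one way to discourage a network from ignoring either source of information when the two are correlated in the training data.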
