Flow Snapshot Neurons in Action: Deep Neural Networks Generalize to Biological Motion Perception

Shuangpeng Han; Ziyu Wang; Mengmi Zhang

TL;DR

The Motion Perceiver (MP) relies solely on patch-level optical flows from video clips as input. Across 24 biological motion perception (BMP) conditions rendered as point-light displays, it outperforms all existing AI models, with a maximum improvement of 29% in top-1 action recognition accuracy.

Abstract

Biological motion perception (BMP) refers to humans' ability to perceive and recognize the actions of living beings solely from their motion patterns, sometimes as minimal as those depicted on point-light displays. While humans excel at these tasks without any prior training, current AI models struggle with poor generalization performance. To close this research gap, we propose the Motion Perceiver (MP). MP solely relies on patch-level optical flows from video clips as inputs. During training, it learns prototypical flow snapshots through a competitive binding mechanism and integrates invariant motion representations to predict action labels for the given video. During inference, we evaluate the generalization ability of all AI models and humans on 62,656 video stimuli spanning 24 BMP conditions using point-light displays in neuroscience. Remarkably, MP outperforms all existing AI models with a maximum improvement of 29% in top-1 action recognition accuracy on these conditions. Moreover, we benchmark all AI models in point-light displays of two standard video datasets in computer vision. MP also demonstrates superior performance in these cases. More interestingly, via psychophysics experiments, we found that MP recognizes biological movements in a way that aligns with human behaviors. Our data and code are available at https://github.com/ZhangLab-DeepNeuroCogLab/MotionPerceiver.
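
The abstract states that MP takes patch-level optical flows as its only input, and the Figure 2 caption adds that these flows are computed on DINO feature maps, but neither spells out the computation. The sketch below shows one plausible reading and should not be taken as the authors' implementation. It assumes hard nearest-neighbour matching by cosine similarity between two frames, uses random arrays in place of real DINO features, and the function name `patch_flow` is hypothetical.

```python
import numpy as np

def patch_flow(feat_a, feat_b):
    """Hard-matching patch flow between two (H, W, C) feature maps.

    Assumption: each patch in feat_a is matched to its most similar patch
    in feat_b. Returns an (H, W, 2) array of (dy, dx) displacements,
    measured in patch units.
    """
    h, w, c = feat_a.shape
    a = feat_a.reshape(-1, c)
    b = feat_b.reshape(-1, c)
    # L2-normalise so that the dot product equals cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                       # (H*W, H*W) similarity matrix
    best = sim.argmax(axis=1)           # best match in feat_b per patch
    by, bx = np.divmod(best, w)         # matched (row, col)
    ay, ax = np.divmod(np.arange(h * w), w)  # reference (row, col)
    return np.stack([by - ay, bx - ax], axis=1).reshape(h, w, 2)

# Toy check: shift a random feature map by 1 row and 2 columns.
rng = np.random.default_rng(0)
frame_a = rng.normal(size=(8, 8, 32))   # stand-in for DINO patch features
frame_b = np.roll(frame_a, shift=(1, 2), axis=(0, 1))
flow = patch_flow(frame_a, frame_b)
```

In the toy check, patches that do not wrap around the border recover the applied shift of one row and two columns. A soft (similarity-weighted) match would be a natural alternative to the hard `argmax` used here.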

Paper Structure (32 sections, 6 equations, 13 figures, 8 tables)


Introduction
Our Proposed Motion Perceiver (MP)
Patch-level Optical Flow
Flow Snapshot Neurons
Motion Invariant Neurons
Multi-scale Feature Fusion and Training
Experiments
Our Biological Motion Perception (BMP) Dataset with Human Behavioral Data
Video Action Recognition Datasets and Baselines in Computer Vision
Results
Our model achieves human-level performance without task-specific retraining
Comparisons among AI models in BMP tasks and standard computer vision datasets
Ablation studies reveal key components in our model
Discussion
Biological Motion Perception (BMP) Dataset
...and 17 more sections

Figures (13)

Figure 1: Humans excel at biological motion perception (BMP) tasks with zero training, while current AI models struggle with poor generalization performance. AI models are trained to recognize actions from natural RGB videos and tested using BMP stimuli on point-light displays, which come in two forms: Joint videos, which display only the detected joints of actors in white dots, and Sequential position actor videos (SP), where light points in white are randomly positioned between joints and reallocated to other random positions on the limb in subsequent frames (Sec. \ref{sec:BMP dataset}). Note that skeletons, shown in gray in the example video, are not visible to humans or AI models during testing. The generalization performance of both humans and models is assessed after varying five properties in temporal and visual dimensions. See Appendix, Sec. \ref{sec:apx_example} for example videos.
Figure 2: Architecture of our proposed Motion Perceiver (MP) model. Given a reference patch (yellow or green example patches), MP computes its patch-level optical flow (red arrows, Sec. \ref{sec:patch-OF}) on the feature maps extracted from DINO (Caron et al., 2021). Subsequently, these flows are processed through flow snapshot neurons (Sec. \ref{sec:FSN}) and motion invariant neurons (Sec. \ref{sec:MIN}) in two pathways. Activations from both groups of neurons are then integrated for action classification (Sec. \ref{sec:fusion_train}). Time embeddings (T Emb.) are used in the feature fusion process.
Figure 3: Temporal orders and resolutions matter in generalization performance on RGB and Joint videos. Stimuli encompass RGB and Joint (J) videos. Short forms include R (reversal), S (shuffle), F (frames), and P (points) in Sec. \ref{sec:BMP dataset}. Error bars indicate the standard error of the top-1 accuracy across different action classes.
Figure 4: Our model demonstrates human-like robustness under reduced visual information. Top-1 action recognition accuracy is a function of the number of points (P) in Joint (J) videos. Results from RGB test videos are at the leftmost. The colored shaded region represents the standard error across all action classes.
Figure 5: Both humans and our model can recognize actions in SP videos without local motions. Performance varies depending on the persistence of visual information, with stimuli having 4 and 8 points (P) of the actors (Sec. \ref{sec:BMP dataset}).
...and 8 more figures
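
The Figure 1 caption describes SP videos as point-light stimuli whose points sit at random positions between joints and are reallocated to new random positions on the limbs in subsequent frames. A minimal sketch of that sampling step follows. It assumes 2D joint coordinates, a uniform choice of limb and of position along the limb, and reallocation on every frame; the limb list `LIMBS` and the function names are hypothetical and are not taken from the paper's code.

```python
import numpy as np

# Hypothetical limb list: pairs of joint indices that form a limb segment.
LIMBS = [(0, 1), (1, 2), (3, 4), (4, 5)]

def sp_points(joints, n_points, rng, limbs=LIMBS):
    """Sample point-light positions for one frame.

    joints: (J, 2) array of joint (x, y) coordinates for this frame.
    Returns an (n_points, 2) array; each point lies on a randomly
    chosen limb, at a uniformly random position along that limb.
    """
    limbs = np.asarray(limbs)
    pick = rng.integers(len(limbs), size=n_points)   # limb index per point
    t = rng.uniform(0.0, 1.0, size=(n_points, 1))    # fraction along limb
    start = joints[limbs[pick, 0]]
    end = joints[limbs[pick, 1]]
    return start + t * (end - start)

def sp_video(joint_seq, n_points, seed=0):
    """Resample the points independently for every frame.

    joint_seq: (T, J, 2) array of joint coordinates over T frames.
    Returns a (T, n_points, 2) array of point-light positions.
    """
    rng = np.random.default_rng(seed)
    return np.stack([sp_points(j, n_points, rng) for j in joint_seq])

# Toy usage: 5 frames of a static 6-joint pose, rendered with 4 points.
pose = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]], dtype=float)
video = sp_video(np.repeat(pose[None], 5, axis=0), n_points=4)
```

Because the points are redrawn for each frame, no single point follows a joint over time, which is consistent with the Figure 5 caption describing SP videos as lacking local motions. The stimuli in the paper may also vary how long a point persists before reallocation, which this sketch does not model.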
