Flow Snapshot Neurons in Action: Deep Neural Networks Generalize to Biological Motion Perception

Shuangpeng Han, Ziyu Wang, Mengmi Zhang

TL;DR

The Motion Perceiver (MP) relies solely on patch-level optical flows from video clips as inputs. On biological motion perception (BMP) stimuli, it outperforms all existing AI models, with a maximum improvement of 29% in top-1 action recognition accuracy.

Abstract

Biological motion perception (BMP) refers to humans' ability to perceive and recognize the actions of living beings solely from their motion patterns, sometimes as minimal as those depicted on point-light displays. While humans excel at these tasks without any prior training, current AI models struggle with poor generalization performance. To close this research gap, we propose the Motion Perceiver (MP). MP solely relies on patch-level optical flows from video clips as inputs. During training, it learns prototypical flow snapshots through a competitive binding mechanism and integrates invariant motion representations to predict action labels for the given video. During inference, we evaluate the generalization ability of all AI models and humans on 62,656 video stimuli spanning 24 BMP conditions using point-light displays in neuroscience. Remarkably, MP outperforms all existing AI models with a maximum improvement of 29% in top-1 action recognition accuracy on these conditions. Moreover, we benchmark all AI models in point-light displays of two standard video datasets in computer vision. MP also demonstrates superior performance in these cases. More interestingly, via psychophysics experiments, we found that MP recognizes biological movements in a way that aligns with human behaviors. Our data and code are available at https://github.com/ZhangLab-DeepNeuroCogLab/MotionPerceiver.

Paper Structure (32 sections, 6 equations, 13 figures, 8 tables)


Figures (13)

  • Figure 1: Humans excel at biological motion perception (BMP) tasks with zero training, while current AI models struggle with poor generalization performance. AI models are trained to recognize actions from natural RGB videos and tested using BMP stimuli on point-light displays, which come in two forms: Joint videos, which display only the detected joints of actors as white dots, and Sequential position actor videos (SP), where white light points are randomly positioned between joints and reallocated to other random positions on the limb in subsequent frames (Sec. \ref{sec:BMP dataset}). Note that skeletons, shown in gray in the example video, are not visible to humans or AI models during testing. The generalization performance of both humans and models is assessed after varying five properties in temporal and visual dimensions. See Appendix, Sec. \ref{sec:apx_example} for example videos.
  • Figure 2: Architecture of our proposed Motion Perceiver (MP) model. Given a reference patch (yellow or green example patches), MP computes its patch-level optical flow (red arrows, Sec. \ref{sec:patch-OF}) on the feature maps extracted from DINO \cite{caron2021emerging}. Subsequently, these flows are processed through flow snapshot neurons (Sec. \ref{sec:FSN}) and motion invariant neurons (Sec. \ref{sec:MIN}) in two pathways. Activations from both groups of neurons are then integrated for action classification (Sec. \ref{sec:fusion_train}). Time embeddings (T Emb.) are used in the feature fusion process.
  • Figure 3: Temporal order and resolution matter for generalization performance on RGB and Joint videos. Stimuli encompass RGB and Joint (J) videos. Abbreviations: R (reversal), S (shuffle), F (frames), and P (points); see Sec. \ref{sec:BMP dataset}. Error bars indicate the standard error of the top-1 accuracy across different action classes.
  • Figure 4: Our model demonstrates human-like robustness under reduced visual information. Top-1 action recognition accuracy is plotted as a function of the number of points (P) in Joint (J) videos. Results on RGB test videos are shown at the far left. The colored shaded region represents the standard error across all action classes.
  • Figure 5: Both humans and our model can recognize actions in SP videos without local motions. Performance varies depending on the persistence of visual information, with stimuli having 4 and 8 points (P) of the actors (Sec. \ref{sec:BMP dataset}).
  • ...and 8 more figures
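
The SP stimuli described in the Figure 1 caption can be illustrated with a short sketch. This is not the authors' stimulus-generation code. It assumes joint coordinates are already available as an array, that a limb is a straight segment between two joints, and that each light point keeps its limb for the whole clip while its position along that limb is redrawn in every frame. The function name `sp_video`, the `limbs` index pairs, and the toy skeleton are hypothetical.

```python
import numpy as np


def sp_video(joint_seq, limbs, n_points, seed=0):
    """Sketch of sequential-position (SP) point-light frames.

    joint_seq: array of shape (T, J, 2), joint coordinates per frame.
    limbs:     list of (i, j) joint-index pairs (hypothetical skeleton layout).
    n_points:  number of light points shown in each frame.
    Returns an array of shape (T, n_points, 2).
    """
    rng = np.random.default_rng(seed)
    joint_seq = np.asarray(joint_seq, dtype=float)
    limbs = np.asarray(limbs)

    # Assumption: each point is assigned to one limb for the whole clip.
    limb_ids = rng.integers(0, len(limbs), size=n_points)
    start_idx, end_idx = limbs[limb_ids, 0], limbs[limb_ids, 1]

    frames = []
    for joints in joint_seq:
        # Redraw the position along the limb in every frame, so a point
        # carries no consistent local motion from one frame to the next.
        t = rng.uniform(0.0, 1.0, size=(n_points, 1))
        frames.append((1.0 - t) * joints[start_idx] + t * joints[end_idx])
    return np.stack(frames)


# Toy usage: a static three-joint "arm" with two limbs, five frames, four points.
joints = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
seq = np.repeat(joints[None], 5, axis=0)
points = sp_video(seq, [(0, 1), (1, 2)], n_points=4, seed=1)
print(points.shape)
```

Because the position along the limb is resampled every frame, the displacement of any single point between frames does not track the limb's true motion, which is the property the Figure 5 caption refers to as "without local motions".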