HVIS: A Human-like Vision and Inference System for Human Motion Prediction

Kedi Lyu, Haipeng Chen, Zhenguang Liu, Yifang Yin, Yukang Lin, Yingying Jiao

TL;DR

This work tackles human motion prediction (HMP) by emulating human perception and learning in a two-module architecture called HVIS, which consists of a vision module (HVE) and an inference module (HMI). The vision path processes spatial and temporal information separately in a retina-inspired component (RA), then passes the result to a visual cortex-like component (VA) that captures both global and local pose cues. The inference path has two stages: a spontaneous learning network (SLN), which uses a joint-level adversarial generator, and a deliberate learning network (DLN), which uses a memory component and targeted training to focus on hard-to-train joints. The approach reports state-of-the-art results on Human3.6M, CMU MoCap, and G3D, improving on existing methods by 19.8%, 15.7%, and 11.1% respectively, along with strong qualitative predictions. Overall, HVIS suggests that decoupled, hierarchical visual encoding combined with staged learning can effectively model the complex spatio-temporal dynamics of human motion.

Abstract

Grasping the intricacies of human motion, which involve perceiving spatio-temporal dependence and multi-scale effects, is essential for predicting human motion. While humans inherently possess the requisite skills to navigate this issue, it proves to be markedly more challenging for machines to emulate. To bridge the gap, we propose the Human-like Vision and Inference System (HVIS) for human motion prediction, which is designed to emulate human observation and forecast future movements. HVIS comprises two components: the human-like vision encode (HVE) module and the human-like motion inference (HMI) module. The HVE module mimics and refines the human visual process, incorporating a retina-analog component that captures spatial and temporal information separately to avoid unnecessary crosstalk. Additionally, a visual cortex-analog component is designed to hierarchically extract and process complex motion features, focusing on both global and local features of human poses. The HMI module is employed to simulate the multi-stage learning model of the human brain. The spontaneous learning network simulates the neuronal fracture generation process for the adversarial generation of future motions. Subsequently, the deliberate learning network is optimized for hard-to-train joints to prevent misleading learning. Experimental results demonstrate that our method achieves new state-of-the-art performance, significantly outperforming existing methods by 19.8% on Human3.6M, 15.7% on CMU MoCap, and 11.1% on G3D.
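
The abstract does not say how hard-to-train joints are identified, so the following is only an illustrative sketch under a stated assumption: joints are ranked by their mean Euclidean prediction error over the forecast frames, and the top `k` are treated as "hard". The function names `per_joint_error` and `hardest_joints`, the toy data, and the selection rule are hypothetical and not taken from the paper.

```python
import math


def per_joint_error(pred, target):
    """Mean Euclidean error per joint, averaged over frames.

    pred, target: list of frames; each frame is a list of (x, y, z) joints.
    """
    num_frames = len(pred)
    num_joints = len(pred[0])
    errors = [0.0] * num_joints
    for frame_pred, frame_true in zip(pred, target):
        for j, (p, t) in enumerate(zip(frame_pred, frame_true)):
            errors[j] += math.dist(p, t)
    return [e / num_frames for e in errors]


def hardest_joints(errors, k):
    """Indices of the k joints with the largest mean error (assumed criterion)."""
    ranked = sorted(range(len(errors)), key=lambda j: errors[j], reverse=True)
    return ranked[:k]


# Toy data: 2 frames, 3 joints. Values are made up for illustration.
target = [
    [(0, 0, 0), (1, 0, 0), (0, 1, 0)],
    [(0, 0, 0), (1, 0, 0), (0, 1, 0)],
]
pred = [
    [(0, 0, 0), (1, 0, 3), (0, 1, 1)],
    [(0, 0, 0), (1, 4, 0), (0, 1, 1)],
]

errors = per_joint_error(pred, target)
hard = hardest_joints(errors, k=2)
print(errors, hard)
```

In a staged setup like the one described, such a ranking could be used to decide which joints receive extra attention in a second training stage; how HVIS actually makes that choice is not stated here.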

Paper Structure

This paper contains 11 sections, 12 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Problem illustration.
  • Figure 2: HVIS comprises two components: the human-like vision module (HVM) and the human-like inference module (HIM).
  • Figure 3: Visual comparisons on H3.6M dataset (Purchase) and CMU dataset (Soccer). The blue poses are the ground truth.