Predicting 3D Motion from 2D Video for Behavior-Based VR Biometrics

Mingjun Li, Natasha Kholgade Banerjee, Sean Banerjee

TL;DR

The paper addresses a limitation of behavior-based VR biometric authentication: on-device tracking for headsets and controllers does not capture full-body joint articulation. It works around this by predicting 3D right-controller motion from 2D body joints extracted from external video. The architecture has two parts: a Transformer-based trajectory predictor $M_{traj}$ that maps a $w_{in}$-length sequence of 2D joints to a $w$-length 3D trajectory, and an FCN-based authenticator $M_{auth}$ that classifies users from the predicted trajectory. Using the Miller et al. ball-throwing dataset and six joints extracted with OpenPose, the approach achieves a minimum equal error rate (EER) of $0.025$ and outperforms the Li et al. baseline, with an average EER reduction of $0.039$ across configurations. The results indicate that external video capturing broader body articulation improves VR biometrics, and the authors suggest extending the approach to additional joints and to higher-fidelity 2D-to-3D hand-tracking signals for broader VR tasks.

Abstract

Critical VR applications in domains such as healthcare, education, and finance that use traditional credentials, such as a PIN, password, or multi-factor authentication, risk being compromised if a malicious person acquires the user's credentials or if the user hands their credentials over to an ally. Recently, a number of approaches to user authentication have emerged that use the motions of VR head-mounted displays (HMDs) and hand controllers during user interactions in VR to represent the user's behavior as a VR biometric signature. One fundamental limitation of behavior-based approaches is that current on-device tracking for HMDs and controllers cannot track full-body joint articulation, losing key signature data encapsulated in that articulation. In this paper, we propose an approach that uses 2D body joints, namely the shoulder, elbow, wrist, hip, knee, and ankle, acquired from the right side of the participants using an external 2D camera. Using a Transformer-based deep neural network, our method uses the 2D data of body joints that are not tracked by the VR device to predict past and future 3D tracks of the right controller, augmenting the 3D knowledge available for authentication. Our approach provides a minimum equal error rate (EER) of 0.025 and a maximum EER drop of 0.040 over prior work that uses a single-unit 3D trajectory as the input.
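
Since the results are reported as an equal error rate, a minimal sketch of how an EER can be estimated from match scores may help. It is not the paper's evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely genuine", and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption of this sketch: a higher score means "more likely genuine".
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False acceptance rate: fraction of impostor attempts accepted at each threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # False rejection rate: fraction of genuine attempts rejected at each threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # The EER lies where the two curves cross; use the closest sampled threshold.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Toy scores (hypothetical): one genuine attempt scores below one impostor attempt.
print(equal_error_rate([0.6, 0.7, 0.8, 0.4], [0.1, 0.2, 0.3, 0.5]))  # 0.25
```

Under this definition, an EER of 0.025 corresponds to an operating point at which about 2.5% of impostor attempts are accepted and about 2.5% of genuine attempts are rejected.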

Paper Structure

This paper contains 16 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: We extract 6 body joints, i.e., the right shoulder (solid red), elbow (solid green), wrist (solid blue), hip (hollow red), knee (hollow green), and ankle (hollow blue), using OpenPose from 2D images in the time range $t$ to $t+w_\textrm{in}$ (with length $w_\textrm{in}$). We then feed the image coordinates of the body joints into the forecasting model, which outputs the 3D trajectory sequence from time $t$ to $t+w$ (with length $w$). The predicted 3D trajectory serves as the direct input to the authentication model.
  • Figure 2: Trajectory Prediction Model.
  • Figure 3: Authentication Model.
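
The data flow described in the Figure 1 caption can be sketched in terms of array shapes. This is only an illustration under stated assumptions: `placeholder_predictor` is a random linear stand-in and is not the paper's Transformer, the joint coordinates are random numbers rather than OpenPose output, and the values chosen for `t`, `w_in`, and `w` are arbitrary.

```python
import numpy as np

NUM_JOINTS = 6  # right shoulder, elbow, wrist, hip, knee, ankle (per the caption)

def make_input_window(joints_2d, t, w_in):
    """Take w_in frames starting at t and flatten each frame's (x, y) joint coordinates."""
    window = joints_2d[t:t + w_in]                # shape (w_in, 6, 2)
    return window.reshape(w_in, NUM_JOINTS * 2)   # shape (w_in, 12)

def placeholder_predictor(window, w, rng):
    """Random linear stand-in for the trajectory model (NOT the paper's Transformer)."""
    w_in, d = window.shape
    time_mix = rng.standard_normal((w, w_in)) / w_in  # maps w_in time steps to w time steps
    feat_mix = rng.standard_normal((d, 3)) / d        # maps 12 features to (x, y, z)
    return time_mix @ window @ feat_mix               # shape (w, 3)

rng = np.random.default_rng(0)
video_joints = rng.random((120, NUM_JOINTS, 2))  # hypothetical 120-frame clip of 2D joints
x = make_input_window(video_joints, t=10, w_in=30)
traj = placeholder_predictor(x, w=45, rng=rng)
print(x.shape, traj.shape)  # (30, 12) (45, 3)
```

The point of the sketch is the interface: a window of 2D image coordinates for six joints goes in, and a 3D trajectory with its own length comes out, which the authentication model would then consume.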