Bridging the Visual-to-Physical Gap: Physically Aligned Representations for Fall Risk Analysis

Xianqi Zhang

Abstract

Vision-based fall analysis has advanced rapidly, but a key bottleneck remains: visually similar motions can correspond to very different physical outcomes because small differences in contact mechanics and protective responses are hard to infer from appearance alone. Most existing approaches handle this by supervised injury prediction, which depends on reliable injury labels. In practice, such labels are difficult to obtain: video evidence is often ambiguous (occlusion, viewpoint limits), and true injury events are rare and cannot be safely staged, leading to noisy supervision. We address this problem with PHARL (PHysics-aware Alignment Representation Learning), which learns physically meaningful fall representations without requiring clinical outcome labels. PHARL regularizes motion embeddings with two complementary constraints: (1) trajectory-level temporal consistency for stable representation learning, and (2) multi-class physics alignment, where simulation-derived contact outcomes shape embedding geometry. By pairing video windows with temporally aligned simulation descriptors, PHARL captures local impact-relevant dynamics while keeping inference purely feed-forward. Experiments on four public datasets show that PHARL consistently improves risk-aligned representation quality over visual-only baselines while maintaining strong fall-detection performance. Notably, PHARL also exhibits zero-shot ordinality: an interpretable severity structure (Head > Trunk > Supported) emerges without explicit ordinal supervision.

Paper Structure (27 sections, 8 equations, 4 figures, 3 tables)

Introduction
Related Work
Fall Detection and Outcome Analysis
Contrastive Representation Learning for Human Motion
Physics-Informed Learning for Human Dynamics
Method
Problem Formulation
Overview of PHARL Framework
Motion-Level Temporal Consistency
Physics-Level Outcome Consistency
Outcome Structure as Weak Supervision
Outcome Extraction and Denoising
Representation Learning Objectives
Training and Inference Workflow
Experiments
...and 12 more sections

Figures (4)

Figure 1: Overview of PHARL. Stage 1 (training only): RGB videos are processed offline to reconstruct motion and run physics simulation, producing window-level contact outcomes (Supported, Trunk, Head). Stage 2: PHARL applies two complementary constraints: trajectory-consistent temporal positives and physics-level contact structure across trajectories. Physics structure is used both for denominator masking in the trajectory loss and for auxiliary same-class attraction among contact windows. Stage 3: An encoder is trained with a composite objective (Eq. (3)) that improves contact-aware geometry without adding an outcome prediction head. During inference, PHARL uses only the feed-forward RGB encoder, with no physics simulation.
Figure 2: Kernel density estimates of embedding distributions along the post-hoc severity axis (defined by the Supported and Head class centroids). Baseline methods show substantial class overlap, whereas PHARL exhibits a clearer staircase-like separation. This pattern indicates improved alignment between latent geometry and physics-consistent impact outcomes. The projection is used for analysis only and is not a prediction output.
Figure 3: Category-wise mean projection scores along the post-hoc severity axis across contact outcomes. PHARL preserves the expected physical ordering (Head $\succ$ Trunk $\succ$ Supported) with a larger margin than motion-only baselines, indicating stronger ordinal organization in the learned representation. These projection scores are descriptive diagnostics only, not predicted severity outputs.
Figure 4: Cross-video neighborhood consistency across physics contact categories. For each split, a class-balanced retrieval database is built, queries include all windows per class, and cosine nearest neighbors are retrieved after excluding same-video samples. Bars report row-normalized diagonal consistency (Supported@Supported, Trunk@Trunk, Head@Head) with $k=10$. PHARL consistently achieves higher same-class consistency, indicating stronger cross-video physical structure in the learned representation.
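
The two diagnostics described in the captions of Figures 2 to 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `severity_projection` and `knn_class_consistency`, the synthetic embeddings, and the choice of anchoring the axis at the Supported centroid are all hypothetical. The sketch also omits the class-balanced retrieval database mentioned for Figure 4 and assumes every query has at least k cross-video candidates.

```python
import numpy as np


def severity_projection(emb, labels, low="Supported", high="Head"):
    """Scalar projection of each embedding onto the axis from the `low`
    class centroid to the `high` class centroid (analysis only)."""
    labels = np.asarray(labels)
    c_low = emb[labels == low].mean(axis=0)
    c_high = emb[labels == high].mean(axis=0)
    axis = c_high - c_low
    axis = axis / np.linalg.norm(axis)  # unit-length severity axis
    # Assumption: projections are measured from the `low` centroid.
    return (emb - c_low) @ axis


def knn_class_consistency(emb, labels, video_ids, k=10):
    """Per-class fraction of cosine nearest neighbors that share the query's
    contact class, after excluding windows from the query's own video."""
    labels = np.asarray(labels)
    video_ids = np.asarray(video_ids)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarity
    # Mask same-video pairs; this also removes each query from its own results.
    sim[video_ids[:, None] == video_ids[None, :]] = -np.inf
    nn = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar windows
    match = labels[nn] == labels[:, None]
    # Mean match rate per query class equals the diagonal of the
    # row-normalized class-by-class neighbor matrix.
    return {str(c): float(match[labels == c].mean()) for c in np.unique(labels)}


# Synthetic stand-in for encoder outputs: three well-separated clusters,
# 20 windows per class, 2 windows per (hypothetical) video.
rng = np.random.default_rng(0)
centers = {"Supported": [1.0, 0.0, 0.0], "Trunk": [0.0, 1.0, 0.0], "Head": [0.0, 0.0, 1.0]}
demo_labels = np.repeat(list(centers), 20)
demo_emb = np.array([centers[c] for c in demo_labels]) + 0.05 * rng.normal(size=(60, 3))
demo_videos = np.arange(60) // 2

proj = severity_projection(demo_emb, demo_labels)
class_means = {c: float(proj[demo_labels == c].mean()) for c in centers}
consistency = knn_class_consistency(demo_emb, demo_labels, demo_videos, k=10)
print(class_means)
print(consistency)
```

On real embeddings, `class_means` corresponds to the category-wise mean projection scores of Figure 3 and `consistency` to the diagonal entries reported in Figure 4. Both are descriptive diagnostics computed after training and play no role in inference.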
