
From 3D Pose to Prose: Biomechanics-Grounded Vision--Language Coaching

Yuyang Ji, Yixuan Shen, Shengjie Zhu, Yu Kong, Feng Liu

Abstract

We present BioCoach, a biomechanics-grounded vision--language framework for fitness coaching from streaming video. BioCoach fuses visual appearance and 3D skeletal kinematics through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision--biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.

Paper Structure

This paper contains 34 sections, 16 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Comparison with existing methods. Top: prior pixel-only VLM methods provide generic, loosely timed comments. Bottom: BioCoach fuses visual features with 3D skeletal kinematics and a biomechanics module to produce phase-aligned, anatomy-specific, quantitative cues (e.g., shoulder flexion 160$^\circ$–170$^\circ$), yielding more precise and biomechanics-grounded feedback along the same timeline.
  • Figure 2: BioCoach overview. Streaming video is encoded by two backbones: a 3D CNN for visual tokens and a pose extractor for 3D skeletal kinematics. The pipeline has three components: (1) Exercise-Specific DoF Selection uses a lightweight attention head to select the top $K$ biomechanically salient joints; (2) Structured Biomechanical Context builds two representations (individual morphometric context and motion quality context) capturing body measurements, cycles, ranges of motion, and constraint checks; (3) Vision--Biomechanics Conditioned Feedback fuses visual tokens with the morphometric context via cross-attention and prepends the motion-quality context as structured instruction to the LLM. This yields feedback grounded in explicit kinematic evidence rather than pattern matching alone.
  • Figure 3: Motion-Quality Context module. Given the selected joint set and the 3D skeletal kinematics, the module (a) detects repetition cycles and anchors the feedback moment; (b) time-normalizes each cycle and aligns it to a curated reference trajectory; and (c) evaluates biomechanical constraints: stability for static joints and deviation from the reference for dynamic joints. Gray curves denote the reference; blue curves denote the user.
  • Figure 4: Qualitative timeline for a squat exercise. BioCoach produces temporally aligned, biomechanics-grounded cues with consistent phase tracking, while Stream-VLM outputs generic or mistimed feedback inconsistent with the ground-truth annotations.
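The mechanisms named in the captions above — attention-based top-$K$ joint selection (Figure 2) and cycle time-normalization with a deviation-to-reference constraint check (Figure 3) — can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the dot-product scoring head, and the mean-absolute-deviation check are illustrative assumptions standing in for the learned selector and the curated reference comparison.

```python
import numpy as np

def select_top_k_joints(joint_feats, query, k=3):
    """Illustrative DoF selector: score each joint with a dot-product
    attention head against a per-exercise query and keep the top-k.
    joint_feats: (J, D) per-joint features; query: (D,) exercise query.
    """
    scores = joint_feats @ query / np.sqrt(joint_feats.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax attention weights
    top_k = np.argsort(weights)[::-1][:k]         # indices of salient joints
    return np.sort(top_k), weights

def time_normalize(cycle, n=50):
    """Resample one repetition cycle of a joint-angle trajectory to a
    fixed n-sample phase axis, so cycles of different speed align."""
    t_src = np.linspace(0.0, 1.0, len(cycle))
    t_dst = np.linspace(0.0, 1.0, n)
    return np.interp(t_dst, t_src, cycle)

def deviation_to_reference(user_cycle, ref_cycle, n=50):
    """Dynamic-joint constraint check (illustrative): mean absolute
    angular deviation between time-normalized user and reference cycles."""
    u = time_normalize(user_cycle, n)
    r = time_normalize(ref_cycle, n)
    return float(np.abs(u - r).mean())
```

In this sketch the selector is a single frozen-backbone-friendly attention head, and the constraint check reduces each selected dynamic joint to one scalar deviation that can be thresholded or verbalized into a cue.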