Poze: Sports Technique Feedback under Data Constraints

Agamdeep Singh, Sujit PB, Mayank Vatsa

TL;DR

Poze is a video processing framework that gives feedback on human motion in the manner of a professional coach. It surpasses state-of-the-art vision-language models on video question answering.

Abstract

Access to expert coaching is essential for developing technique in sports, yet economic barriers place it out of reach for many enthusiasts. To bridge this gap, we introduce Poze, a video processing framework that provides feedback on human motion, emulating the insights of a professional coach. Poze combines pose estimation with sequence comparison and is optimized to function effectively with minimal data. Poze surpasses state-of-the-art vision-language models in video question-answering frameworks, achieving 70% and 196% increases in accuracy over GPT4V and LLaVAv1.6 7b, respectively.

Paper Structure

This paper contains 8 sections, 2 equations, 3 figures.

Figures (3)

  • Figure 1: Pose estimation takes as input a video frame $f_i$ as shown in (a) and returns the 3D pose $p_i$ as in (b).
  • Figure 2: During inference, videos are compared with the ideal technique representation to get attribute labels.
  • Figure 3: Accuracy comparison for modelled attributes.
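
The abstract and the caption of Figure 2 describe comparing a video's pose sequence with an ideal technique representation to obtain attribute labels, but this page does not say which comparison method is used. The sketch below is only an illustration of the general idea, assuming dynamic time warping (DTW) as the sequence comparison and a simple distance threshold for labelling. The function names, the flat-list pose format, and the threshold are all hypothetical and are not taken from the paper.

```python
import math


def frame_distance(pose_a, pose_b):
    """Euclidean distance between two poses given as flat lists of joint coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pose_a, pose_b)))


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two pose sequences.

    Assumption: DTW stands in for the paper's "sequence comparison";
    the actual method may differ.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def label_attribute(query, reference, threshold):
    """Hypothetical labelling rule: compare the DTW cost with a fixed threshold."""
    if dtw_distance(query, reference) <= threshold:
        return "matches reference"
    return "deviates from reference"


# Toy one-coordinate "poses": the same motion at half speed aligns at zero cost.
reference = [[0.0], [1.0], [2.0]]
slow_query = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
print(label_attribute(slow_query, reference, threshold=0.5))
```

DTW is chosen here only because it tolerates differences in execution speed between the query and the reference, which seems relevant when few reference videos are available; any other alignment or embedding-based comparison could fill the same role.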