ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
TL;DR
ProfVLM is a compact vision-language model for multi-view proficiency estimation that jointly predicts a proficiency label and generates expert-like feedback from egocentric and exocentric videos. It uses a frozen TimeSformer encoder, an AttentiveGatedProjector to fuse multi-view features, and a LoRA-tuned SmolLM2 language model to produce text, achieving state-of-the-art accuracy with only 5.3M parameters and 6 training epochs on EgoExo4D. The model reframes proficiency estimation as conditional language generation, so each predicted score comes with a transparent, natural-language explanation. The work shows that integrating visual and linguistic reasoning for skill assessment is both viable and efficient: the generated feedback shows strong semantic alignment, and the approach has practical applications in coaching.
Abstract
Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features from a frozen TimeSformer backbone and projects them into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
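
To make the idea of attention-weighted, gated multi-view fusion concrete, the following is a minimal illustrative sketch in plain Python. It is not the paper's AttentiveGatedProjector, whose internals are not described here. The function name `attentive_gated_fusion`, the `query` and `gate_weights` parameters, and the choice to blend the attended vector with a plain mean are all assumptions made for illustration. In a real model these would be learned tensors operating on TimeSformer features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attentive_gated_fusion(view_features, query, gate_weights, gate_bias=0.0):
    """Fuse one feature vector per camera view into a single vector.

    Hypothetical sketch (not the paper's implementation):
      view_features: list of V vectors, each of dimension D
      query:         D-dim vector standing in for a learned attention query
      gate_weights:  D-dim vector standing in for a learned gate
    Returns (fused vector, attention weights per view, scalar gate).
    """
    dim = len(query)
    n_views = len(view_features)
    scale = math.sqrt(dim)

    # Scaled dot-product score of each view against the query.
    scores = [sum(q * f for q, f in zip(query, feat)) / scale
              for feat in view_features]
    attn = softmax(scores)

    # Attention-weighted sum of the views.
    attended = [sum(a * feat[d] for a, feat in zip(attn, view_features))
                for d in range(dim)]

    # Unweighted mean of the views, used as a fallback signal.
    mean = [sum(feat[d] for feat in view_features) / n_views
            for d in range(dim)]

    # Scalar gate in (0, 1) decides how much to trust the attended vector.
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, attended)) + gate_bias)
    fused = [gate * a + (1.0 - gate) * m for a, m in zip(attended, mean)]
    return fused, attn, gate


# Toy usage: one egocentric and one exocentric feature vector.
ego = [1.0, 0.0]
exo = [0.0, 1.0]
fused, attn, gate = attentive_gated_fusion(
    [ego, exo], query=[1.0, 0.0], gate_weights=[0.5, 0.5]
)
print(fused, attn, gate)
```

In this toy setup the query points along the egocentric feature, so that view receives the larger attention weight. In a full pipeline, the fused vector would then be projected into the language model's embedding space, which this sketch omits.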
