Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching

Arushi Rai; Adriana Kovashka

Abstract

Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks that require precise temporal grounding. Yet frame-level supervision is hard to obtain: it is expensive to collect from humans and unreliable when produced by other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over selected visual attention maps of tightly related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0% and +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks (Exact, FitnessQA, and ExpertAF), even surpassing closed-source models.

Paper Structure (19 sections, 12 equations, 5 figures, 8 tables)

Introduction
Related Works
Method
Preliminaries
Dual-Pathway Forward Pass
Visual Layer and Head Selection
Head Bipartite Matching
Losses
Experiments
Which attention modules are sensitive to temporal grounding?
Can temporal grounding improve performance independently of visual representation?
Does targeting temporal grounding sensitive modules with our method improve sports coaching performance?
Ablation Studies
Conclusion
Visual Non-Sink Ratio
...and 4 more sections

Figures (5)

Figure 1: Both the generation task (providing feedback on technique) and the verification task (confirming a given feedback statement) require the model to attend to the same keyframes (highlighted in orange) to produce a correct response. We exploit this as a self-supervision signal, enforcing attentional consistency between the two complementary task-views without requiring any frame-level annotations.
Figure 2: Overview of the proposed dual-pathway self-consistency framework. The Generation Pathway and Verification Pathway process the same video through a shared visual encoder and Vid-LLM Decoder ($f_\theta$) with different task prompts. From a selected visual layer (Sec. \ref{sec:analysis}), top-$K$ attention heads are identified via their Visual Non-Sink Ratio \cite{kang2025see} and paired across pathways via Head Bipartite Matching (exact index matching + Hungarian algorithm). The generation pathway's attention maps are regularized by $\mathcal{L}_{entropy}$ for focused attention over specific visual tokens (and frames), while $\mathcal{L}_{consistency}$ enforces cross-pathway attentional agreement between matched heads. $\mathcal{L}_{ce}$ serves as regularization on the generation pathway output.
Figure 3: Visual attention quality score visualization: product of attention sharpness ($1-\text{entropy}$), vision centricity (visual attention sum), and keyframe overlap (AUROC). Both graphs show that layer 8 and head 7 are relatively better at temporal grounding.
Figure 4: Attention redistribution strategies. Left: The initial attention distribution is diffuse and noisy across the entire visual sequence. Right: Proportional redistribution concentrates the attention weights onto specific keyframe tokens while preserving their original relative importance.
Figure 5: Selected frames from the ExpertAF clips in Table \ref{tab:supp_feedback_qual}.
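
The captions of Figures 3 and 4 describe two computations that a short sketch may make concrete: the proportional redistribution of attention onto keyframe tokens, and the attention quality score formed as the product of sharpness, vision centricity, and keyframe overlap. The Python below is an illustrative sketch only, not the authors' implementation. All function names are hypothetical, and several details are assumptions not stated in the captions: entropy is normalized by the log of the number of visual tokens so that sharpness lies in [0, 1], redistribution preserves the total visual attention mass and zeroes non-keyframe tokens, and AUROC is computed by pairwise comparison with ties counted as one half.

```python
import math


def proportional_redistribution(attn, keyframe_mask):
    """Move all visual attention mass onto keyframe tokens.

    Assumption: total mass is preserved and keyframe tokens keep
    their original ratios to one another; other tokens become 0.
    """
    total = sum(attn)
    key_mass = sum(a for a, k in zip(attn, keyframe_mask) if k)
    if key_mass == 0:
        raise ValueError("no attention mass on keyframe tokens")
    scale = total / key_mass
    return [a * scale if k else 0.0 for a, k in zip(attn, keyframe_mask)]


def sharpness(attn):
    """1 - normalized Shannon entropy over visual tokens (needs >= 2 tokens)."""
    total = sum(attn)
    probs = [a / total for a in attn if a > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(attn))


def auroc(scores, labels):
    """Chance that a keyframe token outscores a non-keyframe token.

    Ties count as one half. Requires at least one token of each class.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def quality_score(attn, keyframe_mask):
    """Product of sharpness, vision centricity, and keyframe overlap.

    `attn` holds one head's attention on visual tokens only, so its
    sum is the vision centricity (the rest goes to text tokens).
    """
    return sharpness(attn) * sum(attn) * auroc(attn, keyframe_mask)


# Toy example: five visual tokens, tokens 2 and 3 belong to keyframes.
attn = [0.05, 0.10, 0.30, 0.10, 0.05]
mask = [0, 0, 1, 1, 0]
focused = proportional_redistribution(attn, mask)
```

In this toy example, `focused` places all of the visual attention mass on the two keyframe tokens at their original 3:1 ratio, and `quality_score(focused, mask)` is higher than `quality_score(attn, mask)` because both sharpness and keyframe overlap increase while vision centricity is unchanged.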
