Every Shot Counts: Using Exemplars for Repetition Counting in Videos

Saptarshi Sinha, Alexandros Stergiou, Dima Damen

TL;DR

This work proposes an exemplar-based approach to video repetition counting that discovers visual correspondence of video exemplars across repetitions within target videos. The model, ESCounts, is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos.

Abstract

Video repetition counting infers the number of repetitions of recurring actions or motion within a video. We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos. Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos. In training, ESCounts regresses locations of high correspondence to the exemplars within the video. In tandem, our method learns a latent that encodes representations of general repetitive motions, which we use for exemplar-free, zero-shot inference. Extensive experiments over commonly used datasets (RepCount, Countix, and UCFRep) showcase ESCounts obtaining state-of-the-art performance across all three datasets. Detailed ablations further demonstrate the effectiveness of our method.
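As a rough illustration of density-based counting (the Figure 2 caption states that the count is obtained by summing the predicted density map), the sketch below builds a temporal density map and reads the count back from it. The unit-mass Gaussian bumps, the `sigma` value, the mean-squared-error loss, and all function names are assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_density_map(centres, length, sigma=1.0):
    """Build a 1D density map with one unit-mass bump per repetition.

    Assumption: each repetition is marked by a Gaussian centred on a
    single time step; the paper's exact labelling scheme may differ.
    """
    density = [0.0] * length
    for c in centres:
        bump = [math.exp(-0.5 * ((t - c) / sigma) ** 2) for t in range(length)]
        total = sum(bump)
        # Normalise so that every repetition contributes exactly 1 to the sum.
        for t in range(length):
            density[t] += bump[t] / total
    return density


def count_from_density(density):
    """The predicted count is the sum of the density map."""
    return sum(density)


def mse(pred, target):
    """Mean squared error, used here as a stand-in regression loss."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    gt = gaussian_density_map(centres=[3, 10, 17], length=24)
    print(round(count_from_density(gt)))
    print(mse(gt, gt))
```

Because each bump is normalised to unit mass, the sum of the map equals the number of repetitions regardless of how wide the bumps are.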

Paper Structure (19 sections, 10 equations, 11 figures, 14 tables)

This paper contains 19 sections, 10 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: Video repetition counting (VRC) with ESCounts uses exemplars to relate information about the repeating action across the video. We visualise the density map, in which regions of high relevance to the action push-up are highlighted and regions of low relevance are not.
  • Figure 2: ESCounts model overview. Bottom: Video $\mathbf{v}$ is encoded by $\mathcal{E}$ over sliding temporal windows into spatiotemporal latents $\mathbf{z}_{v} \in \mathbb{R}^{M \times C}$. Top Left: Exemplars $\{\mathbf{e}_{s}\}$ are also encoded with $\mathcal{E}$. Top Right: Video latents $\mathbf{z}_{v}$ and exemplar latents $\mathbf{z}_{s}$ are cross-attended by decoder $\mathcal{D}$ over $L$ cross-attention blocks. The resulting $\mathbf{z}_L \in \mathbb{R}^{M \times C}$ are attended over $L'$ window self-attention blocks and projected into the density map $\tilde{\mathbf{d}}$. The decoder $\mathcal{D}$ is trained to minimise the regression error between the predicted $\tilde{\mathbf{d}}$ and ground-truth $\mathbf{d}$ density maps. At inference, the count is obtained by summing $\tilde{\mathbf{d}}$.
  • Figure 3: Cross-Attention block. Video latents $\mathbf{z}_v$ are self-attended and then cross-attended with latents $\mathbf{z}_s$ from each exemplar $s \in \mathcal{S}$ and the learnt latent $\mathbf{z}_0$ with the same weights. The resulting representations are then averaged.
  • Figure 4: Shifted density maps from each video are averaged into $\tilde{\mathbf{d}}$.
  • Figure 5: RepCount, Countix, and UCFRep scatter plots, instances, and density maps. The dotted diagonal denotes correct predictions. We compare ESCounts against TransRAC on RepCount and Context on UCFRep. Action classes and count predictions are shown for each instance. We add the Ground Truth (GT) and Predicted (P) density maps per instance. Pseudo-labels are shown as GT for Countix.
  • ...and 6 more figures
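
The Figure 3 caption describes cross-attending the video latents with each exemplar latent and with the learnt latent using the same weights, then averaging the results. A minimal sketch of that averaging step follows. It is a simplification under stated assumptions: attention is single-head with no learnt projections, and the preceding self-attention, residual connections, and normalisation are omitted. The function names `cross_attend` and `attend_and_average` are hypothetical and do not come from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same tokens.

    Assumption: no learnt query/key/value projections, so the "shared
    weights" are trivially shared across every source.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append([
            sum(w * kv[c] for w, kv in zip(weights, keys_values))
            for c in range(len(q))
        ])
    return out


def attend_and_average(z_v, exemplar_latents, z_0):
    """Cross-attend z_v with every exemplar latent and z_0, then average."""
    sources = list(exemplar_latents) + [z_0]
    outputs = [cross_attend(z_v, src) for src in sources]
    n_tokens, n_channels = len(z_v), len(z_v[0])
    return [
        [sum(o[m][c] for o in outputs) / len(outputs) for c in range(n_channels)]
        for m in range(n_tokens)
    ]


if __name__ == "__main__":
    z_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # M = 3 video tokens, C = 2
    z_s = [[[1.0, 0.0], [0.0, 1.0]]]             # one exemplar with 2 tokens
    z_0 = [[0.5, 0.5]]                           # learnt latent, 1 token
    print(attend_and_average(z_v, z_s, z_0))
```

The output keeps the $M \times C$ shape of the video latents however many exemplars are supplied, which is what allows the number of exemplars to vary. With no exemplars, only the learnt latent is attended, matching the exemplar-free setting described in the abstract.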