Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation

Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

TL;DR

This work defines a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Using those rankings, it formulates a knowledge distillation objective that preserves action-centric semantics, trained with a novel curriculum learning procedure that pairs incrementally more challenging views over time and thereby allows smooth adaptation to extreme viewpoint differences.

Abstract

Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe view occlusions. We first define a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Then, using those rankings, we formulate a knowledge distillation objective that preserves action-centric semantics with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks, particularly on views that exhibit severe occlusion.
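The abstract does not spell out how the curriculum picks its view pairs. The following is a minimal sketch of one possible schedule, not the authors' implementation. It assumes views are ranked with 0 as the best (least occluded) view, that training progress is a number in [0, 1], and that the target moves linearly from "one rank better than the source" to "the best view". The function name `curriculum_target_rank` and the linear schedule are hypothetical.

```python
import math


def curriculum_target_rank(source_rank: int, progress: float) -> int:
    """Pick the rank of the distillation target for a source view.

    Assumptions (not from the paper): rank 0 is the best view, `progress`
    is the fraction of training completed, and the schedule is linear.
    """
    if source_rank < 0:
        raise ValueError("source_rank must be non-negative")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if source_rank == 0:
        # The best view has no better target; pairing it with itself is
        # an arbitrary choice made for this sketch.
        return 0
    # Early in training the target is one rank better than the source, so
    # the pair still shares much visual content. Late in training the
    # target is the best view, which is the hardest pairing.
    step = 1 + math.floor(progress * (source_rank - 1))
    return source_rank - step


if __name__ == "__main__":
    for progress in (0.0, 0.5, 1.0):
        print(progress, curriculum_target_rank(3, progress))
```

With four ranked views, a source at rank 3 is paired with rank 2 at the start of training, rank 1 halfway through, and rank 0 at the end.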

Paper Structure

This paper contains 27 sections, 12 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Edited vs. natural procedural video. Top: Whereas edited video switches between close-in shots and wide-body shots to best capture the ongoing action, natural in-the-wild video can instead experience significant object and view occlusions. Bottom: Directly distilling the best view into an impoverished viewpoint has limited utility given the lack of shared visual content. Our curriculum knowledge distillation approach aligns features from source views with an incrementally better target view that still shares significant visual content. As training proceeds, we incorporate target views that better capture the ongoing action, but share less visual similarity with the source view.
  • Figure 2: Approach overview. a) Given an ego-worn camera looking down at the active workspace, we rank each exo camera by its view-alignment with the hand-object interaction region $p_{\text{center}}$ (green). To account for self-occlusion by the camera-wearer, we enforce that views facing the ego-camera (1, 2) are ranked ahead of views behind the ego-camera (3, 4). b) For each feature from a source view (highlighted in blue), we minimize similarity with the synchronous worst-rank view (cross-view negative) and with a feature from the same view demonstrating a different keystep (same-view negative). Our curriculum chooses a positive distillation target (cross-view positive) from incrementally higher-rank views over the course of training.
  • Figure 3: Downstream tasks. a) Our temporal keystep grounding model takes as input an untrimmed video $\mathcal{V}$ and a sequence of keysteps $\mathcal{N}$, and regresses the center timestamp $\hat{c}_{n_i}$ and duration $\hat{d}_{n_i}$ for each narration $n_i$. We jointly optimize with our cross-view/cross-temporal knowledge distillation loss (red). b) We pre-train a keystep recognition model on randomly selected clips from untrimmed videos. We rank the views using our metric and train with our view-contrastive loss that maximizes similarity with the best-exo view.
  • Figure 4: t-SNE of learned video features. We visualize video features learned by our grounding model's knowledge distillation head (blue), best-view video features (green), and features from other synchronized views (red) on an input chunk of video. Our model closely aligns source view features with the best-view features throughout the video, despite the time-varying nature of the 'best view'.
  • Figure 5: Mean IoU difference (Ours - EgoVLPv2) by keystep name and task. We compute mean IoU across all instances and views of each unique keystep in the test set, for both our model and the EgoVLPv2-trained grounding model. We display the signed mean IoU difference between ours and EgoVLPv2 for the top-20 keysteps (left half) and bottom-20 keysteps (right half) with the largest mean IoU difference. We outperform EgoVLPv2 on keystep names that require an unobstructed view of fine-grained actions, even though these are associated with cooking activities (blue), which exhibit the widest viewpoint diversity in our dataset.
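
The view ranking described in the Figure 2a caption can be illustrated with a small geometric sketch. This is an illustration under stated assumptions, not the paper's metric. It assumes that view-alignment is the cosine between an exo camera's optical axis and the direction from that camera to $p_{\text{center}}$, and that an exo camera "faces" the ego camera when it lies in the half-space in front of the ego camera. The function `rank_exo_views`, its arguments, and the camera names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(a):
    norm = math.sqrt(_dot(a, a))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return tuple(x / norm for x in a)


def rank_exo_views(exo_cams, ego_pos, ego_forward, p_center):
    """Return exo camera names ordered best-first.

    exo_cams maps a name to (position, optical_axis), both 3-tuples.
    Assumed scoring (not from the paper): cameras in front of the ego
    camera always rank ahead of cameras behind it; within each group,
    a higher cosine alignment with p_center ranks first.
    """
    ego_fwd = _unit(ego_forward)

    def sort_key(name):
        pos, axis = exo_cams[name]
        alignment = _dot(_unit(axis), _unit(_sub(p_center, pos)))
        in_front = _dot(_sub(pos, ego_pos), ego_fwd) > 0.0
        return (0 if in_front else 1, -alignment)

    return sorted(exo_cams, key=sort_key)


if __name__ == "__main__":
    # Hypothetical scene: ego camera looks along +x toward the workspace.
    cams = {
        "back_aligned": ((-2.0, 0.0, 1.0), (1.0, 0.0, 0.0)),
        "front_oblique": ((2.0, 2.0, 1.0), (-1.0, 0.0, 0.0)),
        "back_oblique": ((-2.0, 2.0, 1.0), (1.0, 0.0, 0.0)),
        "front_aligned": ((2.0, 0.0, 1.0), (-1.0, 0.0, 0.0)),
    }
    print(rank_exo_views(cams, (0.0, 0.0, 1.5), (1.0, 0.0, 0.0), (0.5, 0.0, 1.0)))
```

In this toy scene the camera named `back_aligned` points straight at the interaction region, yet it still ranks below `front_oblique`, because the front/behind constraint is applied before alignment is compared. This mirrors the caption's rule that views facing the ego camera are ranked ahead of views behind it.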