Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos

Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

TL;DR

Switch-a-View tackles automatic view selection in multi-view instructional videos by learning from unlabeled but edited in-the-wild content. It introduces a view-switch detection pretext task trained with pseudo-labels derived from scene boundaries and ego/exo classification, and then repurposes the detector into a view selector using limited best-view labels. The method leverages a multimodal transformer that fuses past frames, past narrations, and the next narration to predict the next view, achieving state-of-the-art results on HowTo100M and Ego-Exo4D, including zero-shot generalization. This weakly supervised approach reduces labeling needs while enabling informative camerawork, with broad implications for automated cinematography and instructional video synthesis.

Abstract

We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled -- but human-edited -- video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages. Project: https://vision.cs.utexas.edu/projects/switch_a_view/.
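
The pretext task described above can be illustrated with a minimal sketch: given a video whose segments have already been pseudo-labeled as egocentric or exocentric, derive a next-view target and a switch/no-switch flag for any query time. Everything here is an assumption made for illustration and is not taken from the paper: the segment format, the names `view_at` and `switch_label`, and the fixed look-ahead `horizon`. The sketch also assumes segments are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical segment format: (start_sec, end_sec, view), view in {"ego", "exo"}.
Segment = Tuple[float, float, str]


def view_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the pseudo-labeled view covering time t, or None if uncovered.

    Assumes segments are sorted by start time and non-overlapping.
    """
    starts = [start for start, _, _ in segments]
    i = bisect_right(starts, t) - 1  # last segment starting at or before t
    if i < 0:
        return None
    start, end, view = segments[i]
    return view if start <= t < end else None


def switch_label(
    segments: List[Segment], t: float, horizon: float
) -> Optional[Tuple[str, bool]]:
    """Return (next_view, is_switch) for query time t, or None if undefined.

    next_view is the pseudo-labeled view at t + horizon; is_switch is True
    when it differs from the view at t. The fixed horizon is an assumption.
    """
    current = view_at(segments, t)
    upcoming = view_at(segments, t + horizon)
    if current is None or upcoming is None:
        return None
    return upcoming, current != upcoming


# Toy pseudo-labels for one edited video (made-up values).
segments = [(0.0, 4.0, "exo"), (4.0, 9.0, "ego"), (9.0, 12.0, "exo")]
print(switch_label(segments, 3.0, 2.0))  # exo at t=3, ego at t=5
print(switch_label(segments, 5.0, 2.0))  # ego at both t=5 and t=7
```

Targets of this kind could then supervise a detector that sees only the past frames, the past narrations, and the next narration; the model itself is outside the scope of this sketch.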

Paper Structure

This paper contains 53 sections, 9 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Given a multi-view narrated how-to video, can we select the sequence of camera viewpoints that best show the activity---automating the camerawork that is today done with manual editing? While direct supervision for this task is impractical, our Switch-a-view approach shows how to learn typical viewpoint choice patterns from large-scale unlabeled in-the-wild instructional videos (left), then translate those patterns to novel multi-view videos (right), yielding an informative how-to that hops between the most useful ego/exo viewpoints.
  • Figure 2: Given varying-view instructional videos---videos composed of a sequence of views chosen by human(s) to accurately show the instructional activity at all times---our goal is to train a view-switch detector $D$ that can predict if the view should switch or not, at any time in a new video. Our hypothesis is that such a detector, when trained on large-scale and in-the-wild videos, can capture human view preferences and facilitate learning best view selection in multi-view settings with limited labels. However, such in-the-wild videos lack view labels. To train nevertheless, we propose an approach comprising (a) a view pseudo-labeler (left) that given a varying-view instructional video $I$, automatically classifies views in it and generates a pseudo-label set $\tilde{V}^I$, and (b) a view-switch detector $D$ (right) that given the pseudo-labels $\tilde{V}^I$ and any time $t$ in $I$, learns to predict the next view. The prediction is conditioned on the past frames, past narrations, and the next narration, where narrations are naturally occurring spoken content from the how-to demonstrator.
  • Figure 3: (a) Effect of sample count on our view selection (VS) performance; (b) Impact of joint finetuning with narration-based pseudo-labels [majumder2024viewpoint] and best view labels on view selection (VS).
  • Figure 4: Left: successful view-switch detections by our model on same-view (top) and view-switch cases (bottom). Our model correctly detects view switches by potentially anticipating the next step using past frames (same-view sample 1 and view-switch sample 2) or leveraging the content of the next narration (same-view sample 2 and view-switch samples 1 and 2). Right: successful view selections by our model on same-view (top) and view-switch cases (bottom). For view selection as well, our model can predict the desired next view by relying on the next narration (same-view sample 1 and view-switch samples 1 and 2), or anticipate it using the past narrations (same-view samples 1 and 2) or the past frames (same-view sample 1). These examples show that all three inputs play a role in our model's predictions.
  • Figure 5: Per-scenario breakdown of view selection performance for our model and the strongest baseline, LangView-bigData, measured with AP ($\%$).
  • ...and 2 more figures