Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
TL;DR
Switch-a-View tackles automatic view selection in multi-view instructional videos by learning from unlabeled but human-edited in-the-wild videos. It introduces a view-switch detection pretext task trained with pseudo-labels derived from scene boundaries and ego/exo classification, then repurposes the detector as a view selector using limited best-view labels. The model is a multimodal transformer that fuses past frames, past narrations, and the next narration to predict the next view. It achieves state-of-the-art results on HowTo100M and Ego-Exo4D, including zero-shot generalization. Because the pretext task needs no manual labels, the approach reduces annotation requirements, with implications for automated cinematography and instructional video synthesis.
Abstract
We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled -- but human-edited -- video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages. Project: https://vision.cs.utexas.edu/projects/switch_a_view/.
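To make the pseudo-labeling idea concrete, here is a minimal Python sketch of how segments in an edited video might be labeled by primary viewpoint and then turned into view-switch targets. It is an illustration under stated assumptions, not the authors' pipeline: the per-frame ego probabilities and scene-boundary indices are assumed to come from some off-the-shelf classifier and scene detector, and the names `Segment`, `pseudo_label_segments`, `view_switch_labels`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Segment:
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)
    view: str   # pseudo-label: "ego" or "exo"


def pseudo_label_segments(
    ego_probs: Sequence[float],
    boundaries: Sequence[int],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[Segment]:
    """Split frames at scene boundaries and label each segment by its mean ego probability."""
    n = len(ego_probs)
    if n == 0:
        return []
    # Keep only boundaries strictly inside the video, deduplicated and sorted.
    cuts = [0] + sorted({b for b in boundaries if 0 < b < n}) + [n]
    segments = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        mean_ego = sum(ego_probs[start:end]) / (end - start)
        view = "ego" if mean_ego >= threshold else "exo"
        segments.append(Segment(start, end, view))
    return segments


def view_switch_labels(segments: Sequence[Segment]) -> List[Tuple[int, int]]:
    """For each boundary between consecutive segments, emit (frame, 1 if the view changes else 0)."""
    return [
        (nxt.start, int(prev.view != nxt.view))
        for prev, nxt in zip(segments, segments[1:])
    ]


# Toy input: 7 frames with hypothetical classifier outputs and two detected cuts.
ego_probs = [0.9, 0.8, 0.2, 0.1, 0.3, 0.7, 0.9]
segments = pseudo_label_segments(ego_probs, boundaries=[2, 5])
switches = view_switch_labels(segments)
```

In this toy input the three segments are labeled ego, exo, ego, so both boundaries count as view switches. A detector trained on such targets would additionally condition on narration text, which this sketch leaves out.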
