Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos

Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman

TL;DR

LangView introduces a language-guided, weakly supervised framework for selecting the most informative view in multi-view instructional videos. It derives best-view pseudo-labels from per-view captioners by ranking views according to how well their captions match a view-agnostic narration, and trains a view selector augmented with a relative pose predictor to maintain view sensitivity. At test time, the model requires only the multi-view video and outputs the best view sequence per clip. Across Ego-Exo4D and LEMMA, LangView outperforms baselines on both automatic metrics and human judgments, supporting the use of caption-based language signals to guide view selection.

Abstract

Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose LangView, a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video (no language or camera poses) and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, on both quantitative metrics and human evaluation. Project page: https://vision.cs.utexas.edu/projects/which-view-shows-it-best.
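
The pseudo-labeling idea in the abstract (score each view's predicted caption against the view-agnostic narration, then pick the top-ranked view) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the bag-of-words F1 used here is a stand-in for whatever captioning metric LangView actually uses, and the function names `token_f1` and `best_view_pseudo_label`, as well as the example captions, are hypothetical.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Bag-of-words F1 between two strings (assumed stand-in caption metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_view_pseudo_label(view_captions: list[str], narration: str) -> tuple[int, list[float]]:
    """Score every view's caption against the narration and return the
    index of the highest-scoring view together with all scores."""
    scores = [token_f1(caption, narration) for caption in view_captions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical example: one narration, three per-view predicted captions.
narration = "C tightens the wheel nut with a wrench"
captions = [
    "C tightens the wheel nut with a wrench",  # view 0: sees hands and wheel
    "C stands next to a bike",                 # view 1: partial view
    "a person holds a tool",                   # view 2: distant view
]
best, scores = best_view_pseudo_label(captions, narration)
print("pseudo-label:", best)
```

In this toy case the view whose caption matches the narration most closely becomes the positive pseudo-label; the remaining scores give a ranking over the other views.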

Paper Structure

This paper contains 44 sections, 2 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: LangView idea: given multi-view instructional videos, we aim to learn a view selection model that can identify the best view for seeing how to perform the activity shown in the videos, in the absence of best view labels. To achieve this, we compare each estimated view-dependent caption to the view-agnostic ground-truth video narration of the human activity, and use their respective accuracies as a proxy for view quality. These quality scores then serve as pseudo-labels for learning to select the most informative view. In this example, the 1st view most clearly shows all entities involved in the activity (the wheel and the person's hands, and how they interact) and hence produces a caption that best matches the ground-truth, making it a positive pseudo-label for view selection.
  • Figure 2: (a) Our model uses language guidance to train a view-selector for multi-view instructional videos, such that the chosen views help best understand the shown activity. To do so, we first generate best view pseudo-labels during training by leveraging clip narrations, where each narration is a view-agnostic and detailed description of the activity. Specifically, given a training clip, we use off-the-shelf video captioners to predict a caption per view, score the views by comparing their captions to the ground-truth narration, and finally rank the views to generate a best view pseudo-label for the clip. Given the multi-view clip, our view classifier (bottom-left) encodes it with a visual encoder, and predicts a pseudo-label estimate. We also solve an auxiliary task of relative camera pose prediction (bottom-right) that increases the view sensitivity of the classifier. (b) Examples of predicted narrations, and the ranks and scores of the views per our pseudo-labeler, shown alongside ground-truth view-agnostic narrations. "C" refers to the person who is performing the activity. Note that at inference time, there is no ground truth narration, just the video input.
  • Figure 3: Left: sample successful predictions by our view selector. For each clip, our model chooses the view that shows the action, and the objects and body parts involved in it, most clearly, and hence, is most informative. Right: sample failure cases for our model, where there are multiple high-quality views that differ only in certain nuances, which are discernible by a human but not by our model trained through narration guidance. Whereas humans prefer a view that better captures the direction of the ball towards the camera-wearer in sample 1, or shows the full backward motion of the dancers in sample 2, our model chooses a view that shows all entities mentioned in the narration.
  • Figure 4: t-SNE [vandermaaten08a] plots of exo visual features of sample Ego-Exo4D [grauman2023ego] videos from basketball, bike repair, dance and cooking scenarios. Our model, when trained with the relative camera pose predictor, produces visual features that form neater clusters when grouped on the basis of different exo views, highlighting their improved view sensitivity.
  • Figure 5: Our model’s attention heatmaps on two best view clips from Ego-Exo4D [grauman2023ego]. Yellow patches indicate highest attention.
  • ...and 3 more figures
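
The Figure 2 caption describes training a view classifier on pseudo-labels alongside an auxiliary relative camera pose prediction task. A minimal Python sketch of how two such objectives could be combined into one loss follows. Everything specific here is an assumption for illustration: treating relative pose as classification over discretized bins, the `pose_weight` value, and the function names `cross_entropy` and `joint_loss` are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Softmax cross-entropy for a single example, computed stably
    with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def joint_loss(
    view_logits: list[float],
    pseudo_label: int,
    pose_logits: list[float],
    pose_target: int,
    pose_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """View-selection loss on the best-view pseudo-label plus a weighted
    auxiliary loss on an (assumed) discretized relative camera pose."""
    view_term = cross_entropy(view_logits, pseudo_label)
    pose_term = cross_entropy(pose_logits, pose_target)
    return view_term + pose_weight * pose_term


# Hypothetical example: 2 candidate views, 4 relative-pose bins.
loss = joint_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 1)
print(loss)
```

The auxiliary term only shapes training; consistent with the abstract, a selector trained this way would need neither camera poses nor language at inference time.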