Pseudo Dataset Generation for Out-of-Domain Multi-Camera View Recommendation

Kuan-Ying Lee, Qian Zhou, Klara Nahrstedt

TL;DR

Training the model on pseudo-labeled datasets generated from target-domain videos yields a 68% relative improvement in target-domain accuracy and bridges the accuracy gap between in-domain and never-before-seen domains.

Abstract

Multi-camera systems are indispensable in movies, TV shows, and other media. Selecting the appropriate camera at every timestamp has a decisive impact on production quality and audience preferences. Learning-based view recommendation frameworks can assist professionals in decision-making. However, they often struggle outside of their training domains. The scarcity of labeled multi-camera view recommendation datasets exacerbates the issue. Based on the insight that many videos are edited from the original multi-camera videos, we propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets. Promisingly, by training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model's accuracy in the target domain and bridge the accuracy gap between in-domain and never-before-seen domains.

Paper Structure

This paper contains 10 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) A model trained on a labeled multi-camera editing dataset of a particular domain generalizes poorly to a never-before-seen domain and the accuracy drops significantly. (b) Our proposed method leverages regular videos to generate pseudo-labeled datasets for the target domain and improve the model's accuracy. [Best viewed in color.]
  • Figure 2: Model Architecture. (a) The past encoder encodes all past features into a single feature vector. Then, a contrastive loss is applied to maximize the cosine similarity between the past and ground-truth features. (b) The feature extractor encodes a frame by adding a positional embedding to the image feature. Icons in the figure mark trainable and frozen modules. GT: ground truth. $F_N$ denotes the $N$-th frame of the video. [Best viewed in color.]
  • Figure 3: Pseudo Dataset Generation Pipeline. (a) Shots are detected in the input video and (b) clustered into groups. Shots within the same cluster are regarded as coming from the same "pseudo" camera. (c) A shot is selected as an anchor. The succeeding shot is the ground truth, while the most similar shot from each of the other N-1 pseudo cameras is chosen as a candidate. [Best viewed in color with zoom-in.]
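
Step (c) of the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that shot detection and clustering have already been run, so each shot carries a pseudo-camera id and a feature vector. The names `Shot`, `cosine`, and `build_sample` are hypothetical. The caption does not say what the candidates should be most similar to, so the sketch assumes cosine similarity to the ground-truth shot.

```python
import math
from dataclasses import dataclass


@dataclass
class Shot:
    index: int      # position of the shot in the edited video
    camera: int     # pseudo-camera id (cluster label), assumed precomputed
    feature: list   # shot-level feature vector, assumed precomputed


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_sample(shots, anchor_pos):
    """Build one pseudo-labeled sample around shots[anchor_pos].

    The succeeding shot is the ground truth. For every other pseudo camera,
    the shot most similar to the ground truth becomes a candidate
    (the similarity reference is an assumption of this sketch).
    """
    if anchor_pos + 1 >= len(shots):
        raise ValueError("the anchor needs a succeeding shot")
    anchor = shots[anchor_pos]
    ground_truth = shots[anchor_pos + 1]

    candidates = []
    other_cameras = sorted({s.camera for s in shots} - {ground_truth.camera})
    for cam in other_cameras:
        pool = [s for s in shots if s.camera == cam]
        best = max(pool, key=lambda s: cosine(s.feature, ground_truth.feature))
        candidates.append(best.index)

    return {
        "anchor": anchor.index,
        "ground_truth": ground_truth.index,
        "candidates": candidates,
    }


# Toy video with six shots grouped into N = 3 pseudo cameras.
shots = [
    Shot(0, 0, [1.0, 0.0]),
    Shot(1, 1, [0.0, 1.0]),
    Shot(2, 0, [0.9, 0.1]),
    Shot(3, 2, [0.1, 1.0]),
    Shot(4, 2, [1.0, 1.0]),
    Shot(5, 1, [0.0, 1.0]),
]
sample = build_sample(shots, anchor_pos=0)
print(sample)
```

With N pseudo cameras, each sample holds one ground-truth shot and N-1 candidates, which matches the structure of a multi-camera view recommendation example as described in the caption.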