Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey

Yuecong Xu, Haozhi Cao, Zhenghua Chen, Xiaoli Li, Lihua Xie, Jianfei Yang

TL;DR

Video unsupervised domain adaptation (VUDA) tackles domain shifts between labeled source videos and unlabeled target videos to improve generalization without target annotations. The survey categorizes closed-set VUDA methods into adversarial, discrepancy-based, semantic-based, reconstruction-based, and composite families, and reviews non-closed-set scenarios (PVDA, OSVDA, MSVDA, SFVDA, BVDA, VTTA, UI2V) with their respective techniques. It catalogues benchmark datasets spanning primary, larger-shift, partial-set, multi-domain, VDG, and cross-domain video semantic segmentation tasks, and discusses backbone choices and their impact on performance. The authors identify three practical challenges (multi-modality handling, privacy, and the lack of theoretical grounding for VUDA) and propose directions around transformer-based backbones, language-vision models, and expanded VUDA scenarios to enhance real-world applicability.

Abstract

Video analysis tasks such as action recognition have received increasing research interest, with growing applications in fields such as smart healthcare, thanks to the introduction of large-scale datasets and deep learning-based representations. However, video models trained on existing datasets suffer significant performance degradation when deployed directly to real-world applications, because of domain shifts between the public video datasets used for training (source video domains) and real-world videos (target video domains). Further, given the high cost of video annotation, it is more practical to train on unlabeled videos. To address both the performance degradation and the annotation cost in a unified way, video unsupervised domain adaptation (VUDA) adapts video models from a labeled source domain to an unlabeled target domain by alleviating video domain shift, improving the generalizability and portability of video models. This paper surveys recent progress in VUDA with deep learning. We begin with the motivation for VUDA, followed by its definition, recent progress in methods for both closed-set VUDA and VUDA under different scenarios, and current benchmark datasets for VUDA research. Finally, future directions are provided to promote further VUDA research. The repository of this survey is provided at https://github.com/xuyu0010/awesome-video-domain-adaptation.

Paper Structure

This paper contains 27 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Overview of the categorization of the different VUDA methods. Closed-set VUDA methods require an identical label space shared by a single pair of video source/target domains and assume that both the source and target data are accessible, with action recognition as the cross-domain task. Any VUDA method that does not satisfy these four constraints/assumptions is considered non-closed-set VUDA. Closed-set VUDA methods can be grouped into five categories based on how the source and target domains are aligned. A broader categorization strategy is adopted because the available research is limited, and to give a broader picture of the current progress of VUDA approaches. Non-closed-set VUDA methods are grouped into four categories by how their related scenarios differ from closed-set VUDA.
  • Figure 2: Training and testing pipeline of VUDA. Blue lines indicate the data flow of the labeled source data, while olive lines indicate the data flow of the unlabeled target data. The trained shared video model is frozen during testing (indicated by ❄).
  • Figure 3: Comparing non-closed-set VUDA versus closed-set VUDA.
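
The pipeline described in the Figure 2 caption can be illustrated with a small sketch. This is not code from the survey: the toy "video model" (temporal averaging followed by one projection), the mean-feature discrepancy used as the alignment term, and every name and parameter below (`video_features`, `mean_feature_discrepancy`, `vuda_training_loss`, `trade_off`) are assumptions chosen for illustration. It shows only the general shape of a discrepancy-based objective: a supervised loss on labeled source clips plus an alignment term computed on unlabeled target clips, with the shared model reused unchanged at test time.

```python
import numpy as np

def video_features(clips, weights):
    """Toy shared video model (hypothetical): average over frames, then project."""
    # clips: (batch, frames, dim) -> features: (batch, hidden)
    return np.tanh(clips.mean(axis=1) @ weights)

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, used only on labeled source clips."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def mean_feature_discrepancy(source_feats, target_feats):
    """Squared distance between the two domains' mean features (a simple stand-in)."""
    gap = source_feats.mean(axis=0) - target_feats.mean(axis=0)
    return float(gap @ gap)

def vuda_training_loss(src_clips, src_labels, tgt_clips, weights, classifier,
                       trade_off=1.0):
    """Source supervision plus source/target alignment; target labels are never used."""
    src_feats = video_features(src_clips, weights)   # labeled source data flow
    tgt_feats = video_features(tgt_clips, weights)   # unlabeled target data flow
    supervised = cross_entropy(src_feats @ classifier, src_labels)
    alignment = mean_feature_discrepancy(src_feats, tgt_feats)
    return supervised + trade_off * alignment

def predict(tgt_clips, weights, classifier):
    """Testing: the shared model is frozen, so this is a forward pass only."""
    return (video_features(tgt_clips, weights) @ classifier).argmax(axis=1)

# Synthetic data standing in for source and (shifted) target videos.
rng = np.random.default_rng(0)
src_clips = rng.normal(size=(8, 4, 16))            # 8 clips, 4 frames, 16-dim frames
tgt_clips = rng.normal(loc=0.5, size=(8, 4, 16))   # shifted mean mimics a domain shift
src_labels = rng.integers(0, 3, size=8)            # 3 action classes, source only
weights = rng.normal(scale=0.1, size=(16, 32))
classifier = rng.normal(scale=0.1, size=(32, 3))

loss = vuda_training_loss(src_clips, src_labels, tgt_clips, weights, classifier)
preds = predict(tgt_clips, weights, classifier)
```

In an actual method the loss would be minimized with respect to the model parameters, and the alignment term would be replaced by whichever adversarial, discrepancy-based, semantic-based, or reconstruction-based objective the method defines.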