Source-free Video Domain Adaptation by Learning from Noisy Labels

Avijit Dasgupta; C. V. Jawahar; Karteek Alahari

TL;DR

This paper tackles the practicality gap in video domain adaptation by removing the need for source data during adaptation. It introduces CleanAdapt and CleanAdapt + TS, which exploit pseudo-labels from a source-pretrained model and selectively fine-tune on low-loss, likely-clean target samples; a teacher-student variant further stabilizes pseudo-labels. Through extensive experiments on UCF-HMDB and EPIC-Kitchens, the approach achieves state-of-the-art results among source-free methods and demonstrates robust cross-domain retrieval improvements. The work emphasizes simplicity and practicality, offering a privacy- and compute-friendly pathway for real-world video recognition under distribution shift.

Abstract

Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts between the source and target domains remain source-dependent, as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach that addresses this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. We therefore treat source-free video domain adaptation as a problem of learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance with a teacher-student (TS) framework, in which a gradually updated teacher produces reliable pseudo-labels while the student is fine-tuned on the target domain videos using these pseudo-labels. Extensive experimental evaluations show that our methods, termed CleanAdapt and CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available at https://avijit9.github.io/CleanAdapt.
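The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the model is reduced to a matrix of logits, and the teacher update is assumed to be an exponential moving average (EMA) of student weights, which is one common way to realize a "gradually updated" teacher but is not confirmed by the text above.

```python
import numpy as np

def pseudo_labels(logits):
    """Pseudo-label = class with the highest score from the source-pretrained model."""
    return logits.argmax(axis=1)

def per_sample_ce(logits, labels):
    """Cross-entropy of each sample against its (pseudo-)label.

    Small values are treated as a sign that the pseudo-label is likely correct.
    """
    z = logits - logits.max(axis=1, keepdims=True)          # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def ema_update(teacher, student, momentum=0.99):
    """ASSUMPTION: teacher weights follow an EMA of the student weights."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}

# Toy example: two target videos, two classes.
logits = np.array([[2.0, 0.0], [0.0, 3.0]])
labels = pseudo_labels(logits)
losses = per_sample_ce(logits, labels)
```

In this sketch the second sample has the smaller loss because its prediction is more confident; a selection step would rank samples by these losses before fine-tuning.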

Paper Structure (18 sections, 9 equations, 9 figures, 9 tables)

Introduction
Related Work
Approach
Problem Definition
Self-training based Domain Adaptation
Clean Samples are All You Need
Source Pre-training
Pseudo-label Generation
CleanAdapt + TS: A Strong Video Adaptation Method
Results and Analysis
Datasets and Metrics
Implementation Details
Comparisons to the State-of-the-art Methods
Hyperparameter Search
Impact of Teacher-Student Framework
...and 3 more sections

Figures (9)

Figure 1: Existing approaches have a source-dependent adaptation stage achieving marginal performance gain over the source-pretrained models. On the other hand, our proposed methods CleanAdapt and CleanAdapt + TS achieve significant performance improvements over the source-only model while being source-free (i.e., the adaptation stage does not require videos from the source domain). (Best viewed in color.)
Figure 2: The radar plot illustrates the performance improvements of our proposed methods, CleanAdapt and CleanAdapt + TS (shown in orange and red, respectively), compared to the source-only model on multiple benchmarks. The source-only model (shown in yellow), trained on the source domain and tested on the target domain, serves as the lower bound of adaptation performance, while the target-supervised model (shown in blue), trained and tested on target domain videos, represents the upper bound. (Best viewed in color.)
Figure 3: Average cross-entropy loss per epoch of training with pseudo-labeled target domain videos for clean vs. noisy samples with (a) RGB modality and (b) Flow modality. We term the target domain samples with correct pseudo-labels clean samples and those with incorrect pseudo-labels noisy samples. Note that the ground-truth labels are used only to identify the clean vs. the noisy samples for visualization purposes and are not used for training the model. Deep neural networks learn the clean samples first before memorizing the noisy samples, in line with the deep memorization effect reported by Arpit et al. (2017). In our proposed approach CleanAdapt, we exploit this connection to select the clean samples for fine-tuning the model to adapt to the target domain. (Best viewed in color.)
Figure 4: Overview of the three stages of our CleanAdapt + TS framework for source-free video domain adaptation. (a) The model ($f_a$) is first pre-trained on the labeled source domain videos from $\mathcal{D}_s$. For brevity, only the single-stream model is shown here. (b) This source pre-trained model is then used to generate pseudo-labels $\hat{y}$ for the unlabeled target domain videos from $\mathcal{D}_t$. Inevitably, these pseudo-labels are noisy due to the domain shift between the source and the target domains. (c) A clean sample selection module is used to select a set $\mathcal{D}_{cl}$ of small-loss samples as potential clean samples. The source pre-trained model is fine-tuned on these clean samples from $\mathcal{D}_{cl}$ using their corresponding pseudo-labels $\hat{y}$. We repeat this step multiple times. See Sec. \ref{sec:results} for implementation details. (Best viewed in color.)
Figure 5: The clean sample selection module. The pseudo-labeled target domain videos from $\mathcal{D}_t$ are grouped according to their pseudo-labels $\hat{y}$ and sorted in ascending order of the loss generated by the model against their pseudo-labels. The keep-rate $\tau$ ($\tau = 0.6$ in this example) determines how many small-loss samples are selected per class for adaptation. For simplicity, we have used only four classes here. Solely for visualization purposes, videos with correct pseudo-labels are shown inside a green border and videos with incorrect pseudo-labels inside a red border. (Best viewed in color.)
...and 4 more figures
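
The class-wise selection described in the Figure 5 caption (group by pseudo-label, sort by loss in ascending order, keep a fraction $\tau$ per class) can be sketched as follows. This is an illustrative sketch only: the function name `select_clean` is hypothetical, and rounding the per-class count up with `ceil` is an assumption, since the caption does not say how fractional counts are handled.

```python
import numpy as np

def select_clean(losses, pseudo, keep_rate=0.6):
    """Return indices of small-loss samples, chosen separately for each pseudo-class.

    losses:    per-sample loss against the pseudo-label
    pseudo:    pseudo-label of each sample
    keep_rate: fraction tau of each class to keep
    """
    losses = np.asarray(losses, dtype=float)
    pseudo = np.asarray(pseudo)
    selected = []
    for c in np.unique(pseudo):
        idx = np.flatnonzero(pseudo == c)                    # samples assigned to class c
        order = idx[np.argsort(losses[idx], kind="stable")]  # ascending loss
        n_keep = int(np.ceil(keep_rate * len(idx)))          # ASSUMPTION: round up
        selected.extend(order[:n_keep].tolist())
    return sorted(selected)

# Toy example: five target videos, two pseudo-classes.
losses = [0.1, 0.9, 0.5, 0.2, 0.8]
pseudo = [0, 0, 0, 1, 1]
clean_idx = select_clean(losses, pseudo, keep_rate=0.5)
```

Selecting per class rather than globally keeps classes whose losses are uniformly higher from being dropped entirely, which appears to be the purpose of the grouping step in the caption.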
