MVUDA: Unsupervised Domain Adaptation for Multi-view Pedestrian Detection

Erik Brorsson, Lennart Svensson, Kristofer Bengtsson, Knut Åkesson

TL;DR

MVUDA tackles cross-camera-rig generalization for multi-view pedestrian detection under a strict unsupervised domain adaptation (UDA) setting, in which no labels are available for the target rig. It extends mean-teacher self-training to detectors that predict bird's-eye-view (BEV) occupancy maps and introduces a local-max pseudo-labeling scheme that produces reliable target labels without external labeled data, achieving state-of-the-art results on benchmarks such as MultiviewX→Wildtrack and Wildtrack→MultiviewX. The approach improves on baselines and existing methods, and ablations confirm the value of mean-teacher guidance, data augmentation, and the local-max strategy. By enabling robust adaptation across camera setups, MVUDA provides a practical, data-efficient baseline for future research in multi-view perception and cross-rig deployment.

Abstract

We address multi-view pedestrian detection in a setting where labeled data is collected using a multi-camera setup different from the one used for testing. While recent multi-view pedestrian detectors perform well on the camera rig used for training, their performance declines when applied to a different setup. To facilitate seamless deployment across varied camera rigs, we propose an unsupervised domain adaptation (UDA) method that adapts the model to new rigs without requiring additional labeled data. Specifically, we leverage the mean teacher self-training framework with a novel pseudo-labeling technique tailored to multi-view pedestrian detection. This method achieves state-of-the-art performance on multiple benchmarks, including MultiviewX$\rightarrow$Wildtrack. Unlike previous methods, our approach eliminates the need for external labeled monocular datasets, thereby reducing reliance on labeled data. Extensive evaluations demonstrate the effectiveness of our method and validate key design choices. By enabling robust adaptation across camera setups, our work enhances the practicality of multi-view pedestrian detectors and establishes a strong UDA baseline for future research.
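The mean teacher framework mentioned above keeps a teacher whose weights are an exponential moving average (EMA) of the student's weights. The following is a minimal sketch of that update only, assuming weights stored in a plain dictionary; the function name `ema_update`, the parameter `alpha`, and the toy values are illustrative assumptions, not taken from the paper.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights as alpha * teacher + (1 - alpha) * student.

    `teacher` and `student` are hypothetical dicts mapping parameter
    names to floats; a real detector would hold tensors instead.
    """
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy example: the teacher drifts toward the (here fixed) student weights.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.5)
```

In practice `alpha` is usually close to 1, so the teacher changes slowly and its pseudo-labels are more stable than the student's raw predictions.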

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, and 13 tables.

Figures (7)

  • Figure 1: Since labeled multi-view datasets are scarce, current methods for multi-view pedestrian detection that rely on labeled (source) datasets for training do not perform well on new camera setups (target). We consider unsupervised domain adaptation (a) where labeled source data alongside pseudo-labeled target data is used for training, greatly improving the model's performance on multiple benchmarks (b).
  • Figure 2: An overview of our proposed self-training method for UDA multi-view pedestrian detection. A student is trained with labels on the source domain and pseudo-labels on the target domain, which are created by a mean teacher. While the teacher creates pseudo-labels on unaugmented data, the student receives strongly augmented images. Note that the label and pseudo-label have been softened with a Gaussian kernel in this figure to ease visualization.
  • Figure 3: Illustrative example of predicted occupancy scores in one dimension.
  • Figure 4: Example of the regressed foot heat map $\hat{y}^n_{f}$ for the first camera ($n=1$) in the MultiviewX dataset.
  • Figure 5: Example of projected pseudo-label $y^n_{f}$ for the first camera ($n=1$) in the MultiviewX dataset.
  • ...and 2 more figures
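
The one-dimensional occupancy scores of Figure 3 suggest how a local-max rule could select pseudo-labels. The exact rule is not stated in this excerpt, so the sketch below is one plausible reading under stated assumptions: a cell becomes a pseudo-label if its score reaches a threshold and is the largest within a small neighborhood. The function name, `threshold`, `radius`, and the scores are all hypothetical.

```python
def local_max_pseudo_labels(scores, threshold=0.4, radius=1):
    """Return indices that are local maxima with score >= threshold.

    Assumed rule (not confirmed by the source): index i is kept when
    scores[i] >= threshold and scores[i] equals the maximum over the
    window [i - radius, i + radius], clipped to the sequence bounds.
    """
    peaks = []
    for i, s in enumerate(scores):
        lo = max(0, i - radius)
        hi = min(len(scores), i + radius + 1)
        if s >= threshold and s == max(scores[lo:hi]):
            peaks.append(i)
    return peaks


# Made-up teacher scores along one ground-plane axis.
scores = [0.05, 0.30, 0.55, 0.35, 0.10, 0.45, 0.20]
peaks = local_max_pseudo_labels(scores)
```

With these made-up scores, a hard threshold of 0.5 alone would keep only index 2, while the local-max rule with the lower threshold also keeps the weaker peak at index 5 and still emits a single cell per peak.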