MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos

Yihong Sun; Bharath Hariharan

TL;DR

MOD-UV addresses unsupervised mobile object detection by using motion signals in unlabeled videos to train a detector that operates on single static frames. It starts from pseudo-labels generated by motion segmentation, then applies a three-stage self-training pipeline (Moving2Mobile, Large2Small, and a final round) to recover the static and small objects that motion segmentation misses, yielding a mobile object detector trained without external supervision. On Waymo Open, nuScenes, and KITTI, MOD-UV achieves state-of-the-art performance for unsupervised mobile object detection; results on these datasets and COCO indicate strong cross-domain generalization, narrowing the gap to supervised methods. The work offers a practical framework for learning mobile object detectors from unlabeled video, and the code is publicly available.

Abstract

Embodied agents must detect and localize objects of interest, e.g. traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised instance detection and segmentation, but in the absence of annotated boxes, it is unclear how pixels must be grouped into objects and which objects are of interest. This results in over-/under-segmentation and irrelevant objects. Inspired by the human visual system and practical applications, we posit that the key missing cue for unsupervised detection is motion: objects of interest are typically mobile objects that frequently move, and their motions can specify separate instances. In this paper, we propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. We begin with instance pseudo-labels derived from motion segmentation, but introduce a novel training paradigm to progressively discover small objects and static-but-mobile objects that are missed by motion segmentation. As a result, though only learned from unlabeled videos, MOD-UV can detect and segment mobile objects from a single static image. Empirically, we achieve state-of-the-art performance in unsupervised mobile object detection on the Waymo Open, nuScenes, and KITTI datasets without using any external data or supervised models. Code is available at https://github.com/YihongSun/MOD-UV.
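To make the starting point concrete, the sketch below shows one simple way an instance-level motion segmentation could be turned into box pseudo-labels. It is an illustration under stated assumptions, not the authors' implementation: the mask encoding (0 for background, positive integers for instance IDs), the function name `masks_to_boxes`, and the `min_area` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates
Box = Tuple[int, int, int, int]


def masks_to_boxes(instance_mask: List[List[int]], min_area: int = 1) -> Dict[int, Box]:
    """Convert an instance-ID mask into one tight bounding box per instance.

    Assumption: 0 marks background and each positive integer marks one
    moving instance. Instances with fewer than `min_area` pixels are dropped.
    """
    extents: Dict[int, List[int]] = {}
    areas: Dict[int, int] = {}
    for y, row in enumerate(instance_mask):
        for x, inst_id in enumerate(row):
            if inst_id == 0:
                continue  # background pixel
            areas[inst_id] = areas.get(inst_id, 0) + 1
            if inst_id not in extents:
                extents[inst_id] = [x, y, x, y]
            else:
                box = extents[inst_id]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {
        inst_id: (box[0], box[1], box[2], box[3])
        for inst_id, box in extents.items()
        if areas[inst_id] >= min_area
    }


if __name__ == "__main__":
    toy_mask = [
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [2, 0, 0, 0],
    ]
    print(masks_to_boxes(toy_mask))
    print(masks_to_boxes(toy_mask, min_area=2))
```

An area threshold of this kind is one plausible way to suppress spurious motion fragments. Whether and how MOD-UV filters its initial pseudo-labels is not stated in this summary.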

Paper Structure (36 sections, 1 equation, 3 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Unsupervised Object Detection/Discovery from Images.
Unsupervised Object Detection/Discovery from 3D.
Unsupervised Object Detection/Discovery from Videos.
Method
Problem setup:
Initialization with Unsupervised Motion Segmentation
Self-Training for Unsupervised Mobile Object Detection.
Moving2Mobile: Learning to Detect Static Objects.
Large2Small: Learning to Detect Small Objects.
Final Round of Self-Training.
Implementation details
Experiments
Experimental Setup
...and 21 more sections

Figures (3)

Figure 1: Our approach, MOD-UV, learns from unlabeled videos in Waymo Open only and can reliably detect and segment mobile objects from a single input image.
Figure 2: Visualization of pseudo-labels at each stage of our self-training paradigm. Starting from the initial pseudo-labels $L^{(0)}_i$ generated from motion masks, $L^{(1)}_i$ retrieves the large static objects after Moving2Mobile and $L^{(2)}_i$ recovers the small objects after Large2Small.
Figure 3: Qualitative Results on Waymo Open, nuScenes, KITTI, and COCO, where all proposals with over 0.5 confidence are visualized. For CutLER and HASSOD, we apply an additional filtering that removes any proposals with <0.1 IoU with ground truth mobile objects, as denoted by CutLER$^*$ and HASSOD$^*$, respectively.
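The filtering described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming proposals are `(box, score)` pairs with boxes in `(x1, y1, x2, y2)` corner format; the function names `iou` and `filter_proposals` and the flag `require_gt_overlap` are hypothetical, and only the thresholds 0.5 and 0.1 come from the caption.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(
    proposals: Sequence[Tuple[Box, float]],
    gt_boxes: Sequence[Box],
    min_conf: float = 0.5,
    min_iou: float = 0.1,
    require_gt_overlap: bool = False,
) -> List[Tuple[Box, float]]:
    """Keep proposals whose confidence exceeds `min_conf`.

    With `require_gt_overlap=True`, additionally drop any proposal whose best
    IoU with a ground-truth mobile object is below `min_iou` (the extra
    filter the caption applies to the starred baselines).
    """
    kept = []
    for box, score in proposals:
        if score <= min_conf:
            continue
        if require_gt_overlap:
            best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
            if best < min_iou:
                continue
        kept.append((box, score))
    return kept


if __name__ == "__main__":
    proposals = [
        ((0.0, 0.0, 10.0, 10.0), 0.9),       # confident, overlaps ground truth
        ((100.0, 100.0, 110.0, 110.0), 0.8),  # confident, no overlap
        ((0.0, 0.0, 10.0, 10.0), 0.3),        # below the confidence threshold
    ]
    gt = [(0.0, 0.0, 10.0, 10.0)]
    print(len(filter_proposals(proposals, gt)))
    print(len(filter_proposals(proposals, gt, require_gt_overlap=True)))
```

Note that the ground-truth filter only affects which baseline proposals are drawn. Per the caption, it is applied to CutLER and HASSOD, not to MOD-UV.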
