
MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos

Yihong Sun, Bharath Hariharan

TL;DR

MOD-UV tackles unsupervised mobile object detection by using motion signals in unlabeled videos to train a detector that operates on single static frames. It starts from pseudo-labels generated by motion segmentation and then applies a three-stage self-training pipeline (Moving2Mobile, Large2Small, and a Final Round) to recover the static and small objects that motion segmentation misses, producing a mobile object detector without external supervision. Across Waymo Open, nuScenes, KITTI, and COCO, MOD-UV delivers state-of-the-art performance for unsupervised mobile object detection and demonstrates strong cross-domain generalization, narrowing the gap to supervised methods. The work provides a practical framework for learning mobile object detectors from unlabeled video, and the code is publicly available for replication and extension.

Abstract

Embodied agents must detect and localize objects of interest, e.g., traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised instance detection and segmentation, but in the absence of annotated boxes, it is unclear how pixels must be grouped into objects and which objects are of interest. This results in over-/under-segmentation and irrelevant objects. Inspired by the human visual system and practical applications, we posit that the key missing cue for unsupervised detection is motion: objects of interest are typically mobile objects that frequently move, and their motions can specify separate instances. In this paper, we propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. We begin with instance pseudo-labels derived from motion segmentation, but introduce a novel training paradigm to progressively discover small objects and static-but-mobile objects that are missed by motion segmentation. As a result, though only learned from unlabeled videos, MOD-UV can detect and segment mobile objects from a single static image. Empirically, we achieve state-of-the-art performance in unsupervised mobile object detection on the Waymo Open, nuScenes, and KITTI datasets without using any external data or supervised models. Code is available at https://github.com/YihongSun/MOD-UV.

Paper Structure

This paper contains 36 sections, 1 equation, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: Our approach, MOD-UV, learns only from unlabeled videos in Waymo Open and can reliably detect and segment mobile objects from a single input image.
  • Figure 2: Visualization of pseudo-labels at each stage of our self-training paradigm. Starting from the initial pseudo-labels $L^{(0)}_i$ generated from motion masks, $L^{(1)}_i$ retrieves the large static objects after Moving2Mobile and $L^{(2)}_i$ recovers the small objects after Large2Small.
  • Figure 3: Qualitative Results on Waymo Open, nuScenes, KITTI, and COCO, where all proposals with over 0.5 confidence are visualized. For CutLER and HASSOD, we apply an additional filtering that removes any proposals with <0.1 IoU with ground truth mobile objects, as denoted by CutLER$^*$ and HASSOD$^*$, respectively.
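
The proposal filtering described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' visualization code. The function names (`box_iou`, `filter_proposals`), the `(x1, y1, x2, y2)` box format, and the choice to compute IoU on bounding boxes rather than masks are all assumptions made for the example. Only the two thresholds (confidence over 0.5, IoU of at least 0.1 with a ground truth mobile object) come from the caption.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]
Proposal = Tuple[Box, float]  # (box, confidence score)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(
    proposals: Sequence[Proposal],
    gt_boxes: Sequence[Box],
    min_conf: float = 0.5,           # caption: "over 0.5 confidence"
    min_iou: float = 0.1,            # caption: drop proposals with <0.1 IoU
    require_gt_overlap: bool = False,
) -> List[Proposal]:
    """Keep confident proposals; optionally drop those far from any GT box.

    With require_gt_overlap=True this mimics the extra filtering that the
    caption marks with an asterisk (CutLER*, HASSOD*). Using boxes instead
    of masks for the IoU is an assumption of this sketch.
    """
    kept: List[Proposal] = []
    for box, score in proposals:
        if score <= min_conf:
            continue
        if require_gt_overlap:
            best = max((box_iou(box, g) for g in gt_boxes), default=0.0)
            if best < min_iou:
                continue
        kept.append((box, score))
    return kept


if __name__ == "__main__":
    proposals = [
        ((0.0, 0.0, 10.0, 10.0), 0.9),       # confident, overlaps GT
        ((100.0, 100.0, 110.0, 110.0), 0.8),  # confident, no GT overlap
        ((0.0, 0.0, 10.0, 10.0), 0.3),        # low confidence
    ]
    gt = [(0.0, 0.0, 10.0, 12.0)]
    print(len(filter_proposals(proposals, gt)))
    print(len(filter_proposals(proposals, gt, require_gt_overlap=True)))
```

Calling `filter_proposals` with the default arguments corresponds to the plain confidence threshold applied to every method. Setting `require_gt_overlap=True` adds the ground-truth-based step, which relies on annotations and therefore serves only to make the visual comparison fairer to baselines that also detect non-mobile objects.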