FMG-Det: Foundation Model Guided Robust Object Detection

Darryl Hannan, Timothy Doster, Henry Kvinge, Adam Attarian, Yijing Watkins

TL;DR

FMG-Det addresses bounding box annotation noise in object detection in two stages. First, a zero-shot Foundation Model Correction (FMC) pipeline that leverages SAM and CLIP corrects the noisy boxes. The detector is then trained with multiple instance learning (MIL) based denoising and an instance interpolation module. The approach is detector-agnostic, and its correction stage runs offline. It achieves state-of-the-art MAE reductions on VOC and COCO, with notable gains in few-shot settings. The results show that rectifying labels before training can substantially improve downstream robustness to labeling noise while keeping training costs low. This enables more reliable object detection in domains with weak or ambiguous annotations and supports practical deployment with limited data.

Abstract

Collecting high-quality data for object detection tasks is challenging due to the inherent subjectivity in labeling the boundaries of an object. This makes it difficult not only to collect consistent annotations across a dataset but also to validate them, as no two annotators are likely to label the same object using the exact same coordinates. These challenges are further compounded when object boundaries are partially visible or blurred, which can be the case in many domains. Training on noisy annotations significantly degrades detector performance and can render detectors unusable, particularly in few-shot settings, where just a few corrupted annotations can impact model performance. In this work, we propose FMG-Det, a simple, efficient methodology for training models with noisy annotations. More specifically, we propose combining a multiple instance learning (MIL) framework with a pre-processing pipeline that leverages powerful foundation models to correct labels prior to training. This pre-processing pipeline, along with slight modifications to the detector head, results in state-of-the-art performance across a number of datasets, for both standard and few-shot scenarios, while being much simpler and more efficient than other approaches.

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Even small amounts of noise in bounding box coordinates can have a significant impact on the parts of an object that are captured by the bounding box. This is visualized for two noise levels on VOC 2007 training examples (blue = original, red = noisy). In this paper, we propose a method to mitigate bounding box annotation noise (green = corrected by our proposed method, FMG-Det).
  • Figure 2: Overview of our proposed Foundation Model Correction (FMC) pre-processing pipeline. First, Segment Anything (SAM) extracts a set of candidate regions and a corresponding set of scores, using both point and bounding box prompts to diversify the masks that are produced. Next, CLIP scores each region using the ground truth label. The CLIP and SAM scores are then combined, and the mask with the highest score is selected. Lastly, the mask is converted to a bounding box and compared against the original noisy annotation to ensure it did not shift too severely. This pipeline is run in a zero-shot fashion, completely offline.
  • Figure 3: Illustration of how model performance is impacted as noise increases in few-shot settings for the PASCAL VOC dataset. The average drop in mAP is calculated by taking the mean absolute error (MAE) of the model across all noise levels as a percentage of the base model's performance with no noise.
  • Figure 4: VOC test mAP demonstrating the considerable impact of bounding box noise on model performance in prior state-of-the-art models, including OA-MIL and SSD-Det, compared to our proposed FMG-Det algorithm.
  • Figure 5: An overview of our proposed Instance Interpolation module. Both the corrected and noisy bounding boxes are passed to this module. It extracts features for each bounding box using the backbone that already exists in the detector and, from these features, predicts a value $\gamma$ that is then used to interpolate between the corrected and noisy boxes.
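
The selection and interpolation steps described in the captions for Figures 2 and 5 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions say only that the SAM and CLIP scores are "combined" and that the corrected box is checked so it "did not shift too severely", so the weighted sum (`alpha`), the IoU threshold (`min_iou`), the fallback to the noisy box, and the convention that $\gamma = 1$ selects the corrected box are all hypothetical choices. The masks and scores are assumed to have been computed beforehand by SAM and CLIP and are passed in as plain arrays.

```python
import numpy as np


def mask_to_box(mask):
    """Tightest axis-aligned box (x1, y1, x2, y2) around the True pixels of a 2D mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1], dtype=float)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def correct_box(noisy_box, masks, sam_scores, clip_scores, alpha=0.5, min_iou=0.3):
    """Pick the best-scoring candidate mask and convert it to a box.

    Assumptions (not specified in the source): scores are combined by a
    weighted sum, and the "did not shift too severely" check is an IoU
    threshold that falls back to the original noisy box when it fails.
    """
    combined = alpha * np.asarray(sam_scores) + (1.0 - alpha) * np.asarray(clip_scores)
    best = int(np.argmax(combined))
    if not masks[best].any():  # an empty mask gives no usable box
        return noisy_box
    candidate = mask_to_box(masks[best])
    if iou(candidate, noisy_box) < min_iou:
        return noisy_box
    return candidate


def interpolate_boxes(corrected_box, noisy_box, gamma):
    """Blend the two boxes with a predicted gamma in [0, 1].

    Assumed convention: gamma = 1 returns the corrected box and gamma = 0
    the noisy one. In the paper gamma is predicted from backbone features;
    here it is simply an input.
    """
    return gamma * np.asarray(corrected_box) + (1.0 - gamma) * np.asarray(noisy_box)
```

In this sketch, a candidate mask that scores well but lies far from the annotated location is rejected by the IoU check, so the original annotation is kept rather than replaced with a box for a different object.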