An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection

Pengfei Qi, Yifei Zhang, Wenqiang Li, Youwen Hu, Kunlong Bai

TL;DR

This paper introduces Objects365-Attr, an extension of the existing Objects365 dataset with attribute annotations. It aims to reduce inconsistencies in object detection by integrating a broad spectrum of attributes, including color, material, state, texture and tone.

Abstract

Detecting objects of interest through language often presents challenges, particularly with objects that are uncommon or complex to describe, due to perceptual discrepancies between automated models and human annotators. These challenges highlight the need for comprehensive datasets that go beyond standard object labels by incorporating detailed attribute descriptions. To address this need, we introduce the Objects365-Attr dataset, an extension of the existing Objects365 dataset, distinguished by its attribute annotations. This dataset reduces inconsistencies in object detection by integrating a broad spectrum of attributes, including color, material, state, texture and tone. It contains an extensive collection of 5.6M object-level attribute descriptions, meticulously annotated across 1.4M bounding boxes. Additionally, to validate the dataset's effectiveness, we conduct a rigorous evaluation of YOLO-World models at different scales, measuring their detection performance and demonstrating the dataset's contribution to advancing object detection.

Paper Structure (10 sections, 3 figures, 4 tables)

This paper contains 10 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: (a) illustrates an example from the Objects365-Attr dataset, generated through the auto-annotated pipeline. The goal is to systematically output all visual attributes for each category represented in the image. (b) shows the five major attribute categories and their corresponding 39 subcategories within the Objects365-Attr dataset.
  • Figure 2: The Auto-Annotated Pipeline. It consists of three key steps: (1) creating a structured dataset for training LLaVA, (2) fine-tuning LLaVA and running inference, and (3) checking, correcting and outputting the data. These steps leverage the image understanding and text generation capabilities of multimodal large models, combining existing attribute datasets with a small amount of human involvement to form the automatic annotation pipeline.
  • Figure 3: Comparison of inference results using the YOLO-World-L weights with inference results after additional pre-training with the Objects365-Attr dataset. (a), (b) show visualizations on the LVIS dataset, focusing exclusively on the rare categories. (c), (d) show visualizations for a class name with attributes. (e), (f) show visualizations of the different weights when given different prompts.
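
To make the idea of object-level attribute annotations more concrete, the following is a minimal sketch of how one attribute-annotated bounding box might be represented and turned into a text prompt for an open-vocabulary detector. The five attribute types come from the abstract. Everything else is an assumption made for illustration: the `AttributedBox` record, its field names, the `(x, y, width, height)` box layout, and the `validate` and `to_prompt` helpers are hypothetical and do not reflect the actual Objects365-Attr file format or the authors' prompt construction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The five attribute types named in the abstract.
ATTRIBUTE_TYPES = ("color", "material", "state", "texture", "tone")


@dataclass
class AttributedBox:
    """Hypothetical record: one bounding box plus its attribute annotations."""
    category: str
    bbox: Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)
    attributes: Dict[str, List[str]] = field(default_factory=dict)


def validate(box: AttributedBox) -> List[str]:
    """Return any attribute keys that are not one of the five known types."""
    return [key for key in box.attributes if key not in ATTRIBUTE_TYPES]


def to_prompt(box: AttributedBox) -> str:
    """Join attribute values (in fixed type order) with the class name."""
    words: List[str] = []
    for attr_type in ATTRIBUTE_TYPES:
        words.extend(box.attributes.get(attr_type, []))
    words.append(box.category)
    return " ".join(words)


# Illustrative values only; not taken from the dataset.
example = AttributedBox(
    category="cup",
    bbox=(10.0, 20.0, 50.0, 60.0),
    attributes={"material": ["ceramic"], "color": ["red"]},
)
prompt = to_prompt(example)
```

A prompt built this way places the attribute words before the class name, which is one simple way to form the kind of "class name with attributes" input mentioned in Figure 3. A check like `validate` corresponds loosely to the data-check step in Figure 2, although the paper's actual checks are not described in this section.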