Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Jinwoo Ahn, Hyeokjoon Kwon, Hwiyeon Yoo

TL;DR

FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and lets users directly guide the detection process via natural language, minimizing unnecessary user intervention while still granting them significant control.

Abstract

The recent advent of vision-based foundation models has made efficient, high-quality object detection readily achievable. Despite the success of previous studies, object detection models remain limited in capturing small components of holistic objects and in taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and allows users to directly guide the detection process via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention while granting users significant control. With FOCUS, users can make explainable requests to actively guide the detection process in the intended direction. Our results show that FOCUS effectively enhances the detection capabilities of baseline models and performs consistently across varying object types.

Paper Structure

This paper contains 18 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: A conceptual overview of the FOCUS framework is shown above. (1) First, the input image and bounding box are processed to eliminate unnecessary regions from the image. (2) Second, the VLM is prompted to propose all objects that are present in the processed image. (3) Third, the proposal from the previous step serves as guidance to output final detection results.
  • Figure 2: A full overview of the FOCUS framework is shown above. The first step extracts the binary mask of the region designated by the user as shown by the green box on the top left. Following this step, the VLM processes the image to output a proposal of objects. Finally, the detection model processes the proposal from the previous step to output the final detection results.
  • Figure 3: Robustness comparison between the DINO + FOCUS integration and the baseline DINO under changes to the confidence threshold. Results show that FOCUS is more resistant to changes in the threshold cutoff, resulting in more confident detections.
  • Figure 4: Sample results on different region sizes and object types. Top row shows descending region size from left to right; bottom row shows detections on non-human targets. Results show that FOCUS is not restricted by the size of the target or the object type.
  • Figure 5: Sample results on the same target using different prompts. Top row shows instance detection; bottom row shows anatomic detection. Results show that FOCUS can perform various detection tasks by varying the text prompt.
  • ...and 1 more figure
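
The three steps described in the captions of Figures 1 and 2 (restrict the image to the user-designated region, have a VLM propose objects, then run a detector guided by that proposal) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `mask_outside_box`, `focus_pipeline`, `propose_objects`, and `detect` are hypothetical, the VLM and detector are passed in as plain callables, and simple rectangular masking stands in for the binary mask extraction the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[int]]          # toy grayscale image as nested lists


@dataclass
class Detection:
    label: str
    box: Box
    score: float


def mask_outside_box(image: Image, box: Box, fill: int = 0) -> Image:
    """Step 1 (assumed stand-in): keep pixels inside the user box, blank the rest."""
    x0, y0, x1, y1 = box
    return [
        [px if (x0 <= x < x1 and y0 <= y < y1) else fill
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def focus_pipeline(
    image: Image,
    box: Box,
    user_prompt: str,
    propose_objects: Callable[[Image, str], List[str]],
    detect: Callable[[Image, List[str]], List[Detection]],
    threshold: float = 0.5,
) -> List[Detection]:
    """Chain the three steps; the callables are hypothetical model wrappers."""
    region = mask_outside_box(image, box)             # Step 1: remove unneeded regions
    proposals = propose_objects(region, user_prompt)  # Step 2: VLM proposes object names
    detections = detect(region, proposals)            # Step 3: proposal-guided detection
    return [d for d in detections if d.score >= threshold]


# Stub models used only to exercise the control flow (not real VLM/detector calls).
def fake_vlm(region: Image, prompt: str) -> List[str]:
    return ["wheel", "door"]


def fake_detector(region: Image, labels: List[str]) -> List[Detection]:
    # Assign decreasing scores so that thresholding has a visible effect.
    return [Detection(label, (1, 1, 2, 2), 0.9 - 0.5 * i)
            for i, label in enumerate(labels)]


toy_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
results = focus_pipeline(toy_image, (1, 1, 3, 3), "parts of the car",
                         fake_vlm, fake_detector)
```

Passing the models in as callables keeps the sketch independent of any particular VLM or detector; in practice each stub would be replaced by a wrapper around the actual model, and the rectangular mask by a segmentation-derived one.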