FOCUS: Towards Universal Foreground Segmentation

Zuyao You, Lingyu Kong, Lingchen Meng, Zuxuan Wu

TL;DR

FOCUS addresses the fragmentation of foreground segmentation by introducing a unified, multi-modal framework that jointly models foreground and background. It uses ground queries, a multi-scale edge-enhanced backbone, and a CLIP-based distiller (the CLIP refiner) to produce boundary-aware masks across five tasks: salient object detection (SOD), camouflaged object detection (COD), shadow detection (SD), defocus blur detection (DBD), and forgery detection (FD). Experiments on 13 datasets covering these 5 tasks show that FOCUS matches or exceeds task-specific and other universal methods, demonstrating strong cross-task generalization and boundary precision. The work highlights the importance of background information and boundary cues for universal foreground segmentation and offers a practical, extensible approach for real-world applications.

Abstract

Foreground segmentation is a fundamental task in computer vision, encompassing various subtasks. Previous research has typically designed a task-specific architecture for each subtask, leading to a lack of unification. Moreover, these methods primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework, which can handle multiple foreground tasks. We develop a multi-scale semantic network that uses the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method that integrates a contrastive learning strategy to refine the prediction mask in a multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics.

Paper Structure (18 sections, 12 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: With one unified architecture, FOCUS can handle various foreground segmentation tasks. Our proposed method can generate boundary-aware masks that are smoother and more detailed than the previous state-of-the-art task-specific models. Zoom in for more details.
  • Figure 2: An overview of FOCUS, a multi-scale and multi-modal semantic framework for universal foreground segmentation. It mainly comprises the backbone, edge enhancer, feature decoder, and CLIP refiner. Refer to the main text for details.
  • Figure 3: Qualitative comparison of FOCUS and previous methods on COD, SOD, SD, DBD, and FD. Zoom in for more details.
  • Figure 4: The visualization of PCA-based dimensionality reduction on the feature maps across different iterations.
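
The PCA-based visualization mentioned for Figure 4 can be illustrated with a minimal sketch. This is not the authors' code: the function name `pca_feature_map_to_rgb`, the `(C, H, W)` feature layout, the choice of three components, and the min-max scaling to an RGB image are all assumptions made for illustration, and the input here is random data rather than real FOCUS features.

```python
import numpy as np


def pca_feature_map_to_rgb(features: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project a (C, H, W) feature map onto its top principal components.

    Hypothetical helper for illustration only; assumes C >= n_components
    and H * W >= n_components.
    """
    c, h, w = features.shape
    # Treat every spatial location as one sample with C feature dimensions.
    flat = features.reshape(c, h * w).T  # shape (H*W, C)
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T  # shape (H*W, n_components)
    # Min-max scale each component to [0, 1] so it can be shown as a color channel.
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    scaled = (projected - lo) / np.maximum(hi - lo, 1e-12)
    return scaled.reshape(h, w, n_components)


# Usage with random stand-in features (16 channels on an 8x8 grid).
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(16, 8, 8))
rgb = pca_feature_map_to_rgb(demo_features)
```

Applying the same projection to feature maps saved at different training iterations, and displaying each result as an image, gives a rough view of how the features separate foreground from background over time.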