UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang

TL;DR

UFO proposes a decoder-free, unified framework that converts all fine-grained perception targets into an open-ended language interface, enabling detection, segmentation, and vision-language tasks within a single model. It reformulates segmentation as an embedding retrieval problem using mask token embeddings and upsamples masks via $N^2$ mask tokens, all while keeping output text sequences open-ended. The approach demonstrates substantial gains over prior generalist models and achieves results competitive with specialist methods on COCO, ADE20K, and ReasonSeg, while remaining compatible with large vision-language models through simple post-processing. By eliminating task-specific decoders and relying on shared image-text representations, UFO offers a scalable path toward more capable multimodal systems that integrate with existing MLLMs.

Abstract

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
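
The embedding retrieval idea and the $N^2$-token upsampling mentioned in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that similarity is a scaled dot product between a mask token embedding and per-patch image features, that a patch belongs to the mask when its score exceeds a threshold of 0, and that token $k = iN + j$ supplies the sub-pixel at offset $(i, j)$ inside each patch. The function names `retrieve_mask` and `retrieve_upsampled_mask` are hypothetical.

```python
import numpy as np

def retrieve_mask(mask_emb, image_feats, threshold=0.0):
    """Retrieve a coarse mask with one mask token (illustrative assumption).

    mask_emb:    (D,) embedding of a single mask token
    image_feats: (H, W, D) per-patch image features
    returns:     (H, W) boolean mask
    """
    d = mask_emb.shape[-1]
    scores = image_feats @ mask_emb / np.sqrt(d)  # scaled dot product, (H, W)
    return scores > threshold

def retrieve_upsampled_mask(mask_embs, image_feats, n, threshold=0.0):
    """Retrieve an n-times upsampled mask with n*n mask tokens.

    mask_embs:   (n*n, D); token k = i*n + j is assumed to fill
                 sub-pixel (i, j) of every patch
    image_feats: (H, W, D)
    returns:     (H*n, W*n) boolean mask
    """
    h, w, d = image_feats.shape
    # One similarity map per mask token: (n*n, H, W)
    scores = np.einsum("kd,hwd->khw", mask_embs, image_feats) / np.sqrt(d)
    # Interleave the n*n maps so that out[y*n + i, x*n + j] = scores[i*n + j, y, x]
    scores = scores.reshape(n, n, h, w).transpose(2, 0, 3, 1)  # (H, n, W, n)
    return scores.reshape(h * n, w * n) > threshold
```

With $N = 4$, a 16-token answer would turn an $H \times W$ feature grid into a $4H \times 4W$ mask under these assumptions; any remaining gap to the input resolution would need a separate interpolation step, which this sketch omits.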

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures, 21 tables.

Figures (10)

  • Figure 1: Methods to augment MLLMs with fine-grained perception tasks. (a) Relying on task decoders [lai2024lisa, wu2024visionllm]. (b) Previous text-based methods represent boxes with location tokens [peng2023kosmos] and represent masks with suboptimal polygons [wang2023visionllm, wang2024git] or textual classes [wang2024git, lan2024text4seg]. (c) Ours: predicting open-ended text sequences while using a simple yet effective embedding retrieval approach for masks.
  • Figure 2: Overview of our approach. (a) Segmentation modeling: the mask token embedding retrieves similar image features to generate masks (shown with matching colors). (b) Upsampling masks with multiple mask tokens: more tokens retrieve more detail. We use $N=2$ for illustration and $N=4$ in the implementation. (c) We output open-ended text sequences with textual numbers for detection.
  • Figure 3: Multi-task data template examples. Red dots represent sampled grid point features, acting as local visual prompts for generating text sequences for nearby objects.
  • Figure 4: Attention mask visualizations. (a) We apply bidirectional attention for image features. (b) For multi-prediction tasks, we mask each subsequence from seeing others.
  • Figure 5: Visualizations of retinal vessel segmentation.
  • ...and 5 more figures
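
The attention pattern described in the Figure 4 caption can be sketched as a boolean mask. This is a simplified illustration under stated assumptions, not the paper's code: image tokens form a shared prefix with bidirectional attention, each predicted subsequence attends to that prefix and causally to itself, and subsequences cannot see one another. Causal attention inside a subsequence is an assumption, and task prompts and grid point features are left out. The name `build_attention_mask` is hypothetical.

```python
import numpy as np

def build_attention_mask(num_image_tokens, subseq_lengths):
    """Return allowed[i, j] == True when token i may attend to token j.

    num_image_tokens: size of the shared, bidirectional image prefix
    subseq_lengths:   lengths of the independently decoded subsequences
    """
    total = num_image_tokens + sum(subseq_lengths)
    allowed = np.zeros((total, total), dtype=bool)

    # (a) Image features attend to each other in both directions.
    allowed[:num_image_tokens, :num_image_tokens] = True

    start = num_image_tokens
    for length in subseq_lengths:
        end = start + length
        # Every subsequence sees the full image prefix.
        allowed[start:end, :num_image_tokens] = True
        # Causal attention inside the subsequence (assumed).
        allowed[start:end, start:end] = np.tril(np.ones((length, length), dtype=bool))
        # (b) Blocks linking different subsequences stay False.
        start = end
    return allowed
```

For example, `build_attention_mask(2, [2, 1])` yields a 5×5 mask in which the final token attends to the two image tokens and to itself, but not to the two tokens of the first subsequence.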