Visual Bridge: Universal Visual Perception Representations Generating

Yilin Gao, Shuguang Dou, Junzhou Li, Zhiheng Yu, Yin Li, Dongsheng Jiang, Shugong Xu

TL;DR

Vision Bridge introduces a universal flow-matching framework that converts tokens from a self-supervised vision foundation model into task-specific visual representations across classification, detection, segmentation, depth estimation, and image-text retrieval. It learns a velocity field conditioned on multi-scale and circular task embeddings to bridge heterogeneous tasks, enabling zero-shot transfer and flexible fine-tuning without external data. The approach demonstrates competitive or superior performance across five core vision tasks, supported by ablations and visual analyses that reveal robust generalization, scalable capacity, and meaningful feature dynamics. This work advances general-purpose vision modeling by unifying diverse perception tasks under a single, trainable flow-based paradigm grounded in token-to-representation transformations.

Abstract

Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
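To make the flow-matching formulation in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a straight-line (linear) interpolation path between a source token vector and a target task representation, which this page does not confirm. All function names (`interpolate`, `target_velocity`, `circular_task_embedding`, `flow_matching_loss`, `euler_integrate`) are hypothetical, and the circular task embedding is only one plausible reading: tasks placed evenly on the unit circle.

```python
import math


def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight path from x0 to x1 (assumed path)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t and equals x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def circular_task_embedding(task_index, num_tasks):
    """Hypothetical embedding: map each task to an evenly spaced point on the unit circle."""
    angle = 2.0 * math.pi * task_index / num_tasks
    return (math.cos(angle), math.sin(angle))


def flow_matching_loss(velocity_fn, x0, x1, t, task_emb):
    """Mean squared error between the predicted and the target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    predicted = velocity_fn(x_t, t, task_emb)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_integrate(velocity_fn, x0, task_emb, num_steps=10):
    """Inference: integrate the velocity field from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t, task_emb)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    tokens = [0.0, 1.0, -2.0]   # stand-in for foundation-model patch tokens
    target = [1.0, 3.0, 0.5]    # stand-in for a task-specific representation
    emb = circular_task_embedding(task_index=1, num_tasks=5)

    # An oracle velocity field that already knows the true direction.
    def oracle(x, t, task_emb):
        return target_velocity(tokens, target)

    print(flow_matching_loss(oracle, tokens, target, 0.3, emb))
    print(euler_integrate(oracle, tokens, emb, num_steps=8))
```

In a real system the oracle would be replaced by a learned network conditioned on the task and scale embeddings, and the integrated result would be passed to a task-specific decoder. With the oracle field the loss is zero and Euler integration recovers the target up to floating-point error.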

Paper Structure

This paper contains 30 sections, 7 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: (a) We introduce Vision Bridge, a universal framework that bridges image patch tokens and fundamental vision tasks through flow-based modeling. The architecture supports diverse downstream tasks, including classification, detection, segmentation, depth estimation, and image-text retrieval. Vision Bridge enables task-agnostic representation learning and task-specific adaptation without introducing external data. (b) Radar plot comparing performance across five core vision tasks. Our method consistently outperforms strong baselines on ImageNet-1K, COCO, ADE20K, and NYUv2.
  • Figure 2: Overview of the proposed Visual Bridge. During training, tokens from the foundation model are sampled and interpolated with task-specific representations at multiple scales. A universal velocity field, conditioned on circular task embeddings and learnable scale embeddings, models the dynamics at each step. During inference, the learned flow is integrated to generate task-specific outputs (e.g., bounding boxes, labels, segmentation masks) using dedicated decoders. The proposed architecture enables efficient and flexible unification of a wide range of visual perception tasks.
  • Figure 3: Similarity (line plot) and variance (bar plot) of generated (Flow z) and target (Flow b) latents across different models. Each bar represents the average standard deviation of features per dimension; the line denotes cosine similarity between Flow z and Flow b.
  • Figure 4: Comparative Analysis of Feature Evolution Dynamics. Flow Z demonstrates an exploratory strategy characterized by early-stage feature exploration followed by rapid convergence. Flow B exhibits a progressive refinement pattern with incremental adjustments toward the target.
  • Figure 5: Architectural comparison across paradigms for universal vision modeling. (a) Traditional task-specific models adopt a latent-to-latent mapping, where each downstream task requires a dedicated architecture and training pipeline, leading to poor scalability and high deployment cost. (b) Fully generative approaches reconstruct visual outputs by denoising from random noise, treating all tasks as image generation. While flexible for dense prediction, they struggle with non-generative tasks (e.g., classification, retrieval) and often produce semantically inconsistent predictions without strong conditioning. (c) Our method leverages a self-supervised tokenizer to extract semantic-aware tokens, then employs flow matching to model the dynamic transformation from tokens to task-specific latents. This enables precise, structured, and efficient knowledge routing across diverse vision tasks (generative, discriminative, and metric-based) within a universal framework.
  • ...and 2 more figures