Modeling Cross-vision Synergy for Unified Large Vision Model

Shengqiong Wu, Lanhu Wu, Mingyang Bao, Wenhao Xu, Hanwang Zhang, Shuicheng Yan, Hao Fei, Tat-Seng Chua

TL;DR

PolyV is a unified LVM that achieves cross-vision synergy at both the architectural and training levels, establishing a unified framework for synesthetic visual reasoning and advancing toward truly synergistic LVMs.

Abstract

Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on 10 benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic LVMs. Project page: https://sqwu.top/PolyV.
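The abstract describes a sparse Mixture-of-Experts layer coordinated by a router. The sketch below shows generic top-k sparse routing for a single token, as an illustration of that general idea only. It is not PolyV's actual router: the function names (`route_token`, `moe_layer`), the linear router, the `top_k=2` default, and the toy experts are all assumptions introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, top_k=2):
    """Pick the top_k experts for one token (hypothetical linear router).

    Returns a list of (expert_index, gate_weight) pairs whose gates sum to 1.
    """
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Sparse selection: keep only the top_k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the kept probabilities so the gates sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_weights, experts, top_k=2):
    """Combine the outputs of the selected experts, weighted by their gates."""
    out = [0.0] * len(token)
    for idx, gate in route_token(token, router_weights, top_k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Toy experts standing in for modality-specialized networks (assumption).
toy_experts = [
    lambda t: [2.0 * x for x in t],
    lambda t: [x + 1.0 for x in t],
    lambda t: [-x for x in t],
]
toy_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_output = moe_layer([2.0, 1.0], toy_router, toy_experts, top_k=2)
```

Only the selected experts run for a given token, which is what makes the layer sparse; the renormalized gates decide how much each selected expert contributes to the output.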

Paper Structure (61 sections, 8 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: (a) Human perception integrates visual, spatial, and temporal cues synergistically, enabling reasoning across modalities. (b) Examples illustrate such synergy, inferring motion from static images and transferring 3D priors to improve video understanding.
  • Figure 2: An illustration of PolyV, where an MoE architecture is designed to enable synergistic learning across image, video, and 3D modalities. The fire icon denotes trainable parameters.
  • Figure 3: Illustration of the training stages. Stage-1(-1/2) enables the model to understand each vision modality. Stage-2(-1) introduces coarse-grained synergistic learning, in which video and 3D foundation models distill temporal and geometric priors into the MoE-LLM. During this process, the model generates latent synergy tokens wrapped in <synergy>, which are optimized with an MSE loss to align with the knowledge extracted from the foundation models, thereby fostering cross-modality reasoning.
  • Figure 4: Illustration of cross-vision synergy question-answer pairs. Inspired by wu2025usg, we use the universal scene graph built from image-video and image-3D (multi-view) data to construct object-/relation-level cross-synergy question-answer pairs, which are then used to train the model for fine-grained cross-vision synergy.
  • Figure 5: Token distribution across experts on different benchmarks, illustrating load balance and routing diversity that reflect PolyV's adaptive expert specialization.
  • ...and 4 more figures
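
The Figure 3 caption states that latent synergy tokens are optimized with an MSE loss to align with knowledge extracted from the foundation models. A minimal sketch of such an alignment loss follows. The source does not give tensor shapes, the reduction, or any projection between student and teacher spaces, so the function name `synergy_mse_loss`, the list-of-vectors layout, and the mean reduction over all elements are assumptions made for illustration.

```python
def synergy_mse_loss(student_tokens, teacher_features):
    """Mean squared error between student synergy tokens and teacher features.

    Both arguments are lists of equal-length float vectors (assumed layout);
    the loss is averaged over every element (assumed reduction).
    """
    if len(student_tokens) != len(teacher_features):
        raise ValueError("student and teacher must have the same number of tokens")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student_tokens, teacher_features):
        if len(s_vec) != len(t_vec):
            raise ValueError("token dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("inputs must not be empty")
    return total / count


# Toy values only; real features would come from the teacher models.
toy_loss = synergy_mse_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

Minimizing this quantity pulls each generated synergy token toward the corresponding teacher feature, which is the sense in which the teacher's priors are distilled into the student.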