SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou

TL;DR

SwimBird addresses the core problem of modality mismatch in multimodal reasoning by enabling dynamic switching among text-only, vision-only, and interleaved thinking. It introduces a hybrid autoregressive framework that unifies discrete text token generation with continuous visual embeddings, plus a dynamic visual-thought budget and a three-pattern SFT dataset (SwimBird-SFT-92K) for comprehensive supervision. Empirical results demonstrate state-of-the-art performance on both text-centric reasoning and vision-dense perception benchmarks, surpassing fixed-pattern multimodal methods. The approach offers a scalable path toward robust, query-adaptive multimodal reasoning with practical implications for vision-language systems.

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
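
The hybrid autoregressive formulation described above can be illustrated with a minimal training-objective sketch. It assumes cross-entropy on textual-thought positions (next-token prediction) and mean squared error on visual-thought positions (next-embedding prediction). The MSE choice, the `visual_weight` parameter, and the step dictionary layout are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def mean_squared_error(pred, target):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def hybrid_loss(steps, visual_weight=1.0):
    """Combine a text loss and a visual-embedding loss over one sequence.

    Each step is a dict (hypothetical layout, for illustration only):
      {"kind": "text",   "logits": [...], "target": int}
      {"kind": "visual", "pred": [...],   "target": [...]}
    """
    text_losses, visual_losses = [], []
    for step in steps:
        if step["kind"] == "text":
            # Next-token prediction: discrete target, cross-entropy.
            text_losses.append(
                softmax_cross_entropy(step["logits"], step["target"])
            )
        elif step["kind"] == "visual":
            # Next-embedding prediction: continuous target, MSE (assumed).
            visual_losses.append(
                mean_squared_error(step["pred"], step["target"])
            )
        else:
            raise ValueError(f"unknown step kind: {step['kind']!r}")
    text_term = sum(text_losses) / len(text_losses) if text_losses else 0.0
    visual_term = (
        sum(visual_losses) / len(visual_losses) if visual_losses else 0.0
    )
    return text_term + visual_weight * visual_term


example_steps = [
    {"kind": "text", "logits": [0.0, 0.0], "target": 0},
    {"kind": "visual", "pred": [1.0, 0.0], "target": [0.0, 0.0]},
]
example_loss = hybrid_loss(example_steps)
```

In this sketch a text-only trace contributes no visual term and a vision-only trace contributes no text term, so a single objective covers all three reasoning patterns.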

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: SwimBird enables query-adaptive multimodal reasoning by dynamically switching among text-only, vision-only, and interleaved vision-text modes. As illustrated, it avoids redundant latent steps on text-centric queries (Case 1), relies on latent visual thoughts for vision-dense spatial problems (Case 2), and interleaves visual grounding with textual deduction when both are needed (Case 3), mitigating modality mismatch and improving robustness.
  • Figure 2: SwimBird adopts a hybrid autoregressive formulation that performs next-token prediction for textual thoughts and switches to next-embedding prediction for visual thoughts. During inference, SwimBird performs query-adaptive multimodal reasoning by dynamically selecting among three modes conditioned on the input: text-only, vision-only, and interleaved vision-text reasoning.
  • Figure 3: Resolution-aware, dynamic latent-token budget.
  • Figure 4: Distribution of reasoning mode across different benchmarks for SwimBird.
  • Figure 5: Analysis of different reasoning-mode cases.
  • ...and 1 more figure
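
The three reasoning modes named in the captions, and a per-benchmark distribution of the kind Figure 4 reports, can be sketched as a small tally over reasoning traces. The sketch assumes each trace is already split into segments labeled `"text"` or `"visual"`. Those labels and the function names are hypothetical, and the page does not say how SwimBird marks segment boundaries.

```python
from collections import Counter

MODES = ("text-only", "vision-only", "interleaved")


def reasoning_mode(segments):
    """Classify one trace by the kinds of thought segments it contains.

    `segments` is a list of "text" / "visual" labels (hypothetical format).
    """
    kinds = set(segments)
    if not kinds:
        raise ValueError("empty trace")
    if not kinds <= {"text", "visual"}:
        raise ValueError(f"unknown segment kinds: {sorted(kinds)}")
    if kinds == {"text"}:
        return "text-only"
    if kinds == {"visual"}:
        return "vision-only"
    return "interleaved"  # both textual and visual segments are present


def mode_distribution(traces):
    """Fraction of traces in each reasoning mode."""
    counts = Counter(reasoning_mode(trace) for trace in traces)
    total = sum(counts.values())
    return {mode: counts[mode] / total for mode in MODES}


example_traces = [
    ["text"],
    ["visual", "visual"],
    ["text", "visual", "text"],
    ["text", "text"],
]
example_distribution = mode_distribution(example_traces)
```

Running the tally separately on the traces collected for each benchmark would give one distribution per benchmark.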