Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li

TL;DR

Uni-CoT tackles the gap in multimodal chain-of-thought by introducing a two-level hierarchical reasoning framework that separates global planning from subtask execution. The Macro-Level CoT provides high-level task decomposition, while the Micro-Level CoT uses a Markov Decision Process with self-reflection to robustly execute subtasks, all trained with a combination of supervised fine-tuning and reinforcement learning. Empirical results on GenEval, WISE, KRIS, and RISE demonstrate strong open-source performance and improved interpretability, while the architecture remains efficient enough to train on 8 A100 GPUs. This work advances scalable, coherent vision-language reasoning and lays a foundation for future unified multimodal reasoning systems. It also provides practical insights into architecture design, data curation, and training paradigms for complex multimodal tasks.

Abstract

Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this, either because they have limited capacity to model visual state transitions or because fragmented architectures produce incoherent visual trajectories. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to do so is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multimodal reasoning. Thanks to this design, all experiments can be completed efficiently using only 8 A100 GPUs with 80GB VRAM each. Experimental results on the reasoning-driven image generation benchmark (WISE) and the editing benchmarks (RISE and KRIS) indicate that Uni-CoT achieves SOTA performance and strong generalization, establishing it as a promising solution for multimodal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/

Paper Structure

This paper contains 37 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The multi-modal reasoning trajectory of Uni-CoT. Uni-CoT extends Chain-of-Thought to the multi-modal domain, enabling a unified model to perform coherent, grounded, and step-by-step reasoning across text and images. For more results, see Figure \ref{fig:supp_visualization_breakdown}.
  • Figure 2: Overview of the Uni-CoT framework. Uni-CoT consists of two complementary reasoning branches: (1) Macro-Level CoT, which decomposes a complex task into simpler subtasks and aggregates their outcomes to synthesize the final result. To reduce learning and computational overhead, intra-subtask reasoning is kept implicit. This process is enforced through a macro attention mask that reveals only the system prompt, high-level plans, and subtask outputs. (2) Micro-Level CoT, which executes individual subtasks while filtering out irrelevant information. It is modeled as a Markov Decision Process (MDP), where each reasoning and self-reflection step depends solely on the previous state and the current prompt. This process is enforced through a micro attention mask that restricts visibility to the last state and current instruction. High-resolution depictions of the macro and micro attention masks are shown in Figure \ref{fig:unicot_mask}.
  • Figure 3: MDP-based reasoning architecture. (a) Overview of the sequential MDP process for multi-modal reasoning. (b) Architecture of a single MDP step $(s_t, a_t, s_{t+1}, r_{t+1})$. The transition from one state to the next is guided by the subtask instruction, with the learnable content highlighted in pink.
  • Figure 4: Qualitative Results for Reliable Image Generation. Uni-CoT demonstrates impressive image generation capabilities on complex, abstract, and reasoning-intensive prompts. Notably, these results are achieved through joint image-text reasoning, where Uni-CoT iteratively evaluates the current visual state, provides textual instructions for modification, and then executes those modifications.
  • Figure 5: Qualitative Results for Reliable Image Editing. Uni-CoT demonstrates considerable image editing abilities, further supporting the effectiveness of its micro-level CoT reasoning. It can generate textual editing instructions and modify the current visual state accordingly.
  • ...and 4 more figures
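
The macro and micro attention masks described in the Figure 2 caption can be illustrated with a small segment-level sketch. This is not the authors' implementation: the segment labels (`system`, `plan`, `reasoning`, `output`, `instruction`, `state`), the function names, and the choice to let every segment attend to itself are assumptions made for illustration, and a real model would build these masks at token granularity.

```python
# Hypothetical segment-level sketch of the two masks; all names are assumptions.
# mask[i][j] is True when segment i may attend to segment j.

# Assumed labels for what the macro mask reveals: system prompt,
# high-level plans, and subtask outputs.
MACRO_VISIBLE = {"system", "plan", "output"}


def macro_mask(kinds):
    """Causal mask that hides intra-subtask reasoning from later segments."""
    n = len(kinds)
    return [
        [j <= i and (j == i or kinds[j] in MACRO_VISIBLE) for j in range(n)]
        for i in range(n)
    ]


def micro_mask(segments):
    """Markov-style mask over (kind, step) pairs.

    A state at step t sees the instruction, the state at step t - 1,
    and itself; earlier states are hidden.
    """
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for i, (kind_i, step_i) in enumerate(segments):
        for j in range(i + 1):  # causal: only current and earlier segments
            kind_j, step_j = segments[j]
            is_self = j == i
            is_instruction = kind_j == "instruction"
            is_last_state = (
                kind_i == "state" and kind_j == "state" and step_j == step_i - 1
            )
            mask[i][j] = is_self or is_instruction or is_last_state
    return mask


# Macro level: two subtasks, each with hidden reasoning and a visible output.
kinds = ["system", "plan", "reasoning", "output", "reasoning", "output"]
macro = macro_mask(kinds)

# Micro level: one instruction followed by three successive states.
segments = [("instruction", 0), ("state", 0), ("state", 1), ("state", 2)]
micro = micro_mask(segments)
```

In this sketch the last macro row attends to the system prompt, the plan, and the first subtask's output but skips both reasoning segments, while the last micro row attends only to the instruction, the immediately preceding state, and itself.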