
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

Abstract

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Even when the MLLM encoding is correct, this prevents the DiT from progressively decomposing complex instructions into actionable denoising steps. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework with two components. First, an iterative thought guidance module activates the MLLM's reasoning potential by iteratively refining latent thought states and bridges these states to the DiT's denoising process. Second, a terminal thought grounding module keeps the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. In extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.

Paper Structure (30 sections, 13 equations, 24 figures, 7 tables, 4 algorithms)

Figures (24)

  • Figure 1: EndoCoT enables endogenous chain-of-thought reasoning. (a) Radar plot showing that EndoCoT outperforms baselines across all benchmarks. (b) On visual reasoning tasks requiring generalization (maze size, Sudoku font), previous work [he2025diffthinker] fails on novel domains, while EndoCoT consistently generalizes correctly. (c) Vanilla denoising (left) commits to solutions early without reasoning, while our approach (right) enables interpretable, step-by-step reasoning chains.
  • Figure 2: (a) Layer-wise sensitivity across Vision Encoder, LLM, and DiT components (red: high sensitivity, white: low sensitivity). (b) Limited single-step reasoning: the DiT performs spatial grounding, but the trajectory violates constraints. (c) Static-guidance failure: dense topologies cause attention entropy to become diffuse.
  • Figure 3: Overview of EndoCoT. (a) Training: we propose a progressive two-stage training strategy. The first stage trains the model to fit both intermediate and final states at each reasoning step, capturing the full multi-step trajectory. The second stage freezes gradients on intermediate states and optimizes only the terminal state, refining generation quality while preserving the learned reasoning dynamics. (b) Inference: the model iteratively updates latent representations.
  • Figure 4: Overview of notations and iterative thought guidance module. EndoCoT iteratively refines latent states $\mathbf{h}_\tau$ through the MLLM $f_\phi$, then conditions the DiT $f_\psi$ at each reasoning step $\tau$ to generate intermediate visual outputs $\mathbf{I}_\tau$.
  • Figure 5: Step-by-step reasoning process across four distinct tasks. Our model incrementally resolves complex visual reasoning tasks through intermediate reasoning steps. For each task (Maze, TSP, Sudoku, VSP), we show the initial input (leftmost), intermediate refinement steps, and the final optimal solution (rightmost).
  • ...and 19 more figures
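
The loop described in the Figure 4 caption, together with the two-stage objective from the Figure 3 caption, can be sketched roughly as follows. This is a toy illustration and not the authors' implementation: `refine` and `denoise` are hypothetical stand-ins for the MLLM $f_\phi$ and the DiT $f_\psi$, their call signatures are assumptions, and summing per-step losses in stage one is an assumed aggregation that the captions do not specify.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def endo_cot_inference(
    h0: Sequence[float],
    refine: Callable[[Vector], Vector],   # hypothetical stand-in for the MLLM f_phi
    denoise: Callable[[Vector], Vector],  # hypothetical stand-in for the DiT f_psi
    num_steps: int,
) -> Tuple[Vector, List[Vector]]:
    """Alternate latent refinement and conditioned generation.

    At each reasoning step tau, the latent state h_tau is refined and then
    used to condition generation of an intermediate output I_tau.
    """
    h = list(h0)
    outputs: List[Vector] = []
    for _ in range(num_steps):
        h = refine(h)               # h_tau from h_{tau-1} (assumed signature)
        outputs.append(denoise(h))  # I_tau conditioned on h_tau (assumed signature)
    return h, outputs


def trajectory_loss(step_losses: Sequence[float], stage: int) -> float:
    """Combine per-step losses according to the training stage.

    Stage 1 supervises every step (intermediate and final); stage 2 keeps
    only the terminal step. Summation in stage 1 is an assumption.
    """
    if not step_losses:
        raise ValueError("step_losses must be non-empty")
    if stage == 1:
        return sum(step_losses)
    if stage == 2:
        return step_losses[-1]
    raise ValueError("stage must be 1 or 2")


# Toy stand-ins so the sketch runs end to end; they carry no modeling meaning.
toy_refine = lambda h: [0.5 * x + 1.0 for x in h]
toy_denoise = lambda h: [2.0 * x for x in h]

final_h, intermediate_outputs = endo_cot_inference([0.0], toy_refine, toy_denoise, 3)
```

Here `intermediate_outputs` holds one entry per reasoning step, mirroring the intermediate visual outputs $\mathbf{I}_\tau$ in the caption, and `trajectory_loss(losses, stage=2)` ignores every entry except the terminal one.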