StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

Yuanhuiyi Lyu; Kaiyu Lei; Ziqiao Weng; Xu Zheng; Lutao Jiang; Teng Li; Yangfu Li; Ziyuan Huang; Linfeng Zhang; Xuming Hu

TL;DR

StruVis is a framework that enhances T2I generation through Thinking with Structured Vision: it uses text-based structured visual representations as intermediate reasoning states, enabling the MLLM to "perceive" visual structure within a purely text-based reasoning process.

Abstract

Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks fall into two broad categories: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context and therefore often omits critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which uses a T2I generator to provide visual references during reasoning. The interleaved approach improves visual grounding, but it incurs substantial computational cost and constrains the reasoning capacity of the multimodal large language model (MLLM) to the representational limits of the generator. To address both limitations, we propose StruVis, a framework that enhances T2I generation through Thinking with Structured Vision. Instead of generating intermediate images, StruVis uses text-based structured visual representations as intermediate reasoning states, enabling the MLLM to "perceive" visual structure within a purely text-based reasoning process. This structured-vision-guided reasoning unlocks the MLLM's reasoning potential for T2I generation. Because StruVis is a generator-agnostic reasoning framework, it can be integrated with diverse T2I generators and efficiently improves their performance on reasoning-based T2I generation. Extensive experiments show that StruVis achieves significant improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.
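To make the idea of a text-based structured visual representation concrete, the following is a minimal illustrative sketch. The abstract does not specify the actual format StruVis uses, so everything here is an assumption: the `SceneObject` type, the normalized bounding-box layout, the `<scene>` markup, and the `is_left_of` check are hypothetical names chosen only to show how spatial structure could be written down, and checked, as plain text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    """Hypothetical scene element; not the format used in the paper."""
    name: str
    # (x_min, y_min, x_max, y_max), normalized to [0, 1] with the origin at top-left.
    box: Tuple[float, float, float, float]
    attributes: List[str] = field(default_factory=list)


def serialize_scene(objects: List[SceneObject]) -> str:
    """Render the layout as text that a language model could read or emit."""
    lines = ["<scene>"]
    for obj in objects:
        x0, y0, x1, y1 = obj.box
        attrs = ", ".join(obj.attributes) if obj.attributes else "none"
        lines.append(
            f'  <object name="{obj.name}" '
            f'box="{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f}" attrs="{attrs}"/>'
        )
    lines.append("</scene>")
    return "\n".join(lines)


def is_left_of(a: SceneObject, b: SceneObject) -> bool:
    """True when a's right edge does not pass b's left edge."""
    return a.box[2] <= b.box[0]


# Example layout for a prompt such as "a red apple to the left of a glass of water".
apple = SceneObject("apple", (0.10, 0.50, 0.35, 0.80), ["red"])
glass = SceneObject("glass of water", (0.60, 0.35, 0.80, 0.80), ["transparent"])
layout_text = serialize_scene([apple, glass])
```

In a sketch like this, a spatial constraint from the prompt can be verified on the text layout alone (for example with `is_left_of(apple, glass)`), without rendering any intermediate image.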

Paper Structure (17 sections, 8 equations, 6 figures, 3 tables)


Introduction
Related Work
RL-based Reasoning
Reasoning-based T2I Generation
Methodology
Problem Formulation
Data Construction
The Proposed StruVis Framework
Implementation
Experiments
Reasoning-based T2I Generation Benchmarks
Quantitative Results
Qualitative Results
Ablation Study
Ablation of Reward Functions
...and 2 more sections

Figures (6)

Figure 1: Overview of (a) Text-Only Reasoning, (b) Text-Image Interleaved Reasoning, and (c) our Thinking with Structured Vision.
Figure 2: Overview of the data pipeline for collecting our StruVis-CoT data.
Figure 3: Overview of the GRPO training stage. We design three reward functions to train StruVis: format, understanding, and image rewards.
Figure 4: Visual comparison of StruVis and the baselines, showing the final generated results on T2I-ReasonBench.
Figure 5: Visual comparison of StruVis and the baselines, showing the final generated results on the WISE benchmark.
...and 1 more figures
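
The Figure 3 caption names three rewards (format, understanding, image) used in GRPO training. As a rough sketch of how such rewards could feed a GRPO-style update, the code below combines them and standardizes the result within a group of sampled responses. The weighted sum, the default weights, and the function names are assumptions made for illustration; the page does not state how the paper combines its rewards. Only the group-relative standardization (subtract the group mean, divide by the group standard deviation) reflects the general GRPO recipe.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def combined_reward(
    format_r: float,
    understanding_r: float,
    image_r: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # assumed weights, not from the paper
) -> float:
    """Weighted sum of the three reward terms (an assumed combination rule)."""
    w_f, w_u, w_i = weights
    return w_f * format_r + w_u * understanding_r + w_i * image_r


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled reasoning traces for one prompt, each scored by the three rewards.
group_rewards = [
    combined_reward(1.0, 0.2, 0.3),
    combined_reward(1.0, 0.6, 0.5),
    combined_reward(0.0, 0.4, 0.4),
]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above the group average receive a positive advantage and those below receive a negative one, so no separate value model is needed to form the baseline.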
