Improving Chain-of-Thought Efficiency for Autoregressive Image Generation

Zeqi Gu, Markos Georgopoulos, Xiaoliang Dai, Marjan Ghazvininejad, Chu Wang, Felix Juefei-Xu, Kunpeng Li, Yujun Shi, Zecheng He, Zijian He, Jiawei Zhou, Abe Davis, Jialiang Wang

TL;DR

This work tackles the inefficiency of reasoning in chain-of-thought prompts used for autoregressive image generation by introducing ShortCoTI, a reinforcement learning framework that adaptively penalizes CoT length to reduce verbosity. Building on the T2I-R1 baseline, ShortCoTI combines a dynamic length penalty with multiple reward signals to maintain alignment and image quality while shortening the reasoning sequence by about $54\%$. Across GenEval and T2I-CompBench, ShortCoTI achieves equal or improved image fidelity and alignment metrics, reduces inference time, and preserves aesthetics, demonstrating that concise CoT can enhance efficiency without compromising visual output. The study also analyzes prompting templates, CoT necessity, and seed variability, highlighting practical considerations for deploying CoT-efficient multimodal generation in real-world settings.

Abstract

Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy -- a phenomenon we call visual overthinking -- which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.
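The abstract describes a length reward that adapts to an estimated per-task difficulty but gives no formula. As a rough illustration only, the sketch below shows one way a difficulty-scaled length penalty could be folded into a per-sample reward. It is not the paper's actual function: the names `estimate_difficulty`, `length_penalty`, and `shaped_rewards`, the difficulty proxy (one minus the mean task score of a sampled group), and the parameters `max_len` and `base_weight` are all assumptions made for this example.

```python
from statistics import mean


def estimate_difficulty(task_scores):
    """Hypothetical proxy: a prompt is 'hard' when its sampled group scores low.

    task_scores are assumed to lie in [0, 1]; the result is also in [0, 1].
    """
    return 1.0 - mean(task_scores)


def length_penalty(cot_len, max_len, difficulty, base_weight=0.5):
    """Hypothetical penalty: grows with CoT length, shrinks as difficulty rises."""
    frac = min(max(cot_len / max_len, 0.0), 1.0)  # clamp length fraction to [0, 1]
    return base_weight * (1.0 - difficulty) * frac


def shaped_rewards(task_scores, cot_lens, max_len=512):
    """Subtract the difficulty-scaled length penalty from each sample's task score."""
    difficulty = estimate_difficulty(task_scores)
    return [
        score - length_penalty(n_tokens, max_len, difficulty)
        for score, n_tokens in zip(task_scores, cot_lens)
    ]


# Easy prompt (all samples succeed): longer CoT is penalized more.
print(shaped_rewards([1.0, 1.0], [512, 256]))
# Hard prompt (all samples fail): no length penalty is applied.
print(shaped_rewards([0.0, 0.0], [512, 256]))
```

Under these assumptions, the penalty applies most strongly to prompts the model already handles well and is relaxed for prompts where longer reasoning may still help, which is one plausible reading of "scales according to an estimated difficulty".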

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: We observe that the reasoning CoT prompt in the T2I-R1 [t2i-r1] autoregressive image generation model often contains redundant information. To address this, we introduce ShortCoTI, the first approach aimed at improving reasoning efficiency. By incorporating a dynamic length penalty in the RL reward function, we achieve a $54\%$ improvement in reasoning efficiency on T2I-CompBench [huang2023t2i], as measured by token length, while also increasing accuracy by $1.14\%$. In this example, our method improves the T2I-CompBench score from 85.42 to 96.59.
  • Figure 2: We began by investigating whether we can manually shorten the reasoning prompt while maintaining the quality of images generated by T2I-R1 [t2i-r1]. We found that in many cases we can delete unnecessary sentences from the reasoning prompt while maintaining generation quality (columns (c) and (d), where red lines show what we delete with respect to the original CoT in (b)). However, using an off-the-shelf LLM such as Llama [dubey2024llama] to shorten the reasoning prompt often fails to preserve the key information needed for image generation, which degrades image quality. This motivates us to post-train the model end-to-end to improve CoT efficiency more intelligently.
  • Figure 3: Trained with reward functions that combine generation accuracy and a length penalty, our model achieves concise CoT while preserving, and in some cases even improving, image quality. In rows 1–3, the baseline T2I-R1 generates hallucinated or incorrect objects and details in the reasoning prompt. Excessive and irrelevant content increases generation difficulty, causing the model to overlook or misrepresent the desired objects and attributes specified in the prompt. In rows 4–6, we show that our training approach enhances the model’s overall ability to follow prompts. Even when the baseline T2I-R1’s CoT does not contain obvious errors, our model achieves higher prompt accuracy and improved image quality. The scores above the images are evaluated by the judge models in T2I-CompBench and agree with our visual findings. All results are generated with the same seed.
  • Figure 4: (a) Distribution of CoT length. (b) We plot the distribution of different scores for the standard deviation of the generated CoT length across 4 different prompt templates for inference-time scaling. With more templates to choose from, the score slightly improves. (c) We estimate the difficulty of the shape words in the shape subtask of T2I-CompBench by the effects of CoT.
  • Figure 5: Training Statistics of Our 4 Strategies.
  • ...and 2 more figures