Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Jianjiang Yang; Ziyan Huang; Yanshu Li; Da Peng; Huaiyuan Yao

TL;DR

This work reframes hallucinations in text-to-image diffusion as trajectory drift within a latent cognitive space, introducing the Hallucination Tri-Space $\mathcal{T}^3$ and the Alignment Risk Code $\vec{\tau}(p,t)$ to quantify tensions across semantic coherence, structural alignment, and knowledge grounding. It then proposes TM-ARC, a lightweight latent-space controller that uses axis-specific corrections to keep generations on the prompt-aligned manifold $\mathcal{M}_{ideal}$, guided by real-time ARC signals. Through unsupervised ARC clustering, ablation studies, and cross-backbone evaluations, the approach demonstrates reduced hallucinations without sacrificing quality or diversity, and shows strong generalization across backbones like SDXL, SD1.5, PixArt-sigma, and Hunyuan-DiT. Overall, the paper offers an interpretable, tension-aware framework for diagnosing and mitigating misalignments in diffusion-based T2I models, with potential applications to safer and more faithful image synthesis.

Abstract

Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three critical axes: semantic coherence, structural alignment, and knowledge grounding. We formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
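To make the three ARC readouts concrete, the following is a minimal sketch of how magnitude, dominant axis, and imbalance could be derived from a three-component tension vector. It is an illustration only, not the paper's formulation: the function name `summarize_arc`, the axis labels, the Euclidean norm for magnitude, and the max-minus-min spread for imbalance are all assumptions made here, and the input scores are made-up values on a hypothetical scale.

```python
import math

# Hypothetical axis labels, ordered as in the abstract.
AXES = ("semantic", "structural", "knowledge")


def summarize_arc(tau):
    """Summarize a 3-component tension vector (one score per axis).

    Assumed definitions (not taken from the paper):
      magnitude     -> Euclidean norm of tau
      dominant_axis -> axis with the largest score
      imbalance     -> spread between the largest and smallest score
    """
    if len(tau) != len(AXES):
        raise ValueError("expected one tension score per axis")
    magnitude = math.sqrt(sum(x * x for x in tau))
    dominant_index = max(range(len(tau)), key=lambda i: tau[i])
    imbalance = max(tau) - min(tau)
    return {
        "magnitude": magnitude,
        "dominant_axis": AXES[dominant_index],
        "imbalance": imbalance,
    }


# Made-up scores for a single sampling step.
summary = summarize_arc([0.3, 0.4, 0.0])
print(summary)
```

Under these assumed definitions, a controller in the spirit of TM-ARC would read `dominant_axis` to decide which axis-specific correction to apply and `magnitude` to decide how strongly to apply it.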
