Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Jianjiang Yang, Ziyan Huang, Yanshu Li, Da Peng, Huaiyuan Yao

TL;DR

This work reframes hallucinations in text-to-image diffusion as trajectory drift within a latent cognitive space, introducing the Hallucination Tri-Space $\mathcal{T}^3$ and the Alignment Risk Code $\vec{\tau}(p,t)$ to quantify tensions across semantic coherence, structural alignment, and knowledge grounding. It then proposes TM-ARC, a lightweight latent-space controller that uses axis-specific corrections to keep generations on the prompt-aligned manifold $\mathcal{M}_{ideal}$, guided by real-time ARC signals. Through unsupervised ARC clustering, ablation studies, and cross-backbone evaluations, the approach demonstrates reduced hallucinations without sacrificing quality or diversity, and shows strong generalization across backbones like SDXL, SD1.5, PixArt-sigma, and Hunyuan-DiT. Overall, the paper offers an interpretable, tension-aware framework for diagnosing and mitigating misalignments in diffusion-based T2I models, with potential applications to safer and more faithful image synthesis.

Abstract

Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
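
The three readings of ARC described above (magnitude, direction, imbalance) can be illustrated with a minimal sketch. It assumes the ARC at one timestep is a plain 3-component vector ordered as (SC, SA, KG). The helper name `summarize_arc`, the choice of the largest absolute component as the "dominant axis", and the use of the standard deviation of component magnitudes as the "imbalance" score are all illustrative assumptions, not definitions taken from the paper.

```python
import math

# Axis order is an assumption: semantic coherence, structural alignment,
# knowledge grounding.
AXES = ("SC", "SA", "KG")


def summarize_arc(tau):
    """Summarize a 3-component tension vector (hypothetical helper)."""
    if len(tau) != len(AXES):
        raise ValueError("expected one tension value per axis")

    # Overall misalignment: Euclidean norm of the tension vector.
    magnitude = math.sqrt(sum(c * c for c in tau))

    # Dominant failure axis: the component with the largest absolute value.
    dominant_index = max(range(len(AXES)), key=lambda i: abs(tau[i]))

    # Assumed imbalance score: population standard deviation of the
    # component magnitudes (0 when tension is spread evenly).
    mags = [abs(c) for c in tau]
    mean = sum(mags) / len(mags)
    imbalance = math.sqrt(sum((m - mean) ** 2 for m in mags) / len(mags))

    return {
        "magnitude": magnitude,
        "dominant_axis": AXES[dominant_index],
        "imbalance": imbalance,
    }


summary = summarize_arc([0.3, 0.4, 0.0])
balanced = summarize_arc([1.0, 1.0, 1.0])
```

Under these assumptions, a vector with evenly spread tension gets an imbalance of zero, while tension concentrated on one axis raises the score and names that axis as dominant. A controller such as TM-ARC could use the dominant axis to choose which correction to apply, though the paper's actual rule is not given in this summary.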

Paper Structure

This paper contains 47 sections, 12 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Visualizing hallucination as trajectory drift in latent alignment space in T2I models. Successful generations (middle row) follow a coherent sampling trajectory from noise to data, remaining close to the prompt intent. Hallucinatory generations (bottom row) exhibit tension drift, leading to semantic and structural deviations.
  • Figure 2: Unsupervised clustering of alignment deviation vectors in the Hallucination Tri-Space. (a) 3D t-SNE embedding of SC/SA/KG drift magnitudes shows three discernible clusters with realistic overlap, each corresponding to one dominant misalignment axis. (b) 2D projection of the same 3D embedding preserves the overall cluster structure despite mild inter-cluster mixing. (c) A truly random noise baseline yields no meaningful grouping, confirming that the observed clustering arises from structured alignment tensions rather than chance.
  • Figure 3: The figure illustrates how imbalanced semantic, structural, and knowledge tensions drive trajectory drift in T2I generation. The Alignment Risk Code (ARC) captures real-time multiaxial tension, enabling interpretable modeling and dynamic hallucination mitigation.
  • Figure 4: ARC Dynamics. (top) Total tension $\|\vec{\tau}\|$ across timesteps for two example prompts; (bottom) Component-wise trajectories showing tension concentration patterns that predict semantic vs. structural hallucinations.
  • Figure 5: Overview of hallucination modeling and ARC-guided control in T2I generation. Given a prompt $p$, the T2I model undergoes iterative denoising, where semantic (SC), structural (SA), and knowledge (KG) alignment tensions dynamically evolve within the Hallucination Tri-Space $\mathcal{T}^3$. Misregulated tension leads to a trajectory drift $\Delta t$ from the ideal generative path. The Alignment Risk Code (ARC) encodes real-time multiaxial tension and guides the Tension Modulator (TM-ARC) to inject adaptive controls via SC-Gate, SA-Tuner, and KG-Aug modules for hallucination mitigation.
  • ...and 1 more figure